CPSC 418: Assignment #5 Solutions
1.
a) offset:
4 words/block = 16 bytes/block = 2^4 bytes/block
-> 4 bit offset
index:
#blocks/bank = cache size/(block size * set assoc.)
= 8192/(16*2)
= 256 = 2^8
-> 8 bit index
tag:
#tag bits = 32 - #offset bits - #index bits
= 32 - 4 - 8
= 20 bits
-> 20 bit tag
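As a sketch, the bit breakdown above can be checked with a short Python script (parameters are those assumed in this part: an 8KB, 2-way set-associative cache with 16-byte blocks and 32-bit addresses; the function name is mine):

```python
def address_bits(cache_bytes, block_bytes, assoc, addr_bits=32):
    """Return (offset_bits, index_bits, tag_bits) for a set-associative cache.

    Assumes all sizes are powers of two, so (n - 1).bit_length() == log2(n).
    """
    offset_bits = (block_bytes - 1).bit_length()      # log2(block size)
    num_sets = cache_bytes // (block_bytes * assoc)   # blocks per bank
    index_bits = (num_sets - 1).bit_length()          # log2(#sets)
    tag_bits = addr_bits - offset_bits - index_bits
    return offset_bits, index_bits, tag_bits

print(address_bits(8192, 16, 2))   # (4, 8, 20)
```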
--------------------------
b) #overhead bits/block = #tag bits + #dirty bit + #valid bit
= 20 + 1 + 1
= 22 bits
#overhead bits/set = (#overhead bits/block)*set assoc. + #MRU bit
= 22*2 + 1
= 45 bits
total # bits/set = #overhead bits/set + #data bits
= 45 + 2 * 4 words/block * 4 bytes/word * 8 bits/byte
= 45 + 256
= 301 bits
% of overhead bits = 45/301 = 15%
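The overhead arithmetic can be reproduced in a few lines of Python (same assumptions as above: 2-way sets, 4-word blocks, 20-bit tags, one dirty and one valid bit per block, one MRU bit per set):

```python
tag_bits, dirty, valid, mru = 20, 1, 1, 1
assoc, words_per_block, bits_per_word = 2, 4, 32

overhead_per_block = tag_bits + dirty + valid            # 22 bits
overhead_per_set = overhead_per_block * assoc + mru      # 45 bits
data_per_set = assoc * words_per_block * bits_per_word   # 256 bits
total_per_set = overhead_per_set + data_per_set          # 301 bits
print(round(100 * overhead_per_set / total_per_set))     # 15 (%)
```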
-----------------------
c) For the explanation, refer to
Solution for question (1a) of HW#8 from Spring 1996. In addition
to the description there, the displaced cache line must be written
back to main memory because it is dirty. Between deciding which cache
line to evict and loading the new line, the cache must send the dirty
line to memory: construct the write-back address from the line's tag
and set index, send the address to the MMU, and write the data over
the bus.
Cycles required = Cache Access + Write Block to Mem +
Read Block from Mem + Cache Access
= 1 + (10 + 2) + (10 + 2) + 1
= 26 cycles
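The cycle count can be sketched as follows (assumed latencies, matching the sum above: a 1-cycle cache access, a 10-cycle memory access, and 2 bus cycles per block transfer):

```python
cache_access, mem_access, bus_transfer = 1, 10, 2

write_back = mem_access + bus_transfer   # flush the dirty line to memory
line_fill  = mem_access + bus_transfer   # fetch the new line from memory

# initial access (miss detected) + write-back + fill + retried access
total = cache_access + write_back + line_fill + cache_access
print(total)   # 26
```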
-----------------------
d) Here are the six cases:
cache hit read
cache hit write
cache miss read with clean eviction
cache miss read with dirty eviction
cache miss write requiring write allocate with clean eviction
cache miss write requiring write allocate with dirty eviction
T_Avg = (%Hit1)((%read)T_Acc1 + (%write)(T_Acc1 + T_Write1)) +
(%Miss1)((%read)(T_Acc1 + T_Fill1 + (%Dirty1)T_WB1) +
(%write)(T_Acc1 + (T_Fill1 + T_Write1) + (%Dirty1)T_WB1))
= 0.90(0.67(1) + 0.33(1 + 1)) +
0.10(0.67(1 + 13 + 0.25(12)) +
0.33(1 + (13 + 1) + 0.25(12)))
= 1.197 + 1.733
= 2.93
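The six-case average can be written out as a small script (numbers are those used in the solution: 90% hit rate, 67% reads, 1-cycle access and write, 13-cycle fill, 12-cycle write-back, and 25% of evicted lines dirty):

```python
hit, read = 0.90, 0.67
acc, wr, fill, wb, dirty = 1, 1, 13, 12, 0.25
miss, write = 1 - hit, 1 - read

# hits: a read costs one access; a write costs an access plus the write
t_hit = read * acc + write * (acc + wr)
# misses: both cases pay the fill, and 25% also pay a write-back
t_miss = (read  * (acc + fill + dirty * wb) +
          write * (acc + fill + wr + dirty * wb))

t_avg = hit * t_hit + miss * t_miss
print(round(t_avg, 2))   # 2.93
```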
------------------------------------------------------------------------
2.
a) With 32MBytes of memory divided into 4 banks, each bank is 8MBytes.
So we will need (8MBytes * 8 bits/byte) / 4Mbit = 16 chips.
b) 10 lines carry the row address for the chips and 10 carry the
column address (both sets are shared by all banks); 2 lines select
the bank, and 3 select the byte within the bank.
c) No, with a 64 bit wide bus, you will need 64/4 = 16 chips per bank.
With 4 banks, you will get 4Mbit * 16 * 4 = 256Mbit = 32MByte.
d) No, with a 64 bit wide bus, you will need 64/16 = 4 chips per bank.
With 4 banks, you will get 16Mbit * 4 * 4 = 256Mbit = 32MByte.
e) Yes, with a 64 bit wide bus, you will need 64/8 = 8 chips per bank.
With 4 banks, you will get 4Mbit * 8 * 4 = 128Mbit = 16MBytes.
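The chip-count arithmetic for parts (c) through (e) follows one pattern, sketched here (assumes a 64-bit bus and 4 banks, with each chip's data width equal to its bits contributed per bus word; the helper name is mine):

```python
def config(chip_mbit, chip_width_bits, banks=4, bus_bits=64):
    """Return (chips per bank, total MBytes) for a given DRAM chip type."""
    chips_per_bank = bus_bits // chip_width_bits
    total_mbit = chip_mbit * chips_per_bank * banks
    return chips_per_bank, total_mbit // 8   # Mbit -> MByte

print(config(4, 4))    # part (c): 16 chips/bank, 32 MB
print(config(16, 16))  # part (d): 4 chips/bank, 32 MB
print(config(4, 8))    # part (e): 8 chips/bank, 16 MB
```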
f) #processor cycles to fill L2 cache line =
((#cycle_transmit_addr + #cycle_for_data * #banks)*(memory cycle time) +
RAS time + CAS time) / processor cycle time
= ((1 + 1*4)*(1/50MHz * 10^9) + 60 + 30) / (1/200MHz * 10^9)
= 190 / 5
= 38 processor cycles
g) #processor cycles to fill L2 cache line =
= ((2*(1 + 1*4)*(1/50MHz * 10^9)) + 60 + 30 + 150) / (1/200MHz * 10^9)
= 440 / 5
= 88 processor cycles
NOTE: cycle time includes RAS and CAS for first access.
h) Same as in (g), but FPM allows multiple CAS for every RAS.
Therefore, there is no full cycle time needed, nor a second RAS:
= ((2*(1 + 1*4)*(1/50MHz * 10^9)) + 60 + 30 + 30) / (1/200MHz * 10^9)
= 320 / 5
= 64 processor cycles
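The three timing cases in (f)-(h) can be checked with the script below (assumed timing, taken from the calculations above: 50 MHz memory bus = 20 ns/cycle, 200 MHz processor = 5 ns/cycle, 60 ns RAS, 30 ns CAS, 150 ns full DRAM cycle time, 4 interleaved banks, one bus cycle for the address and one per data beat):

```python
BUS_NS, CPU_NS = 20, 5           # bus and processor cycle times (ns)
RAS, CAS, CYCLE = 60, 30, 150    # DRAM timing parameters (ns)

def cpu_cycles(ns):
    """Convert a latency in ns to whole processor cycles."""
    return ns // CPU_NS

# (f) one access: address cycle + 4 interleaved data beats + RAS + CAS
f = cpu_cycles((1 + 4) * BUS_NS + RAS + CAS)
# (g) two accesses; the second pays a full DRAM cycle time
g = cpu_cycles(2 * (1 + 4) * BUS_NS + RAS + CAS + CYCLE)
# (h) fast-page mode: the second access needs only another CAS
h = cpu_cycles(2 * (1 + 4) * BUS_NS + RAS + CAS + CAS)

print(f, g, h)   # 38 88 64
```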
--------------------------------------------------------------------------
Question 3:
a) (12 marks)
Adding up the percentages in the table, load/stores are 31.7% of
instructions in this trace. Cache misses are ~10% of load/stores.
Therefore 0.317 * 0.1 = 0.0317 of instructions miss in the L1 cache.
After a miss, the penalty depends on how long until the next access.
With a base CPI of 1/3 and a miss penalty of 10 cycles, a load/store
on the very next instruction sees a penalty of 10 - 1/3 = 9 2/3
cycles (subtract the 1/3 of a cycle spent executing that intervening
instruction). The average L1 miss penalty will be:
0.136 * 29/3 + 0.047 * 28/3 + 0.034 * 27/3
+ 0.039 * 26/3 + 0.021 * 25/3 + 0.011 * 24/3
+ 0.007 * 23/3 + 0.007 * 22/3 + 0.003 * 21/3
+ 0.004 * 20/3 + 0.003 * 19/3 + 0.001 * 18/3
+ 0.001 * (17 + 16 + 15 + 14) / 3 = 2.859 cycles
So CPI penalty will be
0.0317 miss / instr * 2.859 cycles / miss = 0.0906 cycles / instr
The processor may be an advanced out-of-order-issue design (enabling
it to continue executing instructions during a cache miss), but the
blocking cache increases its CPI by (0.0906 / 0.33333) = 27.2%
This experiment would suggest that adding a non-blocking cache would
remove a significant bottleneck in this processor's performance.
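The weighted average above can be written out as a short script (the distance distribution is the one tabulated in the solution: the fraction of misses whose next load/store comes k instructions later, for k = 1..16):

```python
BASE_CPI = 1 / 3      # one instruction completes every 1/3 cycle
MISS_PENALTY = 10     # cycles the cache blocks on a miss

# fraction of misses at each distance k = 1, 2, ... instructions
dist = [0.136, 0.047, 0.034, 0.039, 0.021, 0.011,
        0.007, 0.007, 0.003, 0.004, 0.003, 0.001,
        0.001, 0.001, 0.001, 0.001]

# a second access k instructions later overlaps k * BASE_CPI cycles of
# useful work, so its penalty is MISS_PENALTY - k * BASE_CPI
avg_penalty = sum(p * (MISS_PENALTY - k * BASE_CPI)
                  for k, p in enumerate(dist, start=1))

miss_rate = 0.317 * 0.1            # load/store fraction * L1 miss rate
cpi_penalty = miss_rate * avg_penalty
print(round(avg_penalty, 3), round(cpi_penalty, 4))   # 2.859 0.0906
```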
b) (5 for each of at least 2 methodology problems, and 3 for drawing a
conclusion about the analysis)
Problems with my methodology and conclusions:
- cc may not be the type of program you normally run. It may have
  more or fewer load/stores, which may be closer together or farther
  apart than in the average program. Fixing this problem requires
  collecting data on more traces.
- the only instruction dependency playing a factor in this analysis is
the single cache port. It could be that cache misses already cause
a large number of stall cycles, just because subsequent instructions
need the result of the miss. Adding a non-blocking cache would not
help these cases -- such cache miss penalty cycles should not be
counted as caused by the blocking cache but rather instruction
dependencies. Correcting this problem will require a superscalar
simulation of the processor's execution.
- the non-blocking cache so far discussed is merely able to satisfy
more cache hits while a cache miss is being handled in the background.
My analysis takes no account of multiple cache misses occurring while
the first is being serviced. Nor does the analysis take account of an
access to a word on the same line as the first miss. Correcting these
inadequacies in the experiment would require a cache simulator.
There are probably even more problems with my experiment.
From these three complaints, my experiment both inflates the penalty
due to the blocking cache and inflates the expected benefits of a
non-blocking cache. Furthermore, I did not even come close to a
thorough examination of user application behaviour. My experiment is
not a proper analysis of the disadvantages of a blocking cache.
--------------------------------------------------------------------------
Question 4: (10 points each)
The merits of each answer will have to be judged individually. A good
answer will describe a user application with some specific behaviour,
and how the particular processor is well designed to handle that
behaviour.
As an example (student's answers should be longer), an application
with a heavily used working set larger than 8KB but smaller than 256KB
would benefit substantially from the large fast L2 cache on the P6.
An example of such an application might be a scrollable terminal
window with a few hundred lines of text in which the user is scrolling
back and forth.
--------------------------------------------------------------------------
Last modified: Apr 97