
CPSC 418: Assignment #5 Solutions

1.
a) offset:
	4 words/block = 16 bytes/block = 2^4 bytes/block
	-> 4 bit offset

   index:
	#sets = cache size/(block size * set assoc.)
	      = 8192/(16*2)
	      = 256 = 2^8
	-> 8 bit index

   tag:
	#tag bits = 32 - #offset bits - #index bits
	          = 32 - 4 - 8
	          = 20 bits
	-> 20 bit tag
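The offset/index/tag arithmetic above can be sketched as a short Python checking script (not part of the original solution; the parameters are the ones stated in the problem):

```python
# Cache parameters from part (a): 8 KB, 2-way set-associative,
# 16-byte (4-word) blocks, 32-bit addresses.
from math import log2

cache_bytes = 8192
block_bytes = 16
assoc = 2
addr_bits = 32

offset_bits = int(log2(block_bytes))                 # 4
sets = cache_bytes // (block_bytes * assoc)          # 256 sets
index_bits = int(log2(sets))                         # 8
tag_bits = addr_bits - offset_bits - index_bits      # 20

print(offset_bits, index_bits, tag_bits)  # 4 8 20
```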
--------------------------
b) #overhead bits/block = #tag bits + #dirty bit + #valid bit
			= 20 + 1 + 1
			= 22 bits

   #overhead bits/set = (#overhead bits/block)*set assoc. + #MRU bit
		      = 22*2 + 1
		      = 45 bits

   total # bits/set = #overhead bits/set + #data bits
		    = 45 + 2 * 4 words/block * 4 bytes/word * 8 bits/byte
		    = 45 + 256
		    = 301 bits

   % of overhead bits = 45/301 = 15%
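The same style of check for part (b), reusing the 20-bit tag from part (a):

```python
# Overhead per set: two blocks (2-way), each carrying tag + dirty + valid
# bits, plus one MRU bit per set for the replacement choice.
tag_bits = 20
block_overhead = tag_bits + 1 + 1              # 22 bits/block
set_overhead = block_overhead * 2 + 1          # 45 bits/set
data_bits = 2 * 4 * 4 * 8                      # 2 blocks * 4 words * 4 bytes * 8 bits
total_bits = set_overhead + data_bits          # 301 bits/set

print(set_overhead, total_bits)                # 45 301
print(round(100 * set_overhead / total_bits))  # 15 (percent, rounded)
```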

-----------------------

c) For the explanation, refer to Solution for question (1a) of HW#8 from Spring 1996. In addition to the description there, the displaced cache line must be written back to main memory because it is dirty. Between deciding which cache line to evict and loading the new line, the cache will have to send the dirty line to memory: construct the appropriate address from the line tag and cache set, send the address to the MMU, and write the data over the bus.


Cycles required = Cache Access + Write Block to Mem +
                  Read Block from Mem + Cache Access
                = 1 + (10 + 2) + (10 + 2) + 1
                = 26 cycles

-----------------------

d) Here are the six cases:
	cache hit read 
	cache hit write
	cache miss read with clean eviction
	cache miss read with dirty eviction
	cache miss write requiring write allocate with clean eviction
	cache miss write requiring write allocate with dirty eviction

T_Avg = (%Hit1)((%read)T_Acc1 + (%write)(T_Acc1 + T_Write1)) +
        (%Miss1)((%read)(T_Acc1 + T_Fill1 + (%Dirty1)T_WB1) +
                 (%write)(T_Acc1 + (T_Fill1 + T_Write1) + (%Dirty1)T_WB1))
      = 0.90(0.67(1) + 0.33(1 + 1)) +
        0.10(0.67(1 + 13 + 0.25(12)) +
             0.33(1 + (13 + 1) + 0.25(12)))
      = 1.197 + 1.733
      = 2.93
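A direct Python transcription of the part (d) weighted average, using the same percentages and latencies as above:

```python
hit, miss = 0.90, 0.10        # L1 hit/miss rates
read, write = 0.67, 0.33      # read/write mix
dirty = 0.25                  # fraction of evictions that are dirty
t_acc, t_write, t_fill, t_wb = 1, 1, 13, 12   # cycles

t_avg = (hit * (read * t_acc + write * (t_acc + t_write)) +
         miss * (read * (t_acc + t_fill + dirty * t_wb) +
                 write * (t_acc + t_fill + t_write + dirty * t_wb)))
print(round(t_avg, 2))  # 2.93
```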

------------------------------------------------------------------------

2.
a) With 32MBytes of memory divided into 4 banks, each bank is 8MBytes.
   So we will need (8MBytes * 8 bits/byte) / 4Mbit = 16 chips.

b) 10 lines are row addresses for chips, 10 lines are column (those
   two are the same for all banks), 2 choose the bank, and 3 choose the
   byte within the bank.

c) No, with a 64 bit wide bus, you will need 64/4 = 16 chips per bank.
   With 4 banks, you will get 4Mbit * 16 * 4 = 256Mbit = 32MByte.

d) No, with a 64 bit wide bus, you will need 64/16 = 4 chips per bank.
   With 4 banks, you will get 16Mbit * 4 * 4 = 256Mbit = 32MByte.

e) Yes, with a 64 bit wide bus, you will need 64/8 = 8 chips per bank.
   With 4 banks, you will get 4Mbit * 8 * 4 = 128Mbit = 16MBytes.
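The chip-count arithmetic in parts (c)-(e) follows one pattern, sketched below. The chip widths (4, 16, and 8 bits) are inferred from the divisions in the solution and should be read as assumptions:

```python
def chips_per_bank(bus_bits, chip_width_bits):
    # a bank must span the full bus width
    return bus_bits // chip_width_bits

def total_mbytes(chip_mbit, chips, banks=4):
    # capacity in MBytes: chip size (Mbit) * chips/bank * banks / 8
    return chip_mbit * chips * banks // 8

# part (c): 4 Mbit chips, 4 bits wide -> 16 chips/bank, 32 MB total
print(chips_per_bank(64, 4), total_mbytes(4, 16))   # 16 32
# part (d): 16 Mbit chips, 16 bits wide -> 4 chips/bank, 32 MB total
print(chips_per_bank(64, 16), total_mbytes(16, 4))  # 4 32
# part (e): 4 Mbit chips, 8 bits wide -> 8 chips/bank, only 16 MB
print(chips_per_bank(64, 8), total_mbytes(4, 8))    # 8 16
```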

f) #processor cycles to fill L2 cache line =
      ((#cycle_transmit_addr + #cycle_for_data * #banks)*(memory cycle time) +
      RAS time + CAS time) / processor cycle time

     = ((1 + 1*4)*(1/50MHz * 10^9) + 60 + 30) / (1/200MHz * 10^9)
     = 190 / 5
     = 38 processor cycles

g) #processor cycles to fill L2 cache line = 
     = ((2*(1 + 1*4)*(1/50MHz * 10^9)) + 60 + 30 + 150) / (1/200MHz * 10^9)
     = 440 / 5
     = 88 processor cycles
		NOTE: cycle time includes RAS and CAS for first access.

h) Same as in (g), but FPM allows multiple CAS for every RAS.
Therefore, there is no full cycle time needed, nor a second RAS:
     = ((2*(1 + 1*4)*(1/50MHz * 10^9)) + 60 + 30 + 30) / (1/200MHz * 10^9)
     = 320 / 5
     = 64 processor cycles
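The timing arithmetic in parts (f)-(h) can be checked with the figures used above: a 50 MHz memory bus (20 ns/cycle), a 200 MHz processor (5 ns/cycle), RAS = 60 ns, CAS = 30 ns, and a 150 ns full cycle time before the second plain-DRAM access in (g):

```python
BUS_NS = 20   # one 50 MHz memory-bus cycle
CPU_NS = 5    # one 200 MHz processor cycle

def cpu_cycles(ns):
    return ns // CPU_NS

burst_ns = (1 + 1 * 4) * BUS_NS   # 1 address cycle + 4 data cycles

f = cpu_cycles(burst_ns + 60 + 30)              # (f) single access
g = cpu_cycles(2 * burst_ns + 60 + 30 + 150)    # (g) second access pays full cycle time
h = cpu_cycles(2 * burst_ns + 60 + 30 + 30)     # (h) FPM: second access pays CAS only
print(f, g, h)  # 38 88 64
```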

--------------------------------------------------------------------------

Question 3:

a) (12 marks)

Adding up the percentages in the table, load/stores are 31.7% of
instructions in this trace.  Cache misses are ~10% of load/stores.
Therefore 0.317 * 0.1 = 0.0317 of instructions miss in the L1 cache.

After a miss, the penalty depends on how long until the next access.
With a base CPI of 1/3 and a miss penalty of 10 cycles, the penalty
for a miss on the very next instruction will be (subtract 1/3 of a
cycle to execute that second load) 10 - 1/3 = 9 2/3 cycles.  Average
L1 miss penalty will be:

	  0.136 * 29/3 + 0.047 * 28/3 + 0.034 * 27/3
	+ 0.039 * 26/3 + 0.021 * 25/3 + 0.011 * 24/3
	+ 0.007 * 23/3 + 0.007 * 22/3 + 0.003 * 21/3
	+ 0.004 * 20/3 + 0.003 * 19/3 + 0.001 * 18/3
	+ 0.001 * (17 + 16 + 15 + 14) / 3		= 2.859 cycles

So CPI penalty will be 
	0.0317 miss / instr * 2.859 cycles / miss = 0.0906 cycles / instr
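The average miss penalty can be reproduced directly from the distance distribution; the probabilities below are the trace-derived values used in the sum above:

```python
# p[k]: probability that the next load/store issues k+1 instructions
# after a missing load/store; its penalty is 10 - (k+1)/3 cycles at
# a base CPI of 1/3.
p = [0.136, 0.047, 0.034, 0.039, 0.021, 0.011,
     0.007, 0.007, 0.003, 0.004, 0.003, 0.001,
     0.001, 0.001, 0.001, 0.001]

penalty = sum(pk * (30 - (k + 1)) / 3 for k, pk in enumerate(p))
print(round(penalty, 3))        # 2.859 cycles per L1 miss

cpi_penalty = 0.0317 * penalty  # misses/instr * cycles/miss
print(round(cpi_penalty, 4))    # 0.0906 cycles/instr
```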

The processor may have advanced out-of-order issue (to enable it to
continue executing instructions during a cache miss), but the
blocking cache increases its CPI by (0.0906 / 0.33333) = 27.2%.

This experiment would suggest that adding a non-blocking cache would
remove a significant bottleneck in this processor's performance.

b) (5 for each of at least 2 methodology problems, and 3 for drawing a
    conclusion about the analysis)

Problems with my methodology and conclusions:

- cc may not be representative of the programs you normally run.  It
  may have more or fewer load/stores, spaced closer together or farther
  apart than in the average program.  Fixing this problem requires
  collecting data from more traces.

- the only instruction dependency considered in this analysis is
  the single cache port.  It could be that cache misses already cause
  a large number of stall cycles, just because subsequent instructions
  need the result of the miss.  Adding a non-blocking cache would not
  help these cases -- such cache miss penalty cycles should not be
  counted as caused by the blocking cache but rather instruction
  dependencies.  Correcting this problem will require a superscalar
  simulation of the processor's execution.

- the non-blocking cache so far discussed is merely able to satisfy
  more cache hits while a cache miss is being handled in the background.
  My analysis takes no account of multiple cache misses occurring while
  the first is being serviced.  Nor does the analysis take account of an
  access to a word on the same line as the first miss.  Correcting these
  inadequacies in the experiment would require a cache simulator.

There are probably even more problems with my experiment.

These three complaints show that my experiment both inflates the
penalty attributed to the blocking cache and inflates the expected
benefits of a non-blocking cache.  Furthermore, I did not come close to a
thorough examination of user application behaviour.  My experiment is
not a proper analysis of the disadvantages of a blocking cache.

--------------------------------------------------------------------------

Question 4: (10 points each)

The merits of each answer will have to be judged individually.  A good
answer will describe a user application with some specific behaviour,
and how the particular processor is well designed to handle that
behaviour.

As an example (student's answers should be longer), an application
with a heavily used working set larger than 8KB but smaller than 256KB
would benefit substantially from the large fast L2 cache on the P6.
An example of such an application might be a scrollable terminal
window with a few hundred lines of text in which the user is scrolling
back and forth.

--------------------------------------------------------------------------

Last modified: Apr 97