CPSC 418: Assignment #5

------------------------------------------------------------------------------
			HOMEWORK ASSIGNMENT #5

Out: Thurs Mar 27/97					Due: Fri Apr 11/97
CS 418							Spring Session 1997
Marks: 100
------------------------------------------------------------------------------

Question 1: Cache misses (25 marks)

Consider a two level memory hierarchy (L1 cache and main memory).  The cache
has the following parameters:

	unified instruction and data
	8 KBytes
	2 way set associative
	4 words / block
	byte addressed
	LRU replacement
	90% hit rate
	33% of cache accesses are writes
	write back with write allocate
	25% of cache contents are dirty at any one time
	1 cycle cache hit access time
	transfer between registers and cache: 1 word/cycle
	10 cycle main memory access time
	transfer between cache and main memory: 2 words/cycle
	32 bit physical address

Notes:

- The transfer from cache to registers can be overlapped with the
  cache access for a read hit, but the cache access and then transfer
  from registers to cache occurs sequentially for a write hit.

- There is no overlap of access with transfer for main memory.

- On a read miss, data may be delivered to the processor the cycle
  after it appears in the cache (so the cache to register transfer
  overlaps with the second cycle of memory to cache transfer).

- On a write allocate, the write may not proceed until the transfer
  from main memory is complete.

- The write of an evicted dirty block to main memory must occur before
  the read of the evicting block, and no access or transfer overlap is
  possible during the write.

----------------------------------- 
a) Show the division of physical address into tag, index, and offset for this
cache.
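
As a way to check the arithmetic mechanically (this is a sketch, not a
substitute for showing the division; the 4-byte word size is an assumption,
since the question gives a 32-bit address but does not state the word size):

```python
# Derive tag/index/offset widths from the cache parameters above.
# Assumption: 4 bytes per word (32-bit words).
import math

ADDR_BITS = 32            # physical address width
CACHE_BYTES = 8 * 1024    # 8 KBytes
WAYS = 2                  # 2-way set associative
WORDS_PER_BLOCK = 4
BYTES_PER_WORD = 4        # assumed

block_bytes = WORDS_PER_BLOCK * BYTES_PER_WORD   # bytes per block
sets = CACHE_BYTES // (block_bytes * WAYS)       # sets = capacity / (block * ways)
offset_bits = int(math.log2(block_bytes))        # selects byte within block
index_bits = int(math.log2(sets))                # selects set
tag_bits = ADDR_BITS - index_bits - offset_bits  # remaining high bits

print(tag_bits, index_bits, offset_bits)
```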

b) The cache needs the following overhead bits:
	? tag bits/block (from part (a))
	1 dirty bit/block
	1 valid bit/block
	1 MRU bit/set (LRU is opposite of MRU for 2-way)
What percentage of the bits stored in the cache are overhead bits?
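
One way to organize the tally (a sketch only; the tag width is left as a
parameter since it comes from part (a), and `block_bits` assumes 4-byte
words):

```python
# Overhead fraction = overhead bits / (overhead bits + data bits), per set.
def overhead_fraction(tag_bits, ways=2, block_bits=4 * 32, mru_bits_per_set=1):
    per_block_overhead = tag_bits + 1 + 1          # tag + dirty + valid
    overhead = ways * per_block_overhead + mru_bits_per_set
    data = ways * block_bits                       # data bits stored per set
    return overhead / (overhead + data)
```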

c) With pictures, words and/or pseudo-code, describe how the cache detects a
cache miss and handles the subsequent fill.  Assume that the block being
replaced is a dirty block.  How many cycles are required?  (Good example
diagrams of caches are PH fig 7.9 p.470, PH fig 7.27 p.509, or HP figure 8.11
p.415)

d) What is the average access time for the cache? (hint: there are 6 different
cases that can occur when the cache is accessed)
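
A generic helper for combining the cases (the six cases, their
probabilities, and their cycle counts are for you to work out; the example
call below uses made-up numbers purely to show the shape of the
calculation):

```python
# Weighted average over disjoint access cases.
def avg_access_time(cases):
    """cases: iterable of (probability, cycles) covering all outcomes."""
    total_p = sum(p for p, _ in cases)
    assert abs(total_p - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * c for p, c in cases)

# Hypothetical two-case example (NOT the answer to part (d)):
print(avg_access_time([(0.9, 1), (0.1, 11)]))
```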

------------------------------------------------------------------------------






Question 2: Addressing (30 marks)

Consider the following interleaved memory structure (using old-fashioned
DRAM):

	4 bytes/word
	byte addressing (but all addresses must be word aligned)
	32 byte L2 cache line
	8 byte wide data bus to cache
	4 banks of memory
	memory bus at ~50 MHz
	processor bus at 200MHz
	60 ns RAS time
	30 ns CAS time
	150 ns cycle time
	1M x 4 DRAM chips
	32 MBytes of memory (25 bit physical address)
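
These parameters can be cross-checked against each other (a sketch under
the assumption, mine, that each bank spans the full 8-byte data bus):

```python
# Consistency check: chips per bank, total capacity, and address width.
BUS_BITS = 8 * 8         # 8-byte data bus
CHIP_WIDTH = 4           # bits of output per 1M x 4 DRAM chip
CHIP_DEPTH = 1 << 20     # 1M locations per chip
BANKS = 4

chips_per_bank = BUS_BITS // CHIP_WIDTH                          # chips to fill the bus
bytes_total = BANKS * chips_per_bank * CHIP_DEPTH * CHIP_WIDTH // 8
print(chips_per_bank, bytes_total == 32 * 2**20, 2**25 == 32 * 2**20)
```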

a) How many memory chips are needed in each bank?

b) What is each of the 25 address lines used for (there are at least three
different groups of lines)?

c) Given the bus width, number of banks, and chip type, could this memory
system be built with less than 32 MBytes?  If yes, what is the minimum?  If
no, why not?

d) Answer (c) if the system were built with 1M x 16 DRAM chips.

e) Answer (c) if the system were built with 512K x 8 DRAM chips. 

f) Assuming that one memory bus cycle is required to transmit the addresses to
all chips, and one memory bus cycle is required to return a bus width of data
(the data from 1 memory bank), how many processor cycles does it take to fill
an L2 cache line (including memory access time)?
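
A unit-conversion helper that may save arithmetic in parts (f) through (h)
(the actual cycle breakdown is for you to work out; the 50 MHz memory bus
figure is the "~50 MHz" from the parameter list taken at face value):

```python
# Convert nanoseconds into 200 MHz processor cycles, rounding up.
import math

PROC_NS = 1e9 / 200e6    # 5 ns per processor cycle
MEM_NS = 1e9 / 50e6      # 20 ns per memory bus cycle

def to_proc_cycles(ns):
    return math.ceil(ns / PROC_NS)

print(to_proc_cycles(MEM_NS))   # one memory bus cycle in processor cycles
print(to_proc_cycles(60))       # 60 ns RAS time in processor cycles
```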

g) If the L2 cache line were doubled in size, it would require two passes
through all the memory banks to fill one line.  How many processor cycles
would the doubled block take to fill (don't forget regular DRAM cycle time)?

h) By careful assignment of address lines, the two passes of (g) could be
arranged to fall on sequential addresses.  Ignoring the possibility of an
access crossing into a different row, redo the calculation for (g) using FPM
(or EDO) DRAM.

------------------------------------------------------------------------------

Question 3: Non-blocking Caches (25 marks)

In an effort to validate the results presented by Sohi & Franklin, I have
performed the following experiment:

I have a trace of the memory references generated by running cc on the DLX
architecture.  Each memory reference is either an instruction fetch, a load or
a store.  I can count the number of instructions fetched (and thus executed)
between loads/stores.  Below is a table showing percentages for a large number
of memory references (~1 M references, approximately 25% of which were
loads/stores):

#instr	1	2	3	4	5	6	7	8	9
ratio	0.136	0.047	0.034	0.039	0.021	0.011	0.007	0.007	0.003

#instr	10	11	12	13	14	15	16	17+
ratio	0.004	0.003	0.001	0.001	0.001	0.001	0.001	0.000

where ratio is the fraction of all instructions executed that were a
load/store whose next load/store was executed exactly #instr later.

For example, the table shows that for all instructions executed, a pair of
load/stores occurred exactly 1 instruction apart 13.6% of the time (i.e. one
load/store followed another immediately).  In 0.4% of the instructions, there
was a load/store whose nearest subsequent load/store was exactly 10
instructions away.  Finally, there are never more than 16 instructions
between load/stores.

The first case indicates that, on a cache miss, a blocking cache will force
the processor to stall on the very next instruction 13.6% of the time.  The
number of instructions lost due to this stall will depend on the base CPI and
the length of a cache miss.
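
The table can be loaded directly for the calculation in part (a) (a sketch
only; the two derived quantities printed here are starting points, not the
answer):

```python
# Distribution of distances (in instructions) between consecutive
# loads/stores, taken from the table above.
dist = {1: 0.136, 2: 0.047, 3: 0.034, 4: 0.039, 5: 0.021,
        6: 0.011, 7: 0.007, 8: 0.007, 9: 0.003, 10: 0.004,
        11: 0.003, 12: 0.001, 13: 0.001, 14: 0.001, 15: 0.001,
        16: 0.001}

# Fraction of all instructions that are loads/stores (sum of the ratios),
# and the expected distance to the next load/store given a load/store.
p_mem = sum(dist.values())
expected_gap = sum(k * p for k, p in dist.items()) / p_mem
print(round(p_mem, 3), round(expected_gap, 2))
```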

a) Based on these numbers, calculate the performance loss (% change in CPI)
due to memory references for a blocking cache on a modern superscalar
architecture.  Assume a quad issue machine which can attain a base 3
instr/cycle (including non-memory dependencies and branch handling) and can
continue to issue and execute instructions during a cache miss, as long as a
second cache access does not occur.  The L1 dcache has a miss rate of 10%, and
the average access time to L2 is 10 cycles.

b) Comment on my experimental methodology (including my assumptions) in 1/2 to
1 page.  Do you think the results calculated in (a) accurately reflect the
disadvantages of a blocking cache?

------------------------------------------------------------------------------

Question 4: The Big Kahuna ISA -- x86 (20 marks)

NOTE:  Do any two of the following three parts:

a) The Cyrix M1 uses register renaming and slightly out-of-order issue to
achieve better performance ratings despite a lower clock rate (even with a
deeper pipeline, the larger IC process restricts the speed of the M1).  In a
paragraph, describe an application which would benefit substantially from any
M1 innovation(s) and run much faster on an M1 than a Pentium.  Explain your
reasoning.

b) The Intel P6 uses extensive out-of-order issue and register renaming to
outperform all its competitors.  In a paragraph, describe an application which
would benefit substantially from any P6 innovation(s) and run much faster on a
P6 than a Pentium or M1.  Explain your reasoning.

c) Despite all the hype, some code just doesn't run much faster.  In a
paragraph, describe an application in which the innovations of the M1, K5 and
P6 have not made a significant impact in speed -- a fast Pentium will do
almost as well.  Explain your reasoning.

Note: The general "16-bit code" problem is not an appropriate answer to this
part, since it affects only the P6, and is caused by a variety of
optimizations for 32-bit code.

------------------------------------------------------------------------------

Question 5: Evaluation

For each of the questions 1-4, how long did it take to answer?

Do you have any other comments about the assignment?

------------------------------------------------------------------------------
Last modified: Apr 97