CPSC 418, Problem Set #2, Due 10am, February 24, 1998

Problem 1: Precise Exceptions (5 pts)

In class, we saw the contortions required to guarantee precise exception handling in a pipelined or superscalar processor. Consider the following simple scheme to handle exceptions. Whenever an exception occurs, stop fetching any more instructions. Then, let all instructions that are already in the processor finish executing. Now, every instruction that has been fetched has completed, and every instruction that hasn't been fetched hasn't been completed, so we can just set the saved PC to point to the last instruction fetched. Does this method guarantee precise exceptions? Why or why not?

Answer: No. Although we do satisfy the two conditions that everything before the saved PC has completed and everything after the saved PC hasn't started, the saved PC does not necessarily point to the instruction that caused the exception. Furthermore, some exceptions (e.g. a page fault) don't permit the faulting instruction to finish, so we can't always wait for everything in the processor to complete.

Problem 2: Branch Prediction

Assume floating-point registers f0..f31 and integer registers r0..r31. fsub, fdiv, and fadd are single-precision floating-point instructions. Initially, assume the registers have the following values:

  f0 = 0.0
  f1 = 1.0
  f2 = 2.0
  f3 = 999999.0
  r1 = 999999

Consider this code fragment:

loop:
L1:  andi r2, r1, #3      -- r2 = r1 & 0x3 (r2 = r1 mod 4)
L2:  beq  r2, #3, minus   -- if r2==3 goto minus:
L3:  fdiv f4, f1, f3      -- f4 = 1.0/f3
L4:  fadd f0, f0, f4      -- f0 += f4
L5:  br   next            -- goto next:
minus:
L6:  fdiv f4, f1, f3      -- f4 = 1.0/f3
L7:  fsub f0, f0, f4      -- f0 -= f4
next:
L8:  fsub f3, f3, f2      -- f3 -= 2.0
L9:  subi r1, r1, #2      -- r1 -= 2
L10: bge  r1, #0, loop    -- if r1>=0 goto loop:

Note that there are three branches: L2, L5, and L10.

(a) How many times is each branch executed? (5 pts)

L2 is executed 500000 times. L5 is executed 250000 times. L10 is executed 500000 times.

(b) If we use BTFN (backward taken, forward not taken), what percentage of the branches do we predict correctly? (Include the correct "predictions" of L5.) (5 pts)

L2 is correct 250000 times. L5 is correct 250000 times. L10 is correct 499999 times. Therefore, the prediction rate is 999999/1250000, or just under 80%. Alternatively, if you assumed that the unconditional branch L5 is mispredicted every time because it is a forward branch, you get 749999/1250000, or just under 60%.

(c) If we use 2-bit saturating counters for each branch, initialized to 10 (weakly predict taken) for backward branches and 01 (weakly predict not taken) for forward branches, what will be the counter values for each branch after the loop terminates? (5 pts)

L2 will be 01. L5 will be 11. L10 will be 10.

(d) What is the prediction rate for part (c)? (5 pts)

L2 will be mispredicted every time. L5 will be correct every time except the first time (if you include L5 in the prediction mechanism). L10 will be correct 499999 times. Therefore, the prediction rate is 749998/1250000, or just under 60%. Alternatively, if you assumed that L5's counter gets initialized to 11, since the branch is unconditional, you get 749999/1250000.
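The counts in part (a) and the counter behavior in parts (c) and (d) can be checked with a small simulation. The following is a minimal C sketch (not part of the original handout; the function and variable names are illustrative) that replays the loop's control flow, tallies how often each branch executes and is taken, and drives the three 2-bit saturating counters with the resulting outcomes.

#include <stdio.h>

/* 2-bit saturating counter: 00 and 01 predict not taken, 10 and 11 predict taken. */
static int update(int ctr, int taken)
{
    if (taken)  return ctr < 3 ? ctr + 1 : 3;
    else        return ctr > 0 ? ctr - 1 : 0;
}

int main(void)
{
    long execL2 = 0, takenL2 = 0, execL5 = 0, execL10 = 0, takenL10 = 0;
    long correct = 0, total = 0;
    int ctrL2 = 1, ctrL5 = 1, ctrL10 = 2;   /* initial values from part (c) */

    for (long r1 = 999999; ; r1 -= 2) {     /* the r1 -= 2 stands in for L9 */
        int r2 = r1 & 3;                    /* L1: andi r2, r1, #3          */

        execL2++; total++;                  /* L2: beq r2, #3, minus        */
        int t = (r2 == 3);
        takenL2 += t;
        correct += ((ctrL2 >= 2) == t);
        ctrL2 = update(ctrL2, t);

        if (!t) {                           /* fall-through path            */
            execL5++; total++;              /* L5: br next (always taken)   */
            correct += (ctrL5 >= 2);
            ctrL5 = update(ctrL5, 1);
        }

        execL10++; total++;                 /* L10: bge r1, #0, loop, with  */
        t = (r1 - 2 >= 0);                  /* r1 already decremented by L9 */
        takenL10 += t;
        correct += ((ctrL10 >= 2) == t);
        ctrL10 = update(ctrL10, t);
        if (!t) break;
    }
    printf("L2: %ld (taken %ld), L5: %ld, L10: %ld (taken %ld)\n",
           execL2, takenL2, execL5, execL10, takenL10);
    printf("2-bit counters correct: %ld / %ld\n", correct, total);
    return 0;
}

Running it reports 500000, 250000, and 500000 executions for L2, L5, and L10, and 749998 correct predictions out of 1250000, matching the answers above.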
(e) Suppose we use a 2-level branch predictor, with a separate 3-bit history register for each branch (initialized to 000), and a single, global pattern table for all branches (with all entries initialized to 10, weakly predict taken). Show the history registers and pattern table after one pass through the loop (including executing L10). (10 pts)

L2's history was 000 and becomes 100 (taken branch); pattern entry 000 becomes 11. L5 doesn't get executed, so its history stays 000. L10's history was 000 and becomes 100 (taken branch); pattern entry 000 stays 11. Therefore, the pattern table is still 10 for everything, except entry 000, which is 11.

(f) Show the history registers and pattern table after another pass through the loop. (10 pts)

L2's history was 100 and becomes 010 (not taken); pattern entry 100 becomes 01. L5's history was 000 and becomes 100 (taken); pattern entry 000 stays 11. L10's history was 100 and becomes 110 (taken); pattern entry 100 becomes 10. Therefore, the pattern table is still 10 for everything, except entry 000, which is 11.

(g) In the preceding two iterations, the two-level scheme correctly predicted 3 out of 5 branches -- not very good. As the loop continues, will the two-level scheme perform well or poorly? Justify your answer. (10 pts)

It will perform well. We were just seeing start-up transients. As the loop continues, L5 and L10 will both have history registers equal to 111, and the pattern entry for 111 will be 11, correctly predicting those branches. In addition, the history register for L2 will alternate between 010 and 101. The pattern entry for 010 will be 11, and the entry for 101 will be 00, so L2 will be correctly predicted on almost every iteration. The total prediction rate will be almost 100%.
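Here is a minimal C sketch of the two-level scheme assumed in parts (e)-(g): each branch keeps its own 3-bit history register, with the newest outcome shifted in as the high-order bit (so 000 followed by a taken branch becomes 100), and all branches share an 8-entry pattern table of 2-bit saturating counters initialized to 10. The names are illustrative, not from the handout. Replaying the first two passes through the loop reproduces the histories and pattern-table entries given above.

#include <stdio.h>

static unsigned pattern[8] = {2, 2, 2, 2, 2, 2, 2, 2};  /* 10 = weakly taken */

/* Predict using the pattern-table entry selected by this branch's history,
 * then update that entry and shift the outcome into the history register
 * (newest bit at the MSB).  Returns 1 if the prediction was correct. */
static int predict_and_update(unsigned *hist, int taken)
{
    unsigned idx = *hist & 7;
    int predicted_taken = (pattern[idx] >= 2);

    if (taken) { if (pattern[idx] < 3) pattern[idx]++; }
    else       { if (pattern[idx] > 0) pattern[idx]--; }
    *hist = ((taken ? 4u : 0u) | (*hist >> 1)) & 7;
    return predicted_taken == taken;
}

static void show(const char *name, unsigned h)
{
    printf("%s history = %u%u%u\n", name, (h >> 2) & 1, (h >> 1) & 1, h & 1);
}

int main(void)
{
    unsigned histL2 = 0, histL5 = 0, histL10 = 0;

    /* Pass 1 (r1 = 999999): L2 taken, L5 not executed, L10 taken. */
    predict_and_update(&histL2, 1);
    predict_and_update(&histL10, 1);

    /* Pass 2 (r1 = 999997): L2 not taken, L5 taken, L10 taken. */
    predict_and_update(&histL2, 0);
    predict_and_update(&histL5, 1);
    predict_and_update(&histL10, 1);

    show("L2", histL2); show("L5", histL5); show("L10", histL10);
    for (unsigned i = 0; i < 8; i++)
        printf("pattern[%u%u%u] = %u\n",
               (i >> 2) & 1, (i >> 1) & 1, i & 1, pattern[i]);
    return 0;
}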
Problem 3: Non-Blocking and Victim Caches

For this problem, we will combine the banked, non-blocking caches from the Sohi and Franklin paper with the victim cache idea presented in the Jouppi and Wilton paper. Assume a machine with 32-bit addresses, byte addressable. We have a 16KByte L1 data cache, divided into 4 banks of 4KBytes each. The L1 cache is direct-mapped. The L1 is also non-blocking, with 1 MSHR per bank. Backing up the L1 is an 8-entry, fully-associative victim cache, with a least-recently-used eviction policy. Cache lines are 16 bytes each in both the L1 and L2 caches. Assume that the caches are initialized to garbage values, so all accesses will miss until we put something there.

(a) How many bits of the 32-bit address are devoted to tag, cache line index, bank select, and byte offset, respectively, when looking up an address in the L1? For the remainder of this problem, assume the address fields are in that order (i.e., tag bits are high order, byte offset is lowest order). (5 pts)

Working from the bottom up, the byte offset is 4 bits (16-byte lines) and the bank select is 2 bits (4 banks). Each bank has 4K/16 = 256 lines, so the cache line index is 8 bits. The tag is 32-(4+2+8) = 18 bits. (A short sketch decomposing all of the addresses used in this problem appears after part (h).)

(b) The processor requests a read at memory address 0x1000. (The address is written in hexadecimal. Ask me if you don't know what this means.) This will miss in both the L1 and the victim cache, so the result has to be read from memory. Indicate which line of which bank in the L1 is evicted. Show what information the MSHR for that bank contains. What are the tag bits for that line? (5 pts)

The bottom 4 bits are the byte offset (0x0), and the next 2 bits are the bank select, so we want bank 0. The next 8 bits give the cache line index, which is 0x40. The tag bits are 18 bits of 0s. The MSHR of bank 0 indicates that it is filling a read request for line 0x40.

(c) Assume that the miss from (b) hasn't been filled from memory yet. Next, we receive a request to read location 0x1020. Which bank and cache line handle this request? Does it block? Why or why not? (5 pts)

Bank 2, cache line 0x40. No, it does not block, because each bank has its own MSHR.

(d) Assume neither of the preceding misses has been filled from memory yet. We receive a request to read location 0x1040. Which bank and cache line handle this request? Does it block? Why or why not? (5 pts)

Bank 0, cache line 0x41. Yes, it blocks: the MSHR for bank 0 is still busy.

(e) Now, assume the memory misses in parts (b), (c), and (d) have all been filled from memory, returning values B, C, and D respectively. (Assume that B, C, and D indicate the full 16 bytes of the line.) What are the contents of the caches and MSHRs? (5 pts)

The victim cache is still unused. The MSHRs are all empty. Bank 0, line 0x40 contains B. Bank 0, line 0x41 contains D. Bank 2, line 0x40 contains C.

(f) Next, a read to location 0x11000 arrives. What are the contents of the caches and MSHRs? (Remember that since the victim cache is fully associative, the tag in the victim cache is the full 32 bits.) (5 pts)

Banks 1-3 have no change. The victim cache has an entry with tag 0x1000, data B. Bank 0, line 0x40 now has tag 0x4 and undefined data (waiting for the miss). The MSHR for bank 0 indicates that it is filling a request for line 0x40.

(g) Assume the read in (f) completes, filling in data value F into the appropriate line of the appropriate bank. Next comes a request for address 0x21000. After this request is filled from memory (with value G), what are the contents of the caches and MSHRs? (5 pts)

The MSHRs are all empty. The victim cache now has two entries: tag 0x1000, data B; and tag 0x11000, data F. Banks 1-3 are unchanged. Bank 0, line 0x40 now has tag 0x8, data G.

(h) Now, another request for location 0x1000 comes in. Show the contents of the caches and MSHRs after servicing this request. (10 pts)

The MSHRs are still empty. The victim cache hits, the L1 misses, and the current entry at bank 0, line 0x40 is evicted to the victim cache. Therefore, the victim cache ends up with two entries: tag 0x11000, data F; and tag 0x21000, data G. L1 bank 0, line 0x40 now has tag 0x0, data B. Banks 1-3 are unchanged.
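For reference, the address split computed throughout Problem 3 can be written out directly. The following minimal C sketch (illustrative only, not from the handout) decomposes each address used above into the byte offset, bank select, line index, and tag fields defined in part (a).

#include <stdio.h>

int main(void)
{
    unsigned addrs[] = {0x1000, 0x1020, 0x1040, 0x11000, 0x21000};

    for (int i = 0; i < 5; i++) {
        unsigned a      = addrs[i];
        unsigned offset = a & 0xF;           /* bits  3..0  : byte offset  */
        unsigned bank   = (a >> 4) & 0x3;    /* bits  5..4  : bank select  */
        unsigned index  = (a >> 6) & 0xFF;   /* bits 13..6  : line index   */
        unsigned tag    = a >> 14;           /* bits 31..14 : tag          */
        printf("0x%05x -> bank %u, line 0x%02x, tag 0x%x, offset %u\n",
               a, bank, index, tag, offset);
    }
    return 0;
}

Running it gives bank 0, line 0x40 for 0x1000, 0x11000, and 0x21000 (with tags 0x0, 0x4, and 0x8), bank 2, line 0x40 for 0x1020, and bank 0, line 0x41 for 0x1040, matching the answers above.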