CPSC 418, Problem Set #2, Due 10am, February 23, 1998

Problem 1:

Assume floating-point registers f0..f31 and integer registers r0..r31.
fmul, fdiv, and fadd are single-precision floating-point instructions.
The processor has the following functional units:

    1 floating-point divider        6 cycle latency
    1 floating-point add/multiply   2 cycle latency for add or multiply
    1 integer unit                  1 cycle latency
    1 branch unit                   1 cycle latency

All units are fully pipelined (i.e., you can start a new operation every
cycle, but the results aren't available until the indicated latency).

Initially, assume the registers have the following values:

    f0 = 1.0  f1 = 0.0  f2 = 1.0  f3 = 0.0  f4 = 1.0  r1 = 10

Consider this code fragment:

    loop:
    L1: fadd f1, f1, f0   -- f1 += 1.0
    L2: fmul f2, f2, f1   -- f2 *= f1
    L3: fdiv f3, f0, f2   -- f3 = 1.0/f2
    L4: fadd f4, f4, f3   -- f4 += f3
    L5: subi r1, r1, #1   -- r1--
    L6: bne r1, loop      -- if r1 != 0 goto loop

(a) Show a good execution schedule for the functional units for one
iteration through the loop. How many cycles does one pass of the loop
take to EXECUTE? Ignore time spent in other pipeline stages; count just
the time in execution, as in the Fisher and Rau paper.

We don't actually have much room for ILP, as there is a chain of data
dependencies from L1 to L2 to L3 to L4. So, you get something like:

    Cycle  FDIV  FADD/FMUL  Int  Br
      1    --    L1         L5   --
      2    --    --         --   L6
      3    --    L2         --   --
      4    --    --         --   --
      5    L3    --         --   --
      6    --    --         --   --
      7    --    --         --   --
      8    --    --         --   --
      9    --    --         --   --
     10    --    --         --   --
     11    --    L4         --   --
     12    --    --         --   --

This takes 12 cycles per loop iteration.

(b) If we unroll the loop once (ignoring the case where r1 is odd), we
get the following code:

    loop:
    L1:  fadd f1, f1, f0  -- f1 += 1.0
    L2:  fmul f2, f2, f1  -- f2 *= f1
    L3:  fdiv f3, f0, f2  -- f3 = 1.0/f2
    L4:  fadd f4, f4, f3  -- f4 += f3
    L5:  fadd f1, f1, f0  -- f1 += 1.0
    L6:  fmul f2, f2, f1  -- f2 *= f1
    L7:  fdiv f3, f0, f2  -- f3 = 1.0/f2
    L8:  fadd f4, f4, f3  -- f4 += f3
    L9:  subi r1, r1, #2  -- r1 -= 2
    L10: bne r1, loop     -- if r1 != 0 goto loop

Show a good execution schedule for this unrolled loop. How many cycles
does it take now? About how much speed-up do you get by unrolling?

Things are much better now that we have two copies of the loop body to
play with. Unfortunately, we have only one FADD/FMUL unit, but it's
pipelined. So, we get something like:

    Cycle  FDIV  FADD/FMUL  Int  Br
      1    --    L1         L9   --
      2    --    --         --   L10
      3    --    L2         --   --
      4    --    L5         --   --
      5    L3    --         --   --
      6    --    L6         --   --
      7    --    --         --   --
      8    L7    --         --   --
      9    --    --         --   --
     10    --    --         --   --
     11    --    L4         --   --
     12    --    --         --   --
     13    --    --         --   --
     14    --    L8         --   --
     15    --    --         --   --

This takes 15 cycles per pass (two original iterations), so we get a
speedup of 2*12/15 = 1.6x.
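The 12-cycle figure is just the sum of the latencies along the
dependence chain (2 + 2 + 6 + 2 = 12). Here is a minimal C sketch of
that calculation (ours, not part of the handout) that reproduces the
issue and completion cycles in the table for part (a):

    #include <stdio.h>

    /* Walk the serial dependence chain L1 -> L2 -> L3 -> L4: each op can
     * issue only after its predecessor's result is ready, and an op
     * issued in cycle c with latency n has its result available at the
     * end of cycle c + n - 1. */
    int main(void) {
        const char *name[]  = { "L1 fadd", "L2 fmul", "L3 fdiv", "L4 fadd" };
        const int latency[] = { 2, 2, 6, 2 };
        int issue = 1;
        for (int i = 0; i < 4; i++) {
            int done = issue + latency[i] - 1;
            printf("%s: issue cycle %2d, result end of cycle %2d\n",
                   name[i], issue, done);
            issue = done + 1;   /* the successor waits for this result */
        }
        return 0;
    }

The unrolled loop in part (b) overlaps the second chain with the idle
slots of the first, which is why it costs 15 cycles rather than 24.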
Problem 2:

Now, we'll examine how a superscalar processor would execute the code in
Problem 1(b). For this problem, assume the same functional units as in
Problem 1, with the same latencies. Assume the processor is capable of
fetching two instructions per clock, decoding and dispatching two
instructions per clock, issuing up to four operations per clock (one per
functional unit), and retiring up to four operations per clock. The
processor has four reservation station slots and six reorder buffer
slots. We start the processor in a clean state (no pending instructions)
at L1.

Over the first 4 clock cycles, the processor state evolves as follows.
(Reservation station entries are written as tuples
(op, src1, value1, ready?, src2, value2, ready?, dst) and reorder buffer
entries as (value, ready, dst); a C sketch of these structures appears
after Problem 3.)

Clock 1:
  Fetched Instructions: L1 and L2
  Reservation Stations:
    RS1:
    RS2:
    RS3:
    RS4:
  Retirement Buffer:
  Register Mapping Table:
    r1: r1   f0: f0   f1: f1   f2: f2   f3: f3   f4: f4
  Register Values:
    r1: 10   f0: 1.0  f1: 0.0  f2: 1.0  f3: 0.0  f4: 1.0
  Reorder Buffer:
    rob1:   rob2:   rob3:   rob4:   rob5:   rob6:
  Functional Units:
    FDIV:   FADD/FMUL:   Integer ALU:   Branch:

Clock 2:
  Fetched Instructions: L3 and L4
  Reservation Stations:
    RS1: (fadd, f1, 0.0, yes, f0, 1.0, yes, rob1)
    RS2: (fmul, f2, 1.0, yes, rob1, ?, no, rob2)
    RS3:
    RS4:
  Retirement Buffer:
  Register Mapping Table:
    r1: r1   f0: f0   f1: rob1  f2: rob2  f3: f3   f4: f4
  Register Values:
    r1: 10   f0: 1.0  f1: 0.0  f2: 1.0  f3: 0.0  f4: 1.0
  Reorder Buffer:
    rob1: (?, no, f1)
    rob2: (?, no, f2)
    rob3:   rob4:   rob5:   rob6:
  Functional Units:
    FDIV:   FADD/FMUL:   Integer ALU:   Branch:

Clock 3:
  Fetched Instructions: L5 and L6
  Reservation Stations:
    RS1:
    RS2: (fmul, f2, 1.0, yes, rob1, ?, no, rob2)
    RS3: (fdiv, f0, 1.0, yes, rob2, ?, no, rob3)
    RS4: (fadd, f4, 1.0, yes, rob3, ?, no, rob4)
  Retirement Buffer:
  Register Mapping Table:
    r1: r1   f0: f0   f1: rob1  f2: rob2  f3: rob3  f4: rob4
  Register Values:
    r1: 10   f0: 1.0  f1: 0.0  f2: 1.0  f3: 0.0  f4: 1.0
  Reorder Buffer:
    rob1: (?, no, f1)
    rob2: (?, no, f2)
    rob3: (?, no, f3)
    rob4: (?, no, f4)
    rob5:   rob6:
  Functional Units:
    FDIV:
    FADD/FMUL: 0.0 + 1.0 -> rob1
    Integer ALU:
    Branch:

Clock 4:
  Fetched Instructions: L6 and L7 (L6 is held over from Clock 3: only
  one reservation station, RS1, came free, so only L5 could dispatch)
  Reservation Stations:
    RS1: (fadd, rob1, ?, no, f0, 1.0, yes, rob5)
    RS2: (fmul, f2, 1.0, yes, rob1, ?, no, rob2)
    RS3: (fdiv, f0, 1.0, yes, rob2, ?, no, rob3)
    RS4: (fadd, f4, 1.0, yes, rob3, ?, no, rob4)
  Retirement Buffer:
  Register Mapping Table:
    r1: r1   f0: f0   f1: rob5  f2: rob2  f3: rob3  f4: rob4
  Register Values:
    r1: 10   f0: 1.0  f1: 0.0  f2: 1.0  f3: 0.0  f4: 1.0
  Reorder Buffer:
    rob1: (?, no, f1)
    rob2: (?, no, f2)
    rob3: (?, no, f3)
    rob4: (?, no, f4)
    rob5: (?, no, f1)
    rob6:
  Functional Units:
    FDIV:
    FADD/FMUL: 0.0 + 1.0 -> rob1 (second cycle of the 2-cycle fadd)
    Integer ALU:
    Branch:

Problem 3:

For the superscalar processor in Problem 2, consider the following
scheme for handling exceptions: When an exception occurs in a functional
unit during execution of an instruction, immediately commit everything
that is legal to commit in the retirement buffer, and flush everything
in the reservation stations, register mapping table, reorder buffer, and
functional units. Set the PC to point to the instruction that raised the
exception. Does this scheme guarantee precise exceptions? Why or why
not?

No. We do have the PC pointing to the right location, and we're also
guaranteed that none of the instructions after the one raising the
exception will have modified the architectural state (since that only
happens when entries commit from the retirement buffer). However, some
instructions that came before the faulting instruction might not have
completed yet, and their in-flight results are thrown away by the flush.
For example, suppose we had an FDIV followed by an FMUL that overflows.
The FDIV will still be in the functional unit (because of its long
latency) when the FMUL raises the exception, so its result is never
written back; and since the PC is set to the FMUL, the FDIV is never
re-executed either. The architectural state therefore does not match
sequential execution up to the faulting instruction.
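For reference, the state tracked in the Problem 2 listings can be
written down as C structures. This is a sketch for orientation only: the
field names are ours; the handout fixes only the tuple orders shown
above.

    /* An operand is either an architectural register whose value was
     * read at dispatch, or a reorder-buffer slot still being computed. */
    typedef enum { FADD, FMUL, FDIV, SUBI, BNE } Op;

    typedef struct {
        const char *name;  /* register or rob slot supplying the operand */
        float value;       /* operand value; meaningful only when ready  */
        int ready;         /* 1 once the producing instruction finished  */
    } Operand;

    /* Reservation station slot:
     * (op, src1, value1, ready?, src2, value2, ready?, dst). */
    typedef struct {
        Op op;
        Operand src1, src2;
        const char *dst;   /* reorder buffer slot receiving the result   */
    } RSEntry;

    /* Reorder buffer slot: (value, ready, dst). */
    typedef struct {
        float value;       /* result, filled in when the unit completes  */
        int ready;         /* 1 when valid and the entry may retire      */
        const char *dst;   /* architectural register written at retire   */
    } ROBEntry;

    /* Example: Clock 2, RS1 above --
     * (fadd, f1, 0.0, yes, f0, 1.0, yes, rob1) */
    RSEntry clock2_rs1 = { FADD, { "f1", 0.0f, 1 },
                                 { "f0", 1.0f, 1 }, "rob1" };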
Problem 4:

Initially, assume the registers have the following values:

    f0 = 0.0  f1 = 1.0  f2 = 2.0  f3 = 1999.0  r1 = 1000

Consider this code fragment:

    loop:
    L1: andi r2, r1, #1  -- r2 = r1 & 0x1 (r2 = r1 mod 2)
    L2: fdiv f4, f1, f3  -- f4 = 1.0/f3
    L3: beq r2, L6       -- if r2 == 0 goto L6
    L4: fadd f0, f0, f4  -- f0 += f4
    L5: br L7            -- goto L7
    L6: fsub f0, f0, f4  -- f0 -= f4
    L7: fsub f3, f3, f2  -- f3 -= 2.0
    L8: subi r1, r1, #1  -- r1 -= 1
    L9: bne r1, loop     -- if r1 != 0 goto loop

Note that there are three branches: L3, L5, and L9. Since L5 is
unconditional, you should assume that it is "predicted" correctly in all
parts of this problem.

(a) How many times is each branch executed?

L3: 1000 times. L5: 500 times. L9: 1000 times.

(b) If we predict always taken, what percentage do we predict correctly?
(Include the correct "predictions" of L5.)

L3 is taken half of the time, so we get 500 correct. L5 is taken all the
time, so we get 500 correct. L9 is taken all but the last time, so we
get 999 correct. Therefore, the prediction rate is

    (500 + 500 + 999)/(1000 + 500 + 1000) = 1999/2500 = 79.96%

(c) If we use 2-bit saturating counters for branches L3 and L9,
initialized to 10 (weakly predict taken), what will be the counter
values for each branch after the loop terminates?

L3 will toggle between 10 and 11. On the last pass through the loop, the
branch isn't taken, so the final value will be 10. L5 we are ignoring,
because it's unconditional. (Otherwise, it'd be 11, strongly predict
taken.) L9 will go to 11 and stay there until the last iteration, which
puts it back to 10.

(d) What is the prediction rate for part (c)?

Every prediction will be (weakly or strongly) predict taken, so we get
exactly the same results as in part (b): 79.96%.

(e) Suppose we use a 2-level branch predictor, with a separate 2-bit
history register for each branch (initialized to 00), and a single,
global pattern table for all branches (with all entries initialized to
00, strongly predict not taken). Show the history registers and pattern
table after one pass through the loop (including executing L9).

On the first pass, r1 = 1000, so r2 = 0: L3 is taken and L5 is skipped.
We reach L3: history is 00, pattern[00] = 00, strongly predict not
taken, wrong. Update pattern[00] = 01, history[L3] = 10. We reach L9:
history is 00, pattern[00] = 01, weakly predict not taken, wrong. Update
pattern[00] = 10, history[L9] = 10. So, we end up with:

    history[L3] = 10
    history[L5] = 00 (optional -- depending on whether we ignore L5)
    history[L9] = 10

    pattern[00] = 10
    pattern[01] = 00
    pattern[10] = 00
    pattern[11] = 00

(f) Show the history registers and pattern table after another pass
through the loop.

The problem is ambiguous depending on whether or not we let L5 affect
the global pattern table.

IGNORING L5:
Reach L3: history[L3] = 10, pattern[10] = 00, predict SNT, right.
Update pattern[10] = 00, history[L3] = 01.
Reach L5: ignore.
Reach L9: history[L9] = 10, pattern[10] = 00, predict SNT, wrong.
Update pattern[10] = 01, history[L9] = 11.
We end up with:

    history[L3] = 01
    history[L5] = ignored
    history[L9] = 11

    pattern[00] = 10
    pattern[01] = 00
    pattern[10] = 01
    pattern[11] = 00

INCLUDING L5:
Reach L3: history[L3] = 10, pattern[10] = 00, predict SNT, right.
Update pattern[10] = 00, history[L3] = 01.
Reach L5: history[L5] = 00, pattern[00] = 10, predict WT, right.
Update pattern[00] = 11, history[L5] = 10.
Reach L9: history[L9] = 10, pattern[10] = 00, predict SNT, wrong.
Update pattern[10] = 01, history[L9] = 11.
We end up with:

    history[L3] = 01
    history[L5] = 10
    history[L9] = 11

    pattern[00] = 11
    pattern[01] = 00
    pattern[10] = 01
    pattern[11] = 00
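A small simulation sketch of the two-level predictor (ours, not from the
handout) can replay the whole loop. It assumes the conventions used in
the traces above: the newest outcome shifts into the high bit of each
branch's history, the pattern table holds 2-bit saturating counters with
values of 2 or 3 predicting taken, and L5 updates the tables (the
"INCLUDING L5" reading of part (f)). Its dumps after passes 1 and 2
match the tables above (values 0..3 stand for binary 00..11), and
running it to completion previews part (g):

    #include <stdio.h>

    static int pattern[4];   /* global pattern table, init 00 (SNT)    */
    static int history[3];   /* histories for L3, L5, L9, init 00      */

    /* Predict branch b, then update with actual outcome `taken`.
     * Returns 1 if the prediction was correct. */
    static int predict_and_update(int b, int taken) {
        int h = history[b];
        int correct = ((pattern[h] >= 2) == taken);
        if (taken) { if (pattern[h] < 3) pattern[h]++; }
        else       { if (pattern[h] > 0) pattern[h]--; }
        history[b] = (taken << 1) | (h >> 1);  /* newest -> high bit   */
        return correct;
    }

    int main(void) {
        int right = 0, total = 0;
        for (int r1 = 1000; r1 != 0; r1--) {
            int odd = r1 & 1;
            right += predict_and_update(0, !odd);  /* L3: taken iff r2==0 */
            total++;
            if (odd) {                             /* L5: always taken    */
                right += predict_and_update(1, 1);
                total++;
            }
            right += predict_and_update(2, r1 - 1 != 0);  /* L9 */
            total++;
            if (r1 >= 999)   /* after passes 1 and 2: compare (e), (f) */
                printf("after pass %d: hist L3=%d L5=%d L9=%d "
                       "pattern=[%d,%d,%d,%d]\n", 1001 - r1,
                       history[0], history[1], history[2],
                       pattern[0], pattern[1], pattern[2], pattern[3]);
        }
        printf("two-level accuracy: %d/%d = %.2f%%\n",
               right, total, 100.0 * right / total);
        return 0;
    }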
(g) As the loop continues, will the two-level scheme perform well or
poorly? Justify your answer.

It will perform very well. Within a few iterations, the history register
for L9 (and L5) will be 11, and pattern[11] will saturate at 11,
strongly predict taken, so almost all occurrences of L9 (and L5) will be
predicted correctly. The history register for L3 will toggle between 01
and 10; pattern[01] will be 11 (strongly taken) and pattern[10] will be
00 (strongly not taken), so almost all instances of L3 will be predicted
correctly, too. Overall, we will get very close to 100%.

Problem 5: Non-Blocking and Victim Caches

For this problem, we will combine the banked, non-blocking caches from
the Sohi and Franklin paper with the victim cache idea presented in the
Jouppi and Wilton paper. Assume a machine with 32-bit addresses, byte
addressable. We have a 64KByte L1 data cache, divided into 4 banks of
16KBytes each. The L1 cache is direct-mapped. The L1 is also
non-blocking, with 1 MSHR per bank. Backing up the L1 is a 16-entry,
direct-mapped victim cache. Cache lines are 16 bytes each in both the L1
and victim caches. Assume that the caches are initialized to garbage
values, so all accesses will miss until we put something there.

(a) How many bits of the 32-bit address are devoted to tag, cache line
index, bank select, and byte offset, respectively, when looking up an
address in the L1? For the remainder of this problem, assume the address
fields are in that order (i.e., tag bits are high order, byte offset is
lowest order).

    tag          16 bits
    index        10 bits
    bank select   2 bits
    byte offset   4 bits

(A sketch that decodes the addresses used below appears after part (f).)

(b) The processor requests a read at memory address 0x1000. (The address
is written in hexadecimal. Ask me if you don't know what this means.)
This will miss in both the L1 and the victim cache, so the result has to
be read from memory. Indicate which line of which bank in the L1 is
evicted. What are the tag bits for that line? Show what information the
MSHR for that bank contains.

Bank 0, Line 0x40, Tag Bits 0x0.
The MSHR in Bank 0 indicates a pending read request for Line 0x40.

(c) Assume that the miss from (b) hasn't been filled from memory yet.
Next, we receive a request to read location 0x1020. Which bank and cache
line handle this request? Does it block? Why or why not?

Bank 2, Line 0x40, Tag Bits 0x0.
It does not block, because each bank has its own MSHR. The MSHR for
Bank 2 now indicates a pending read for Line 0x40.

(d) Assume neither of the preceding misses has been filled from memory
yet. We receive a request to read location 0x2020. Which bank and cache
line handle this request? Does it block? Why or why not?

Bank 2, Line 0x80, Tag Bits 0x0.
It must block, because the single MSHR for Bank 2 is still busy with the
miss from (c).

(e) Now, assume the memory misses in parts (b), (c), and (d) have all
been filled from memory, returning values B, C, and D respectively.
(Assume that B, C, and D indicate the full 16 bytes of the line.) What
are the contents of the caches and MSHRs?

All MSHRs are now empty.
Bank 0, Line 0x40 has Tag 0x0, Data B.
Bank 2, Line 0x40 has Tag 0x0, Data C.
Bank 2, Line 0x80 has Tag 0x0, Data D.
Other cache lines are unknown.

(f) Next, a read to location 0x12020 arrives. What are the contents of
the caches and MSHRs?

The read is to Bank 2, Line 0x80, Tag 0x1, which misses. The MSHR for
Bank 2 indicates a pending read for Line 0x80. The evicted line (Tag
0x0, Data D) moves to the victim cache:
Victim Cache Line 2 now has Tag 0x000020, Data D.
Bank 0, Line 0x40 still has Tag 0x0, Data B.
Bank 2, Line 0x40 still has Tag 0x0, Data C.
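As a sanity check on parts (b) through (f), here is a small address
decoder for this cache geometry (ours, not from the handout). It also
computes the victim-cache line and tag used in parts (f) and (g): with
16 entries of 16 bytes, the victim index is the 4 bits above the byte
offset and the victim tag is everything above that.

    #include <stdio.h>

    /* Field widths from part (a): 16-bit tag, 10-bit index, 2-bit bank
     * select, 4-bit byte offset, ordered from high bits to low. */
    int main(void) {
        unsigned addrs[] = { 0x1000, 0x1020, 0x2020, 0x12020 };
        for (int i = 0; i < 4; i++) {
            unsigned a = addrs[i];
            unsigned offset = a & 0xF;           /* bits  3..0  */
            unsigned bank   = (a >> 4) & 0x3;    /* bits  5..4  */
            unsigned index  = (a >> 6) & 0x3FF;  /* bits 15..6  */
            unsigned tag    = a >> 16;           /* bits 31..16 */
            unsigned vline  = (a >> 4) & 0xF;    /* victim cache index */
            unsigned vtag   = a >> 8;            /* victim cache tag   */
            printf("0x%05X -> L1 bank %u, line 0x%X, tag 0x%X, "
                   "offset %u; victim line %u, tag 0x%06X\n",
                   a, bank, index, tag, offset, vline, vtag);
        }
        return 0;
    }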
(g) Assume the read in (f) completes, filling in data value F into the
appropriate line of the appropriate bank. Next comes a request for
address 0x2020. What are the contents of the caches after this miss is
handled?

The read is to Bank 2, Line 0x80, Tag 0x0, which misses in the L1 but
hits in the victim cache, so the two lines get swapped:
Victim Cache Line 2 now has Tag 0x000120, Data F.
Bank 0, Line 0x40 still has Tag 0x0, Data B.
Bank 2, Line 0x40 still has Tag 0x0, Data C.
Bank 2, Line 0x80 now has Tag 0x0, Data D.

Extra Credit: (1 pt) What number is the loop in Problem 1 computing (in
the limit with infinite iterations)?

The number e. After k iterations of the loop body, f1 = k, f2 = k!,
f3 = 1/k!, and f4 = 1 + 1/1! + 1/2! + ... + 1/k!, which is the Taylor
series for e.
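A direct C transcription of the Problem 1 loop (ours, not from the
handout) confirms this numerically:

    #include <stdio.h>

    int main(void) {
        float f0 = 1.0f, f1 = 0.0f, f2 = 1.0f, f3 = 0.0f, f4 = 1.0f;
        for (int r1 = 10; r1 != 0; r1--) {
            f1 += f0;      /* L1: f1 = iteration count k        */
            f2 *= f1;      /* L2: f2 = k!                       */
            f3 = f0 / f2;  /* L3: f3 = 1/k!                     */
            f4 += f3;      /* L4: f4 = 1 + sum of 1/k! terms    */
        }
        printf("f4 = %.7f (e = 2.7182818...)\n", f4);
        return 0;
    }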