Assume floating-point registers f0..f31, integer registers r0..r31
fmul, fdiv, fadd are single-precision floating point instructions.

1 floating point divider	10 cycle latency
1 floating point add/multiply	3 cycle latency for add or multiply
1 integer unit			1 cycle latency
1 branch unit			1 cycle latency

All units fully pipelined (i.e., you can start a new operation every cycle,
but the results aren't available until the indicated latency).

f0 = 1.0
f1 = 1.0
f2 = 0.0
r1 = 1000

loop:
l1:	fmul	f3, f0, f0	-- f3 = f0*f0
l2:	fdiv	f4, f1, f3	-- f4 = 1.0/f3
l3:	fadd	f2, f2, f4	-- f2 += f4
l4:	fadd	f0, f0, f1	-- f0 += 1.0
l5:	subi	r1, r1, #1	-- r1--
l6:	bne	r1, loop

Show the execution schedule for the functional units for one iteration through
the loop.
How many cycles does one pass of the loop take to EXECUTE?  Ignore time
spent in other pipeline stages, just the time in execution, as in the
Fisher and Rau paper.

	Time\Unit	FDIV	FADD/FMUL	INT	BR
	1		---	l1		l5	---
	2		---	l4		---	---
	3		---	---		---	---
	4		l2	---		---	---
	5		--	---		---	---
	6		--	---		---	---
	7		--	---		---	---
	8		--	---		---	---
	9		--	---		---	---
	10		--	---		---	---
	11		--	---		---	---
	12		--	---		---	---
	13		--	---		---	---
	14		--	l3		---	---
	15		--	---		---	---
	16		--	---		---	l6

	16 cycles

If the floating point adder/multiplier were not pipelined (so you could
only do one operation at a time), would the code run slower?  Why or why not?

	No, the code would take the same amount of time.  l4 would start
	executing at cycle 4, but that's still much faster than the division.

Unroll the loop once (so you have two copies).  You do not need to show
the prologue to handle when r1 is odd.

	loop:
	l1:	fmul	f3, f0, f0	-- f3 = f0*f0
	l2:	fdiv	f4, f1, f3	-- f4 = 1.0/f3
	l3:	fadd	f2, f2, f4	-- f2 += f4
	l4:	fadd	f0, f0, f1	-- f0 += 1.0
	l5:	subi	r1, r1, #1	-- r1--
	l6:	fmul	f3, f0, f0	-- f3 = f0*f0
	l7:	fdiv	f4, f1, f3	-- f4 = 1.0/f3
	l8:	fadd	f2, f2, f4	-- f2 += f4
	l9:	fadd	f0, f0, f1	-- f0 += 1.0
	l10:	subi	r1, r1, #1	-- r1--
	l11:	bne	r1, loop

Assuming pipelined functional units again, show the execution schedule
for the pipelined (two copies) loop.  How many cycles does it take?

	Time\Unit	FDIV	FADD/FMUL	INT	BR
	1		---	l1		l5	---
	2		---	l4		l10	---
	3		---	---		---	---
	4		l2	---		---	---
	5		---	l6		---	---
	6		---	l9		---	---
	7		---	---		---	---
	8		l7	---		---	---
	9		---	---		---	---
	10		---	---		---	---
	11		---	---		---	---
	12		---	---		---	---
	13		---	---		---	---
	14		---	l3		---	---
	15		---	---		---	---
	16		---	---		---	---
	17		---	---		---	---
	18		---	l8		---	---
	19		---	---		---	---
	20		---	---		---	l11

Return to the original loop, and now assume that the functional units are
unpipelined (can only do one operation at a time) with the given latencies.
Using a pipeline, reservation station, and reorder buffer model as we did
in class, show the state of the reservation station, register mapping table,
and reorder buffer at the end of the clock cycle when l6 is fetched.
(Show earlier cycles for partial credit.)
	loop:
	l1:	fmul	f3, f0, f0	-- f3 = f0*f0
	l2:	fdiv	f4, f1, f3	-- f4 = 1.0/f3
	l3:	fadd	f2, f2, f4	-- f2 += f4
	l4:	fadd	f0, f0, f1	-- f0 += 1.0
	l5:	subi	r1, r1, #1	-- r1--
	l6:	bne	r1, loop

	CYCLE 1:
		fetched l1
	CYCLE 2:
		fetched l2
		dispatch l1 to reservation station
			rs1: op fmul, f0=1.0, f0=1.0, dest rob1
			mapping table f3->rob1
	CYCLE 3:
		fetched l3
		dispatch l2 to reservation station
		issue rs1 to fadd/fmul
			rs1: op fmul, f0=1.0, f0=1.0, dest rob1
			rs2: op fdiv, f1=1.0, rob1=?, dest rob2
			mapping table f3->rob1, f4->rob2
	CYCLE 4:
		fetched l4
		dispatch l3 to reservation station
		fadd/fmul still busy with rs1
			rs1: op fmul, f0=1.0, f0=1.0, dest rob1
			rs2: op fdiv, f1=1.0, rob1=?, dest rob2
			rs3: op fadd, f2=0.0, rob2=?, dest rob3
			mapping table f3->rob1, f4->rob2, f2->rob3
	CYCLE 5:
		fetched l5
		dispatch l4 to reservation station
		fadd/fmul completes rs1
			rs1: op fmul, f0=1.0, f0=1.0, dest rob1
			rs2: op fdiv, f1=1.0, rob1=?, dest rob2
			rs3: op fadd, f2=0.0, rob2=?, dest rob3
			rs4: op fadd, f0=1.0, f1=1.0, dest rob4
			mapping table f3->rob1, f4->rob2, f2->rob3, f0->rob4
	CYCLE 6:
		fetched l6
		dispatch l5 to reservation station
		rs1 waiting to commit
		rs2 grabs result of rs1
		issue rs2 to fdiv
		issue rs4 to fadd/fmul
			rs1: op fmul, f0=1.0, f0=1.0, dest rob1=1.0
			rs2: op fdiv, f1=1.0, rob1=1.0, dest rob2
			rs3: op fadd, f2=0.0, rob2=?, dest rob3
			rs4: op fadd, f0=1.0, f1=1.0, dest rob4
			rs5: op sub, r1=1000, imm=1, dest rob5
			mapping table f3->rob1, f4->rob2, f2->rob3, f0->rob4, r1->rob5

Extra Credit:  What number is the loop computing (in the limit with
infinite iterations)?
	Reimann's zeta function of 2, which is equal to pi^2/6.