Assume floating-point registers f0..f31, integer registers r0..r31 fmul, fdiv, fadd are single-precision floating point instructions. 1 floating point divider 10 cycle latency 1 floating point add/multiply 3 cycle latency for add or multiply 1 integer unit 1 cycle latency 1 branch unit 1 cycle latency All units fully pipelined (i.e., you can start a new operation every cycle, but the results aren't available until the indicated latency). f0 = 1.0 f1 = 1.0 f2 = 0.0 r1 = 1000 loop: l1: fmul f3, f0, f0 -- f3 = f0*f0 l2: fdiv f4, f1, f3 -- f4 = 1.0/f3 l3: fadd f2, f2, f4 -- f2 += f4 l4: fadd f0, f0, f1 -- f0 += 1.0 l5: subi r1, r1, #1 -- r1-- l6: bne r1, loop Show the execution schedule for the functional units for one iteration through the loop. How many cycles does one pass of the loop take to EXECUTE? Ignore time spent in other pipeline stages, just the time in execution, as in the Fisher and Rau paper. Time\Unit FDIV FADD/FMUL INT BR 1 --- l1 l5 --- 2 --- l4 --- --- 3 --- --- --- --- 4 l2 --- --- --- 5 -- --- --- --- 6 -- --- --- --- 7 -- --- --- --- 8 -- --- --- --- 9 -- --- --- --- 10 -- --- --- --- 11 -- --- --- --- 12 -- --- --- --- 13 -- --- --- --- 14 -- l3 --- --- 15 -- --- --- --- 16 -- --- --- l6 16 cycles If the floating point adder/multiplier were not pipelined (so you could only do one operation at a time), would the code run slower? Why or why not? No, the code would take the same amount of time. l4 would start executing at cycle 4, but that's still much faster than the division. Unroll the loop once (so you have two copies). You do not need to show the prologue to handle when r1 is odd. loop: l1: fmul f3, f0, f0 -- f3 = f0*f0 l2: fdiv f4, f1, f3 -- f4 = 1.0/f3 l3: fadd f2, f2, f4 -- f2 += f4 l4: fadd f0, f0, f1 -- f0 += 1.0 l5: subi r1, r1, #1 -- r1-- l6: fmul f3, f0, f0 -- f3 = f0*f0 l7: fdiv f4, f1, f3 -- f4 = 1.0/f3 l8: fadd f2, f2, f4 -- f2 += f4 l9: fadd f0, f0, f1 -- f0 += 1.0 l10: subi r1, r1, #1 -- r1-- l11: bne r1, loop Assuming pipelined functional units again, show the execution schedule for the pipelined (two copies) loop. How many cycles does it take? Time\Unit FDIV FADD/FMUL INT BR 1 --- l1 l5 --- 2 --- l4 l10 --- 3 --- --- --- --- 4 l2 --- --- --- 5 --- l6 --- --- 6 --- l9 --- --- 7 --- --- --- --- 8 l7 --- --- --- 9 --- --- --- --- 10 --- --- --- --- 11 --- --- --- --- 12 --- --- --- --- 13 --- --- --- --- 14 --- l3 --- --- 15 --- --- --- --- 16 --- --- --- --- 17 --- --- --- --- 18 --- l8 --- --- 19 --- --- --- --- 20 --- --- --- l11 Return to the original loop, and now assume that the functional units are unpipelined (can only do one operation at a time) with the given latencies. Using a pipeline, reservation station, and reorder buffer model as we did in class, show the state of the reservation station, register mapping table, and reorder buffer at the end of the clock cycle when l6 is fetched. (Show earlier cycles for partial credit.) loop: l1: fmul f3, f0, f0 -- f3 = f0*f0 l2: fdiv f4, f1, f3 -- f4 = 1.0/f3 l3: fadd f2, f2, f4 -- f2 += f4 l4: fadd f0, f0, f1 -- f0 += 1.0 l5: subi r1, r1, #1 -- r1-- l6: bne r1, loop CYCLE 1: fetched l1 CYCLE 2: fetched l2 dispatch l1 to reservation station rs1: op fmul, f0=1.0, f0=1.0, dest rob1 mapping table f3->rob1 CYCLE 3: fetched l3 dispatch l2 to reservation station issue rs1 to fadd/fmul rs1: op fmul, f0=1.0, f0=1.0, dest rob1 rs2: op fdiv, f1=1.0, rob1=?, dest rob2 mapping table f3->rob1, f4->rob2 CYCLE 4: fetched l4 dispatch l3 to reservation station fadd/fmul still busy with rs1 rs1: op fmul, f0=1.0, f0=1.0, dest rob1 rs2: op fdiv, f1=1.0, rob1=?, dest rob2 rs3: op fadd, f2=0.0, rob2=?, dest rob3 mapping table f3->rob1, f4->rob2, f2->rob3 CYCLE 5: fetched l5 dispatch l4 to reservation station fadd/fmul completes rs1 rs1: op fmul, f0=1.0, f0=1.0, dest rob1 rs2: op fdiv, f1=1.0, rob1=?, dest rob2 rs3: op fadd, f2=0.0, rob2=?, dest rob3 rs4: op fadd, f0=1.0, f1=1.0, dest rob4 mapping table f3->rob1, f4->rob2, f2->rob3, f0->rob4 CYCLE 6: fetched l6 dispatch l5 to reservation station rs1 waiting to commit rs2 grabs result of rs1 issue rs2 to fdiv issue rs4 to fadd/fmul rs1: op fmul, f0=1.0, f0=1.0, dest rob1=1.0 rs2: op fdiv, f1=1.0, rob1=1.0, dest rob2 rs3: op fadd, f2=0.0, rob2=?, dest rob3 rs4: op fadd, f0=1.0, f1=1.0, dest rob4 rs5: op sub, r1=1000, imm=1, dest rob5 mapping table f3->rob1, f4->rob2, f2->rob3, f0->rob4, r1->rob5 Extra Credit: What number is the loop computing (in the limit with infinite iterations)? Reimann's zeta function of 2, which is equal to pi^2/6.