CPSC 418, Problem Set #2, Due 10am, February 23, 1998

Problem 1:

Assume floating-point registers f0..f31 and integer registers r0..r31.
fmul, fdiv, and fadd are single-precision floating-point instructions.
The processor has the following functional units:

    1 floating-point divider        6-cycle latency
    1 floating-point add/multiply   2-cycle latency for add or multiply
    1 integer unit                  1-cycle latency
    1 branch unit                   1-cycle latency

All units are fully pipelined (i.e., you can start a new operation every
cycle, but the results aren't available until the indicated latency has
elapsed).

Initially, assume the registers have the following values:

    f0 = 1.0   f1 = 0.0   f2 = 1.0   f3 = 0.0   f4 = 1.0   r1 = 10

Consider this code fragment:

    loop:
    L1: fadd f1, f1, f0    -- f1 += 1.0
    L2: fmul f2, f2, f1    -- f2 *= f1
    L3: fdiv f3, f0, f2    -- f3 = 1.0/f2
    L4: fadd f4, f4, f3    -- f4 += f3
    L5: subi r1, r1, #1    -- r1--
    L6: bne  r1, loop      -- if r1!=0 goto loop:

(a) Show a good execution schedule for the functional units for one
    iteration through the loop.  How many cycles does one pass of the loop
    take to EXECUTE?  Ignore time spent in other pipeline stages; count
    just the time in execution, as in the Fisher and Rau paper.

(b) If we unroll the loop once (ignoring the case where r1 is odd), we get
    the following code:

    loop:
    L1:  fadd f1, f1, f0    -- f1 += 1.0
    L2:  fmul f2, f2, f1    -- f2 *= f1
    L3:  fdiv f3, f0, f2    -- f3 = 1.0/f2
    L4:  fadd f4, f4, f3    -- f4 += f3
    L5:  fadd f1, f1, f0    -- f1 += 1.0
    L6:  fmul f2, f2, f1    -- f2 *= f1
    L7:  fdiv f3, f0, f2    -- f3 = 1.0/f2
    L8:  fadd f4, f4, f3    -- f4 += f3
    L9:  subi r1, r1, #2    -- r1 -= 2
    L10: bne  r1, loop      -- if r1!=0 goto loop:

    Show a good execution schedule for this unrolled loop.  How many cycles
    does it take now?  About how much speed-up do you get by unrolling?
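The comments in the listings above pin down exactly what the loop computes.
As a minimal sketch (plain Python, reusing the register names as variables
and the initial values given above; the code is for illustration only), the
loop's behavior is:

    # Arithmetic performed by the Problem 1 loop, following the comments.
    f0, f1, f2, f3, f4 = 1.0, 0.0, 1.0, 0.0, 1.0
    r1 = 10

    while True:
        f1 = f1 + f0      # L1: fadd f1, f1, f0
        f2 = f2 * f1      # L2: fmul f2, f2, f1
        f3 = f0 / f2      # L3: fdiv f3, f0, f2
        f4 = f4 + f3      # L4: fadd f4, f4, f3
        r1 = r1 - 1       # L5: subi r1, r1, #1
        if r1 == 0:       # L6: bne  r1, loop
            break

Note the serial dependence chain L1 -> L2 -> L3 -> L4 within each
iteration; only the loop-counter instructions are independent of it.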
Problem 2:

Now we'll examine how a superscalar processor would execute the code in
Problem 1(b).  For this problem, assume the same functional units as in
Problem 1, with the same latencies.  Assume the processor is capable of
fetching two instructions per clock, decoding and dispatching two
instructions per clock, issuing up to four operations per clock (one per
functional unit), and retiring up to four operations per clock.  The
processor has four reservation station slots and six reorder buffer slots.
We start the processor in a clean state (no pending instructions) at L1.
During the first 4 clock cycles, the processor state evolves as follows:

Clock 1:
    Fetched Instructions: L1 and L2
    Reservation Stations (op, src1, value1, ready?, src2, value2, ready?, dst):
        RS1:
        RS2:
        RS3:
        RS4:
    Retirement Buffer:
    Register Mapping Table:
        r1: r1   f0: f0   f1: f1   f2: f2   f3: f3   f4: f4
    Register Values:
        r1: 10   f0: 1.0   f1: 0.0   f2: 1.0   f3: 0.0   f4: 1.0
    Reorder Buffer (value, ready, dst):
        rob1:
        rob2:
        rob3:
        rob4:
        rob5:
        rob6:
    Functional Units:
        FDIV:
        FADD/FMUL:
        Integer ALU:
        Branch:

Clock 2:
    Fetched Instructions: L3 and L4
    Reservation Stations (op, src1, value1, ready?, src2, value2, ready?, dst):
        RS1: (fadd, f1, 0.0, yes, f0, 1.0, yes, rob1)
        RS2: (fmul, f2, 1.0, yes, rob1, ?, no, rob2)
        RS3:
        RS4:
    Retirement Buffer:
    Register Mapping Table:
        r1: r1   f0: f0   f1: rob1   f2: rob2   f3: f3   f4: f4
    Register Values:
        r1: 10   f0: 1.0   f1: 0.0   f2: 1.0   f3: 0.0   f4: 1.0
    Reorder Buffer (value, ready, dst):
        rob1: (?, no, f1)
        rob2: (?, no, f2)
        rob3:
        rob4:
        rob5:
        rob6:
    Functional Units:
        FDIV:
        FADD/FMUL:
        Integer ALU:
        Branch:

Clock 3:
    SUPPLY THE MISSING STATE HERE!

Clock 4:
    Fetched Instructions: L6 and L7
    Reservation Stations (op, src1, value1, ready?, src2, value2, ready?, dst):
        RS1: (fadd, rob1, ?, no, f0, 1.0, yes, rob5)
        RS2: (fmul, f2, 1.0, yes, rob1, ?, no, rob2)
        RS3: (fdiv, f0, 1.0, yes, rob2, ?, no, rob3)
        RS4: (fadd, f4, 1.0, yes, rob3, ?, no, rob4)
    Retirement Buffer:
    Register Mapping Table:
        r1: r1   f0: f0   f1: rob5   f2: rob2   f3: rob3   f4: rob4
    Register Values:
        r1: 10   f0: 1.0   f1: 0.0   f2: 1.0   f3: 0.0   f4: 1.0
    Reorder Buffer (value, ready, dst):
        rob1: (?, no, f1)
        rob2: (?, no, f2)
        rob3: (?, no, f3)
        rob4: (?, no, f4)
        rob5: (?, no, f1)
        rob6:
    Functional Units:
        FDIV:
        FADD/FMUL: 0.0 + 1.0 -> rob1
        Integer ALU:
        Branch:

Your problem is to supply the state of the processor after Clock 3.

Problem 3:

For the superscalar processor in Problem 2, consider the following scheme
for handling exceptions: when an exception occurs in a functional unit
during execution of an instruction, immediately commit everything that is
legal to commit in the retirement buffer, and flush everything in the
reservation stations, register mapping table, reorder buffer, and
functional units.  Set the PC to point to the instruction that raised the
exception.  Does this scheme guarantee precise exceptions?  Why or why not?

Problem 4:

Initially, assume the registers have the following values:

    f0 = 0.0   f1 = 1.0   f2 = 2.0   f3 = 1999.0   r1 = 1000

Consider this code fragment:

    loop:
    L1: andi r2, r1, #1    -- r2 = r1 & 0x1 (r2 = r1 mod 2)
    L2: fdiv f4, f1, f3    -- f4 = 1.0/f3
    L3: beq  r2, L6        -- if r2==0 goto L6:
    L4: fadd f0, f0, f4    -- f0 += f4
    L5: br   L7            -- goto L7:
    L6: fsub f0, f0, f4    -- f0 -= f4
    L7: fsub f3, f3, f2    -- f3 -= 2.0
    L8: subi r1, r1, #1    -- r1 -= 1
    L9: bne  r1, loop      -- if r1!=0 goto loop:

Note that there are three branches: L3, L5, and L9.  Since L5 is
unconditional, you should assume that it is "predicted" correctly in all
parts of this problem.

(a) How many times is each branch executed?

(b) If we predict always taken, what percentage of the branches do we
    predict correctly?  (Include the correct "predictions" of L5.)

(c) If we use 2-bit saturating counters for branches L3 and L9, initialized
    to 10 (weakly predict taken), what will the counter values be for each
    branch after the loop terminates?  (A sketch of the counter update rule
    assumed here follows part (g).)

(d) What is the prediction rate for part (c)?

(e) Suppose we use a 2-level branch predictor, with a separate 2-bit
    history register for each branch (initialized to 00) and a single,
    global pattern table shared by all branches (with all entries
    initialized to 00, strongly predict not taken).  Show the history
    registers and the pattern table after one pass through the loop
    (including executing L9).

(f) Show the history registers and the pattern table after another pass
    through the loop.

(g) As the loop continues, will the two-level scheme perform well or
    poorly?  Justify your answer.
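For reference in parts (c) through (g), here is a minimal Python sketch of
the predictor state updates, under the usual conventions: the 2-bit
counters encode 00 = strongly not taken through 11 = strongly taken and
predict taken when the counter is 10 or 11, and each history register
shifts the newest outcome into its low-order bit.  These conventions (and
the Python itself) are illustrative assumptions, not extra specification.

    # 2-bit saturating counter, encoded as an integer 0..3.
    def predict(counter):
        # Predict taken in the two "taken" states (10 and 11).
        return counter >= 2

    def update(counter, taken):
        # Step one state toward the actual outcome, saturating at 0 and 3.
        return min(counter + 1, 3) if taken else max(counter - 1, 0)

    # Two-level scheme from part (e): a 2-bit history register per branch
    # and one global pattern table of 2-bit counters, indexed by history.
    history = {}                  # per-branch 2-bit history registers
    pattern_table = [0, 0, 0, 0]  # global table, one counter per history

    def two_level_predict(branch):
        h = history.setdefault(branch, 0)   # histories start at 00
        return predict(pattern_table[h])

    def two_level_update(branch, taken):
        h = history.setdefault(branch, 0)
        pattern_table[h] = update(pattern_table[h], taken)
        # Assumption: the newest outcome shifts into the low-order bit.
        history[branch] = ((h << 1) | int(taken)) & 0b11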
Problem 5: Non-Blocking and Victim Caches

For this problem, we will combine the banked, non-blocking caches from the
Sohi and Franklin paper with the victim cache idea presented in the Jouppi
and Wilton paper.  Assume a machine with 32-bit, byte-addressable
addresses.  We have a 64KByte L1 data cache, divided into 4 banks of
16KBytes each.  The L1 cache is direct-mapped.  The L1 is also
non-blocking, with 1 MSHR per bank.  Backing up the L1 is a 16-entry,
direct-mapped victim cache.  Cache lines are 16 bytes each in both the L1
and victim caches.  Assume that the caches are initialized to garbage
values, so all accesses will miss until we put something there.

(a) How many bits of the 32-bit address are devoted to the tag, cache line
    index, bank select, and byte offset, respectively, when looking up an
    address in the L1?  For the remainder of this problem, assume the
    address fields are in that order (i.e., tag bits are high order, byte
    offset is lowest order).  (A sketch of this field split appears at the
    end of the problem set.)

(b) The processor requests a read at memory address 0x1000.  (The address
    is written in hexadecimal.  Ask me if you don't know what this means.)
    This will miss in both the L1 and the victim cache, so the result has
    to be read from memory.  Indicate which line of which bank in the L1 is
    evicted.  What are the tag bits for that line?  Show what information
    the MSHR for that bank contains.

(c) Assume that the miss from (b) hasn't been filled from memory yet.
    Next, we receive a request to read location 0x1020.  Which bank and
    cache line handle this request?  Does it block?  Why or why not?

(d) Assume neither of the preceding misses has been filled from memory yet.
    We receive a request to read location 0x2020.  Which bank and cache
    line handle this request?  Does it block?  Why or why not?

(e) Now assume the memory misses in parts (b), (c), and (d) have all been
    filled from memory, returning values B, C, and D respectively.  (Assume
    that B, C, and D indicate the full 16 bytes of the line.)  What are the
    contents of the caches and MSHRs?

(f) Next, a read to location 0x12020 arrives.  What are the contents of the
    caches and MSHRs?

(g) Assume the read in (f) completes, filling data value F into the
    appropriate line of the appropriate bank.  Next comes a request for
    address 0x2020.  What are the contents of the caches after this miss is
    handled?

Extra Credit (1 pt): What number is the loop in Problem 1 computing (in the
limit, with infinite iterations)?
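Returning to Problem 5(a): the sketch below shows, in Python, how a banked,
direct-mapped cache splits an address into tag / line index / bank select /
byte offset in the order given above.  The geometry is passed in as
parameters, so the specific field widths for the 64KByte, 4-bank,
16-byte-line L1 are left for you to work out; the function and parameter
names are illustrative, not part of the assignment.

    # Field split for a banked, direct-mapped cache, assuming the layout
    # tag | line index | bank select | byte offset (high bits to low) and
    # power-of-two sizes throughout.
    def split_address(addr, cache_bytes, num_banks, line_bytes, addr_bits=32):
        offset_bits = line_bytes.bit_length() - 1      # log2(line size)
        bank_bits = num_banks.bit_length() - 1         # log2(number of banks)
        lines_per_bank = cache_bytes // num_banks // line_bytes
        index_bits = lines_per_bank.bit_length() - 1   # log2(lines per bank)
        tag_bits = addr_bits - index_bits - bank_bits - offset_bits

        offset = addr & (line_bytes - 1)
        bank = (addr >> offset_bits) & (num_banks - 1)
        index = (addr >> (offset_bits + bank_bits)) & (lines_per_bank - 1)
        tag = addr >> (offset_bits + bank_bits + index_bits)
        return {"tag": tag, "index": index, "bank": bank, "offset": offset,
                "widths": (tag_bits, index_bits, bank_bits, offset_bits)}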