Prev | CPSC 418 Home Page | Next |
CPSC 418 - Homework #3 Solution 1. a) Unfortunately, the answer depended on how many execution units you assumed. When I wrote the question originally, I was assuming unlimited execution units. In that case, the solution is: Result Shift Register Stage Functional Unit Source Valid Tag ----- ---------------------- ----- --- 1 Integer Add 1 6 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0 Reorder Buffer Entry Dest. Program Number Reg Result Exceptions Valid Counter ------ ----- ------ ---------- ----- ------- 1 (R0) (1) 2 (R0) (2) 3 (R0) (3) Head-> 4 R2 X 1 4 5 R1 X 1 5 6 R6 6 7 R0 7 Tail-> 8 Notes: - 10 stages are required in the RSR because the maximum instruction length is 10 (for a divide). - The first 3 instructions have all finished and clear the ROB -- ROB entries 1-3 should really be empty, but I've shown their former contents. - All instructions were already in the ROB at the beginning, since ROB entries are allocated on dispatch and in program order. - The multiply was independent of all of instructions, so it has completed by the time the first load finishes. Its exception has been noted in the ROB, but cannot be raised until it gets to the head of the ROB. - The first load just exitted the RSR, and so can raise its exception - The second load is now at the head of the RSR, and its exception would be noted this cycle if the first load hadn't just raised its exception (halting further execution). - The final addi has been unable to issue because the top entry of the RSR (the entry it must reserve in order to issue) has been full every cycle since the time the last RAW hazard was resolved. You could reasonably have assumed a different number of execution units. Here are the reasonable assumptions I can think of, and how they change the solution: i) Separate pipelines for the mult, loads and adds (so 3 pipelines) The mult gets to go right away, and the addi still cannot issue because it cannot reserve a spot in the RSR. The solution is the same as that given above. ii) A separate pipeline for the mult instruction The mult gets to go right away, and so the solution is the same as that given above. iii) A single execution unit for all instructions. In this case, the instructions issue in program order, and the mult is significantly delayed. The mult will issue the first cycle after the first lw. The second cycle after the first lw issues, its exception will be raised. The RSR will contain only the mult in stage 3 -- the second lw will not have issued. Only the first lw's exception will have been noted by the time it is raised. ------------------------------------------- b) i) ID stage - Disk I/O is an external interrupt that can be checked at the issue stage, so instruction issuing can be halted and all previously issued instructions can complete. ii) ROB - Operands are not actually looked at when they are fetched in the ID stage and thus it is only possible to raise exception in the ROB. iii) ID stage - Breakpoints are trap instructions that can be detected in the ID stage. iv) ROB - Don't know until the ALU stage that it is an unaligned access so it can only be raised in the ROB. ------------------------------------------- c) History Buffer 32 interconnects : From execution unit to register file 96 interconnects : From register file to ID stage (3 read ports for each register) 32 interconnects : From History buffer to register file (a write port from the tail to all registers to commit values) --- Total = 160 interconnects Future File 64 interconnects : From future file to ID stage (2 read ports for each register) 32 interconnects : From ROB to architectural file 32 interconnects : From architectural file to future file 8 interconnects : From execution unit to ROB 32 interconnects : From execution unit to future file --- Total = 168 interconnects This is not a very good question -- one mark was given for each case as long as your number of interconnects for the history buffer and future file were significantly less than the number of interconnects for the ROB with bypassing. Justin and Greg both pointed out to me that real processors will use some kind of bus system, reducing the relative cost of ROBs and explaining why ROBs are used in real processors. ------------------------------------------------------------------------ 2. a) CPI penalty = (%ConditionalBranches)*(Cond_Branch_Penalty) + (%UnconditionalBranches)*(Uncond_Branch_Penalty) = (%ConditionalBranches) * ( (%Accurate)(%Branch_Taken)(Penalty) + (%Innaccurate)(%Branch_Taken)(Penalty) + (%Innaccurate)(%Branch_Not_Taken)(Penalty) + (%Accurate)(%Branch_Not_Taken)(Penalty) ) + (%UnconditionalBranches)*(Uncond_Branch_Penalty) processor w/not-taken = (0.11)*((0.40)(0.60)(1) + (0.60)(0.60)(4) + (0.60)(0.40)(4) + (0.4)(0.4)(0)) + (0.02)*(1) = 0.11(0.24 + 1.44 + 0.96) + 0.02(1) = 0.31 CPI processor w/BTFN = (0.11)*((0.65)(0.60)(1) + (0.35)(0.60)(4) + (0.35)(0.40)(4) + (0.65)(0.4)(0)) + (0.02)*(1) = 0.11(0.39 + 0.84 + 0.56) + 0.02(1) = 0.22 CPI processor w/2-bit dynamic = (0.11)*((0.85)(0.60)(1) + (0.15)(0.60)(4) + (0.15)(0.40)(4) + (0.85)(0.40)(0)) + (0.02)*(1) = 0.11(0.51 + 0.36 + 0.24) + 0.02(1) = 0.14 CPI processor w/BTFN + BTB = (0.11)*((0.65)(0.60)(0) + (0.35)(0.60)(4) + (0.35)(0.40)(4) + (0.65)(0.40)(0)) + (0.02)*(0) = 0.11(0.84 + 0.56) = 0.15 CPI processor w/2-bit dynamic + BTB = (0.11)*((0.85)(0.60)(0) + (0.15)(0.60)(4) + (0.15)(0.40)(4) + (0.85)(0.40)(0)) + (0.02)*(0) = 0.11(0.36 + 0.24) = 0.066 CPI processor w/2 level = (0.11)*((0.95)(0.60)(1) + (0.05)(0.60)(4) + (0.05)(0.40)(4) + (0.95)(0.40)(0)) + (0.02)*(1) = 0.11(0.57 + 0.12 + 0.08) + 0.02(1) = 0.10 CPI processor w/2 level + BTB = (0.11)*((0.95)(0.60)(0) + (0.05)(0.60)(4) + (0.05)(0.40)(4) + (0.95)(0.40)(0)) + (0.02)*(0) = 0.11(0.12 + 0.08) = 0.022 CPI ------------------------------------------- b) Best cost/performance Range processor w/ not-taken > 5 CPI processor w/ BTFN 2 - 5 CPI processor w/ 2-bit dynamic None processor w/ BTFN + BTB None processor w/ 2-bit dynamic + BTB 1 - 2 CPI processor w/ 2 level None processor w/ 2 level + BTB 0 - 1 CPI ------------------------------------------- c) I don't have an answer yet -- I told you it was hard. ------------------------------------------------------------------------ 3) You have articles for the P6 and M1, and an MPR article describing the PA-8x00 is available on the interesting links page. I will try to make copies of good solutions available for photocopying, but I am not going to transcribe them here. I have at least one answer for every machine except the M2.
Prev | CPSC 418 Home Page | Next |