CPSC 418: Assignment #3 Solutions

CPSC 418 Home Page
CPSC 418 - Homework #3 Solution

1. a)

Unfortunately, the answer depended on how many execution units you
assumed.  When I wrote the question originally, I was assuming
unlimited execution units.  In that case, the solution is:

Result Shift Register

Stage	Functional Unit Source	Valid	Tag
-----	----------------------	-----	---
1	Integer Add		  1	 6
2				  0	
3				  0
4				  0
5				  0
6				  0
7				  0
8				  0
9				  0
10				  0


Reorder Buffer

	Entry	Dest.					Program
	Number	Reg	Result	Exceptions	Valid	Counter
	------	-----	------	----------	-----	-------
	1	(R0)				  	  (1)
	2	(R0)				  	  (2)
	3	(R0)				  	  (3)
Head->	4	 R2		    X		  1	   4
	5	 R1		    X		  1	   5
	6        R6				  	   6
	7	 R0					   7
Tail->	8	


Notes:

- 10 stages are required in the RSR because the maximum instruction
  length is 10 (for a divide).

- The first 3 instructions have all finished and clear the ROB -- ROB
  entries 1-3 should really be empty, but I've shown their former
  contents.

- All instructions were already in the ROB at the beginning, since ROB
  entries are allocated on dispatch and in program order.

- The multiply was independent of all of instructions, so it has
  completed by the time the first load finishes.  Its exception has been
  noted in the ROB, but cannot be raised until it gets to the head of
  the ROB.

- The first load just exitted the RSR, and so can raise its exception

- The second load is now at the head of the RSR, and its exception
  would be noted this cycle if the first load hadn't just raised its
  exception (halting further execution).

- The final addi has been unable to issue because the top entry of the
  RSR (the entry it must reserve in order to issue) has been full every
  cycle since the time the last RAW hazard was resolved.


You could reasonably have assumed a different number of execution
units.  Here are the reasonable assumptions I can think of, and how
they change the solution:

i) Separate pipelines for the mult, loads and adds (so 3 pipelines)

The mult gets to go right away, and the addi still cannot issue
because it cannot reserve a spot in the RSR.  The solution is the same
as that given above.

ii) A separate pipeline for the mult instruction

The mult gets to go right away, and so the solution is the same as
that given above.

iii) A single execution unit for all instructions.

In this case, the instructions issue in program order, and the mult is
significantly delayed.  The mult will issue the first cycle after the
first lw.  The second cycle after the first lw issues, its exception
will be raised.  The RSR will contain only the mult in stage 3 -- the
second lw will not have issued.  Only the first lw's exception will
have been noted by the time it is raised.

-------------------------------------------

b)
  i) ID stage - Disk I/O is an external interrupt that can be checked
                at the issue stage, so instruction issuing can be
                halted and all previously issued instructions can
                complete.

 ii) ROB - Operands are not actually looked at when they are fetched in
           the ID stage and thus it is only possible to raise exception
           in the ROB.

iii) ID stage - Breakpoints are trap instructions that can be detected
                in the ID stage.

 iv) ROB - Don't know until the ALU stage that it is an unaligned access
           so it can only be raised in the ROB.

-------------------------------------------

c)  History Buffer
	 32 interconnects : From execution unit to register file
	 96 interconnects : From register file to ID stage
	                    (3 read ports for each register)
	 32 interconnects : From History buffer to register file
	                    (a write port from the tail to all registers
	                    to commit values)
	---
Total = 160 interconnects


    Future File
	 64 interconnects : From future file to ID stage
	                    (2 read ports for each register)
	 32 interconnects : From ROB to architectural file
	 32 interconnects : From architectural file to future file
	  8 interconnects : From execution unit to ROB
	 32 interconnects : From execution unit to future file
	---
Total = 168 interconnects

This is not a very good question -- one mark was given for each case
as long as your number of interconnects for the history buffer and
future file were significantly less than the number of interconnects
for the ROB with bypassing.

Justin and Greg both pointed out to me that real processors will use
some kind of bus system, reducing the relative cost of ROBs and
explaining why ROBs are used in real processors.

------------------------------------------------------------------------
2. a)

CPI penalty = (%ConditionalBranches)*(Cond_Branch_Penalty) +
	      (%UnconditionalBranches)*(Uncond_Branch_Penalty)

	    = (%ConditionalBranches) *
		( (%Accurate)(%Branch_Taken)(Penalty) +
		  (%Innaccurate)(%Branch_Taken)(Penalty) +
		  (%Innaccurate)(%Branch_Not_Taken)(Penalty) +
		  (%Accurate)(%Branch_Not_Taken)(Penalty) ) +
	      (%UnconditionalBranches)*(Uncond_Branch_Penalty)

processor w/not-taken
  = (0.11)*((0.40)(0.60)(1) + (0.60)(0.60)(4) + (0.60)(0.40)(4) +
	(0.4)(0.4)(0)) + (0.02)*(1)
  = 0.11(0.24 + 1.44 + 0.96) + 0.02(1)
  = 0.31 CPI

processor w/BTFN
  = (0.11)*((0.65)(0.60)(1) + (0.35)(0.60)(4) + (0.35)(0.40)(4) +
	(0.65)(0.4)(0)) + (0.02)*(1)
  = 0.11(0.39 + 0.84 + 0.56) + 0.02(1)
  = 0.22 CPI

processor w/2-bit dynamic
  = (0.11)*((0.85)(0.60)(1) + (0.15)(0.60)(4) + (0.15)(0.40)(4) +
	(0.85)(0.40)(0)) + (0.02)*(1)
  = 0.11(0.51 + 0.36 + 0.24) + 0.02(1)
  = 0.14 CPI

processor w/BTFN + BTB
  = (0.11)*((0.65)(0.60)(0) + (0.35)(0.60)(4) + (0.35)(0.40)(4) +
	(0.65)(0.40)(0)) + (0.02)*(0)
  = 0.11(0.84 + 0.56)
  = 0.15 CPI

processor w/2-bit dynamic + BTB
  = (0.11)*((0.85)(0.60)(0) + (0.15)(0.60)(4) + (0.15)(0.40)(4) +
	(0.85)(0.40)(0)) + (0.02)*(0)
  = 0.11(0.36 + 0.24)
  = 0.066 CPI

processor w/2 level
  = (0.11)*((0.95)(0.60)(1) + (0.05)(0.60)(4) + (0.05)(0.40)(4) +
	(0.95)(0.40)(0)) + (0.02)*(1)
  = 0.11(0.57 + 0.12 + 0.08) + 0.02(1)
  = 0.10 CPI

processor w/2 level + BTB
  = (0.11)*((0.95)(0.60)(0) + (0.05)(0.60)(4) + (0.05)(0.40)(4) +
	(0.95)(0.40)(0)) + (0.02)*(0)
  = 0.11(0.12 + 0.08)
  = 0.022 CPI

-------------------------------------------

b) 
                              Best cost/performance Range
processor w/ not-taken			> 5 CPI
processor w/ BTFN			2 - 5 CPI
processor w/ 2-bit dynamic		None
processor w/ BTFN + BTB			None
processor w/ 2-bit dynamic + BTB	1 - 2 CPI
processor w/ 2 level			None
processor w/ 2 level + BTB		0 - 1 CPI

-------------------------------------------

c) I don't have an answer yet -- I told you it was hard.

------------------------------------------------------------------------

3) You have articles for the P6 and M1, and an MPR article describing
the PA-8x00 is available on the interesting links page.  I will try to
make copies of good solutions available for photocopying, but I am not
going to transcribe them here.  I have at least one answer for every
machine except the M2.
CPSC 418 Home Page
Last modified: March 97