CPSC 418: Assignment #3

CPSC 418 Home Page
------------------------------------------------------------------------------
			HOMEWORK ASSIGNMENT #3

Out: Tues Mar 4						Due: Tues Mar 18/97
CS 418							Spring Session 1997
Marks: 100
------------------------------------------------------------------------------

Question 1: Precise Interrupts (20 marks)

a) Consider the following code sequence:

	PC	Instruction
	1	add	r0, r3, r4
	2	add	r0, r0, r5
	3	add	r0, r0, r6
	4	lw	r2, 64(r0)
	5	mult	r1, r3, r5
	6	lw	r6, -128(r0)
	7	addi	r0, r0, #1024

Assume that all instructions have been dispatched to reservation
stations, and that they may issue out-of-order.  Instructions take 1
cycle to execute, other than loads (2 cycles), mult (4 cycles) and div
(10 cycles).

Use the Result Shift Register and Reorder Buffer style shown in Fig 3
of Smith & Pleszkun (p.566).  Assume the ROB includes bypassing.  Note
that exceptions are not raised until an instruction's ROB entry is at
the head, and that instruction has exited the RSR -- so an instruction
with latency k cannot raise an exception until at least k cycles after
issue, regardless of when the exception was detected.

Assume that both lw have page faults (detected 1 cycle into
execution), and the mult has an overflow (detected 3 cycles into
execution).  Show the RSR and ROB when the first exception is raised
by the ROB.

----------------------------------
b) Which of the following interrupts would be best raised by the ROB,
and which best raised by the ID stage?

	i)   disk I/O
	ii)  square root of negative FP value
	iii) debugger breakpoint
	iv)  unaligned load access on Alpha

Explain your answer for each.

----------------------------------
c) The basic ROB technique in Smith & Pleszkun uses bypassing to make
ROB data available to instructions.  Bypassing requires that every
register be able to get its value from every ROB entry.  For 32
registers and 8 ROB entries, this will require 256 interconnects.
Those are in addition to the 64 interconntects from register file to
ID stage (2 read ports for each register), 32 interconnects from ROB
to register file (a write port from the tail to all registers to
commit values), and 8 interconnects from the execution unit to ROB
(the top of RSR specifies which unit writes to which ROB entry each
cycle).  Total number of interconnects: 360.

Separate interconnects are only needed when a circuit may be used
every cycle.  Since ID can read any register, and any register may be
overwritten by any ROB entry, a full set of interconnects is needed
(8*32 connects).  But for retiring instructions, only one tail entry
can write to the register file each cycle, so only one write
connection from the ROB to the register file is required for each
register (32 connects)

Perform a similar computation for the functionally equivalent history
buffer and future file techniques in Smith & Pleszkun.  How many
interconnects does each need?

------------------------------------------------------------------------------

Question 2: Branch Prediction / Control Hazards (20 marks)

Consider a basic pipelined architecture which uses condition codes to
implement conditional branches.  When a branch instruction is
encountered, the following penalties occur:

Conditional Branch correctly predicted taken or Unconditional Branch:
	1 cycle (to redirect fetch stream)
Conditional Branch correctly predicted not taken:
	0 cycles
Conditional Branch incorrectly predicted:
	4 cycles (including redirect)

We will assume 11% of instructions are conditional branches, and 2%
are unconditional (taken from DLX measurements in HP p.181).  For the
effectiveness of prediction strategies, you may use the numbers in
MPR's "New Algorithm Improves Branch Prediction" Figure 4.  For
example, assume that 60% of the conditional branches encountered in
practice are taken.

----------------------------------
a) We are now given a choice between several different processors:
				      redirect	     accuracy	
				       penalty			
processor w/ not-taken          	1		40%	
processor w/ BTFN               	1		65%	
processor w/ 2-bit dynamic      	1		85%	
processor w/ BTFN + BTB         	0		65%	
processor w/ 2-bit dynamic + BTB	0		85%	
processor w/ 2 level            	1		95%	
processor w/ 2 level + BTB      	0		95%	

For each processor, calculate the average CPI penalty on all
instructions caused by control hazards.

----------------------------------

b) For this part, assume that you calculated the following CPI
penalties due to branching for each of the processors in (a) (these
are NOT the actual values you will get for (a) -- they are
significantly inflated)

				       branch  		cost
            			       penalty
processor w/ not-taken                  2.50		$50
processor w/ BTFN                       1.25		$60
processor w/ 2-bit dynamic              1.15		$65
processor w/ BTFN + BTB                	1.02		$75
processor w/ 2-bit dynamic + BTB        0.44		$80
processor w/ 2 level                    0.58		$80
processor w/ 2 level + BTB              0.28		$90

Obviously, if the processor has a very high base CPI (CPI not
including the branch penalties calculated above), then the processor
w/ not-taken prediction will have the best cost-performance -- who
cares about a few tenths of a cycle lost due to poor branch prediction
when each instruction takes 10 cycles.

Likewise, for an extremely small base CPI, the processor w/ 2 level +
BTB prediction will have the best cost-performance -- if there were
ten instructions executed per cycle, a tenth of a cycle penalty is a
big deal.

Calculate the range of CPIs (if any) in which each processor has the
best cost-performance.  This question is considerably simplified by
using some kind of plotting program.

These tradeoff points are not meant to reflect actual processors.

----------------------------------
c) (15 bonus marks -- this question is hard!) Without a Branch Target
Buffer (BTB), there is still a penalty for correctly predicted taken
branches.  Therefore, the most efficient prediction algorithm may not
try to maximize its prediction percentage -- rather it will try to
minimize the CPI penalty.

Assume that a piece of hardware can produce the percentage chance that
a branch will be taken (possibly based previous history, possibly
based on compiler hints).  To maximize prediction efficiency, we would
predict the branch taken as soon as this value was above 50%.

After taking account of fetch stream redirect penalties, above what
percentage value should we predict taken to minimize CPI penalty?

------------------------------------------------------------------------------

Question 3: Actual Architectures (60 marks)

We have studied six major issues that a designer must address to build
an effective superscalar dynamically scheduled processor:

	branch prediction
	removing WAR and WAW hazards
	resolving RAW hazards
	executing multiple instructions (ie what are the execution units?)
	maintaining precise interrupts (or not)
	avoiding performance loss due to cache misses

In class or Smith & Sohi, we have examined how these issues are dealt
with in the Alpha 21164, MIPS R10000, AMD K5, and Alpha 21264.

a) Pick one of the Intel Pentium Pro, Cyrix 6x86 (M1), Cyrix M2, or
AMD K6.  For that processor, explain how each of the six issues is
dealt with.

b) Pick one of the Sun UltraSparc, HP PA-8x00, PowerPC 604, or PowerPC
620.  For that processor, explain how each of the six issues is dealt
with.

I am open to other suggestions of processors, but talk to me before
you start writing.

Each of your answers for this question should be in a form similar to
the processor descriptions in Smith & Sohi, but about half the length
(400 - 500 words).  For each answer, please cite your sources.

------------------------------------------------------------------------------

Question 4: Evaluation

For each of the questions 1-3, how long did it take to answer?

Do you have any other comments about the assignment?

------------------------------------------------------------------------------
CPSC 418 Home Page
Last modified: March 97