CPSC 418: Assignment #2

CPSC 418 Home Page
------------------------------------------------------------------------------
			HOMEWORK ASSIGNMENT #2

Out: Thurs Feb 13/97					Due: Tues Mar 4/97
CS 418							Spring Session 1997
Marks: 100
------------------------------------------------------------------------------

1) Hazards and the simple pipeline

For each of the following, identify the type of hazard and explain how
the basic MIPS pipeline (IF ID ALU MEM WB) deals with that hazard.
For instruction sequences, remember that the first instruction will
only be one stage ahead of the second in the pipeline.  Each
instruction sequence can be considered to be surrounded by
non-interfering instructions.

a)	Both the instruction and data caches are physically addressed;
consequently, an address translation must occur every cycle for the
PC, and on every load or store.  In order to avoid a hazard on every
load or store, the MIPS holds the translation of the last PC in two
special registers -- if the next instruction is on the same page as
the last, there is no need to consult the TLB to fetch the
instruction.  The TLB is thus left free for a possible data
load/store.
	If the next instruction is on a different page, the TLB must
be referenced.  A hazard develops if a load/store is in MEM on the
same cycle.  Characterize this hazard, and how MIPS handles it.

b)	The instruction sequence
		ADD	r2, r3, r4
		ADDI	r5, r2, #-10

c)	The instruction sequence
		SW	r2, 8(r3)
		ADDI	r3, r4, #-10

d)	The instruction sequence
		LW	r2, 8(r3)
		ADDI	r4, r2, #8

e)	The instruction sequence
		ADD	r2, r3, r4
		ADDI	r2, r3, #-12

f) The MIPS branch instruction is allowed to compare two GPRs
(accomplished by subtracting the two and comparing to zero).  This
comparison occurs during ID.  Ignoring PC update, find the hazard in
the following sequence, and explain MIPS's solution:
		ADDI	r4, r5, #10
		BREQ	r2, r3, #0x0010
		
g) The result of a branch is not known until the ID stage.  What kind
of hazard does this introduce, and how is it dealt with?

h) On a taken branch, the PC must be updated.  If the result of a
conditional branch is discovered during ID, then when the branch
passed into the ALU stage the PC could be updated; however, waiting
until the ALU will worsen the hazard from part (g).  What hazard is
introduced by moving the calculation of the new PC back to ID, and how
is it dealt with?

------------------------------------------------------------------------------

2) Dispatch, Issue, Completion & Register Renaming

An imaginary load/store architecture has the following instruction
statistics:
	instruction type	      latency	stages between ID & WB
	basic integer & stores		 1		 2
	loads & branches		 2		 2
	integer multiply		 5		 5
	FP add & subtract		 3		 3
	FP multiply			 5		 5

Latencies are the number of cycles between when the instruction begins
executing, and when another instruction using the first's results may
begin executing.  For example, with the instructions
	multf	r3, r2, r2
	addf	r4, r3, r2
The addf could enter its execute stage 5 cycles after the multf
entered its execute stage (the addf could have been fetched and
decoded during these cycles though):

Cycle -->  1    2    3    4    5    6    7    8    9   10   11
multf      IF   ID  E*1  E*2  E*3  E*4  E*5   WB
addf            IF   ID   -    -    -    -   E+1  E+2  E+3  WB

where - is a stall between dispatch and issue.

Stages between ID & WB shows the number of stages in the execution
pipeline (ie ALU & MEM for basic integer, loads, stores and branches).
Each stage takes one cycle to complete.  If the latency is less than
the stages between ID & WB, then some kind of special hazard avoidance
hardware has been added for that instruction.

For the rest of this question, consider the following code sequence
(for simplicity, I don't differentiate between FP and integer
registers in this question):

	multf	r1, r2, r2
	lf	r3, 64(r12)
	subf	r4, r2, r5
	multf	r6, r1, r1
	multf	r7, r5, r2
	lf	r6, 80(r12)
	beqz	r8, GOOD	; assume this branch is NOT taken (ie r8 != 0)
	subf	r6, r5, r7
GOOD:	addi	r12, r12, #256
	addf	r9, r6, r6

For all the processors, you may assume the simplest branch prediction
(ie predict not taken) which will be correct for the branch above.
Therefore, the branch will not cause a control hazard in any of the
machines below.

----------------------------
a) Identify all the potential data hazards (imagine you tried to
execute all instructions at once).

----------------------------
b) A simple pipeline implementation has three fully pipelined
execution units:
		integer (stages called ALU & MEM)
		FP add & subtract (stages E+1, E+2 & E+3)
		FP or integer multiply (stages E*1 thru E*5)
Instructions can be executing in parallel on these units; however, all
instructions share IF ID and WB.  No two instructions may be in the
same pipeline stage (eg. two instructions could not both be in E+1 at
the same time, nor could they both be in ID or any other single stage).

The pipeline imposes single in-order dispatch and issue (instructions
must enter execution in-order, even if they are entering different
execute pipelines).

i) Of what type is the new hazard that could be introduced by the
sharing of IF ID and WB?


ii) Assume that the processor will also stall issue if it detects
that an instruction will not finish in-order.  For example, in the
following code the add will need to be stalled long enough that it
will finish after the multiply (despite the fact they are
independent instructions):
		multf	r3, r2, r2
		addf	r7, r6, r5

	Cycle -->  1    2    3    4    5    6    7    8    9
	multf      IF   ID  E*1  E*2  E*3  E*4  E*5   WB
	addf            IF   ID   -    -   E+1  E+2  E+3   WB
Note that once issued into an execution pipeline, an instruction will
not stall again.  What types of hazards does this in-order completion
rule remove?

iii) How many cycles does the sequence take to execute with single
in-order dispatch, issue and completion? (show the execution cycle by
cycle, as shown in the two example executions -- it may take many
cycles).  Don't forget that two instructions may not be in the same
stage at the same cycle.

----------------------------
c) A different pipeline allows out-of-order completion.  The
instruction pair from (2 b iii) will now execute as
	Cycle -->  1    2    3    4    5    6    7    8
	multf      IF   ID  E*1  E*2  E*3  E*4  E*5   WB
	addf            IF   ID  E+1  E+2  E+3   WB
However, pairs of instructions with the hazards listed in (2 b ii)
will still need to stall before issue so that they complete in-order
(for that pair) and avoid the hazard.

How many cycles does the sequence take to complete with single
in-order dispatch and issue, but out-of-order completion? (show the
execution)

----------------------------
d) In order to avoid the data hazards, a new processor has implemented
a limited form of register renaming.  Each architected register has
only two shadow registers, and can never point to another architected
register's shadows.  Therefore, an architected register may never have
more than two different values at a time.

For example, architected register r4 has two shadow registers s4a and
s4b.  Even if architected register r5 is currently pointed at s5a and
is not using s5b, r4 may only point to s4a or s4b (and r4 must always
point to one or the other).

Remember that a shadow register cannot be reused until it has been
read for the last time, and the instruction which logically overwrites
it (ie overwrites the architected register to which the shadow was
mapped) has finished WB.

i) Assuming that the logical registers all initially point to the s*a
shadows (where * is the register number), and that all the s*b
registers are on the free list, rewrite the register references in the
code sequence to use the shadow registers.

ii) Identify which hazards cannot be removed and explain why.

iii) The remaining hazards must still complete in-order.  How many
cycles will the new sequence take, assuming in-order dispatch and
issue and out-of-order (except for hazards) completion? (show the
execution)

iv) Where do the architected registers point at the end of the
execution?

----------------------------
e) Imagine that an unlimited number of shadow registers are provided
(for architected register r4, call then s4a, s4b, s4c, s4d, ...)

i) If you were to rewrite the code using these unlimited shadow
registers, would the result be different from (2 d i)?  Why or why
not?

ii) What types of hazards remain?

----------------------------
f) Now we will provide out-of-order issue, probably using reservation
stations.  With unlimited shadow registers, how many cycles will the
new sequence take, assuming in-order dispatch, but out-of-order issue
and (except for remaining hazards) completion? (show the execution)

----------------------------
g) Assuming that no attempt to reorder results prior to commit is
made, at which cycles in this execution stream could an imprecise
interrupt occur?

----------------------------
h) Now imagine a dual-issue processor.  Two IF, ID and WB stages are
provided, which can operate in parallel; the execute stages remain the
same (ie you may begin execution of an integer instruction and an FP
add at the same time, but not two integer instructions).  Unlimited
shadow registers are provided.

With in-order dispatch and out-of-order issue and completion (except
for remaining hazards), how many cycles would the new sequence take to
execute?  (show the execution)

------------------------------------------------------------------------------

3) Reorder Buffer and Reservation Stations

Consider the following sequence of instructions:

	loadf	f4, 0(r4)
	loadf	f5, 0(r5)
	multf	f3, f4, f5
	addf	f6, f3, f4
	divf	f4, f7, f5
	subf	f7, f7, f5	

For this question, the machine can dispatch a single instruction per
cycle.  There are 3 execution pipelines (integer, FP add/sub, FP
mult/div), which have the following depths (each of the three is fully
pipelined, and each can support issue of one instr/cycle):

instr. type		pipeline	execution cycles
loads			integer			2
FP add & sub		FP a/s			2
FP mult			FP m/d			10
FP div			FP m/d			40

A reorder buffer is used for register renaming and to maintain precise
interrupts.  Reservation stations are used between dispatch and the
end of execution to keep track of missing operands.


When the second load has completed execution, internal tables are in
the following state:

INSTRUCTION STATUS:
Instruction          Dispatch Issue  Execute  Commit
 loadf			X	X	X	X
 loadf			X	X	X
 multf			X	
 addf			X
 divf
 subf

RESERVATION STATIONS:
Name          Busy	Op    source1	data1 valid1  source2	data2 valid2
FP a/s1		X	addf	f4	(f4)	X	rob1		
FP a/s2
FP a/s3
FP m/d1		X	multf	rob0			f4	(f4)	X
FP m/d2



REGISTER FILE: (either a register is valid, or it points to a rob entry)
Register      Valid   ROB Entry
  f3			rob1
  f4		X
  f5			rob0
  f6			rob2
  f7		X

REORDER BUFFER: (source is a reservation station)
head: rob0 (next instruction to commit)
tail: rob3 (slot for next instruction to dispatch)
Entry  Busy   Valid	Source
 rob0	X	X	(load2)
 rob1	X		(mult1)
 rob2	X		(add1)
 rob3


Notes:

1) The integer pipeline reservation stations have not been shown --
they have just emptied with the completion of the second load and will
not be used again.

2) The instruction status table is not actually implemented in the
hardware -- equivalent data is distributed in the other tables and the
PC.  Its inclusion makes answering and marking easier, so fill it out.

3) Successful dispatch requires an appropriate free reservation
station (to hold operands) and a free reorder buffer entry (as a
holding space for the result).  If either are not available, an
instruction will stall in dispatch until both become available.

4) Issue requires that both an instruction's operands be valid.  Any
instruction in a reservation station with two valid operands may
issue.

5) An instruction holds on to its reservation station until it
finishes executing and sets the Valid flag in its rob entry.  For
example, the second load has just set the Valid flag in rob0, and so
it has released its reservation station.  Next cycle, the multf will
be able to issue (its second operand has just become valid), but it
will not release its reservation station for 10 cycles (when it will
complete).

6) Remember that the reorder buffer is managed round-robin; the next
entry after rob3 is rob0.

7) When the entry at the head of the reorder buffer is valid, it is
committed and the next entry becomes the head.


Fill out a similar set of 4 tables at each of the following points of
execution.

a) When the subf has just dispatched.

b) When the subf has just finished execution.

c) When the addf has just committed.

------------------------------------------------------------------------------
CPSC 418 Home Page
Last modified: Feb 97