CPSC 418 --- Advanced Computer Architecture --- Spring 1996
Homework 10 --- Due: 26 Mar 1996
General Information for This Homework
In this homework we will consider the following fragment of
pseudo-code:
/* A: array [0 to 1023] of int (ints are 32 bits) */
for i = 0 to 1023 do
A[i] = 3*A[i+1];
You will hand-compile the code for two different architectures and
compare the execution times for an implementation of each
architecture. The two architectures represent ``Speed Demon''
(e.g. DEC Alpha) and ``Brainiac'' (e.g. IBM Power) approaches to Risc
architectures. The pipelines for the two microprocessors are shown in
Figures 1 (speed demon) and 2 (brainiac).
Speed Demon Pipeline
Brainiac Pipeline
Information Common to Both Microprocessors
- Pipelines
-
- All ``normal'' arithmetic and logical instructions (e.g. ADD, SUB, XOR,
MULT, etc) use the Execute stage and then go directly to WriteBack.
-
Divide (DIV) instructions use Execute, and then spend two cycles in the
Divide stage, before going to WriteBack.
-
Load instructions instructions use Execute and then the
Load/Store stage before going to WriteBack.
- Store instructions use Execute and then the
Load/Store stage. They do not use WriteBack.
- Each stage can contain at most one instruction.
- ``BY1'' is the register for the bypass from the output of
Execute to its input.
- ``BY2'' is the register for the bypass from the output of
Load/Store to the input of Execute.
- ``S1'' and ``S2'' are the registers where the register file
puts the values currently stored in the registers corresponding
to the operands in the FetchOps stage.
- There is no bypass path from the output of Divide to the
input of Execute. Divide instructions must be written
to the register file before they can be used.
- Branch prediction occurs in the Decode stage of
the pipeline. Therefore, if a branch is predicted to
be taken, the instruction in the Fetch Instruction
stage is nullified (killed) and the PC is updated to the branch
destination. This is independent of
nullifying instructions after a branch mispredict.
- Memory
-
- A level one cache hit has a latency of one cycle (instructions
spend one cycle in Load/Store).
- All of the instructions are in the I cache.
- The array A starts at address 0x2000 and
is in the level 1 cache.
- Registers
-
For both architectures, assume that the code fragment for
this homework can use seven registers. (Other registers are
allocated for other variables, function arguments
and return values, stack pointer, etc).
``Speed Demon'' Architecture and Microprocessor
- Branch instructions test a general register
(e.g. JUMPEQ R3, 300)
- Branches are resolved in the Execute stage.
- Branches with negative offsets (earlier locations in code)
are predicted to be taken.
- Branches with positive offsets
(later locations in code) are predicted not taken.
- All Load/Store instructions use displacement addressing mode
(e.g. 100(R3) = M[100+R3]).
- Processor speed = 300MHz
``Brainiac'' Architecture and Microprocessor
- Load/Store instructions use either displacement (e.g.
100(R3) = M[100+R3]) or
indexed
(e.g. 100(R3, R4) = M[100+R3+R4]) addressing modes.
Load/Stores using indexed addressing mode spend two cycles
in Execute.
- RBR is a special decrement-and-test register, which is used
with the JUMPCT instruction. The instruction JUMPCT
imm
decrements RB and branches to PC+imm if RBR is
non-negative after decrementing. RBR
is decremented and tested in the Decode stage and the branch is
resolved in this stage. Your code fragment may use RBR,
and this counts as one of the seven registers available.
- There is a special load-rbr-immediate
instruction (LOADBR), which is used to load RBR with a loop
count at the beginning of a loop. This load is done in the
Decode stage.
- RBR can be used as a source operand to Execute, but not
as a destination.
- Branches are statically predicted to be not-taken.
- Processor speed = 166MHz
Problem 1 (2 points)
Write the assembly code for both the Speed Demon and Brainiac processors.
Be sure to use your registers effectively and schedule the code for
optimum performance. Utilize the added features of the Brainiac when
they will increase performance.
Problem 2 (8 points)
Draw execution tables for the initiation of the loop (setting up
register values), one iteration of the loop, and the conclusion of the
loop (testing loop count when it is done with the array) for each of
the two code fragments. Assume that all memory accesses in these
iterations get level-one cache hits.
NOTE You may draw simplified tables (i.e. don't show how the
instructions are decoded, etc) if you wish. But, make sure that there
is sufficient information to give you partial credit in case you don't
get everything correct.
Problem 3 (4 points)
Calculate the average CPI for this loop (the full 1024 iterations) for
each microprocessor.
Problem 4 (3 points)
What are the costs (e.g. where would extra hardware be needed, what
would increase the cycle time and why) of each of the Brainiac's
features. Be specific.
Problem 5 (2 points)
Is this example a good benchmark for comparing the Brainiac and Speed
Demon? Justify your answer.
Problem 6 (3 Points)
Can different branch-prediction techniques be combined?
If yes, which techniques can be used together and why?
If not, why not?
Problem 7 (3 Points)
| Forward | Backward
|
---|
%Jump | 75% | 25%
|
---|
%Taken | 50% | 90%
|
---|
Problem 7(a) (1 Points)
Why does predicting backwards branches to be taken, as
opposed to predicting all branches as not-taken, result in a lower
total branch penalty? (where total branch penalty = branch penalty +
mispredict penalty) Justify your answer in terms of the behaviour
of typical programs.
Problem 7(b) (2 Points)
How much smaller is the total branch
penalty for predicting a branch based on the sign of the offset as
compared to predicting all branches as not-taken?
Problem 8 (1 Point)
How much time did you spend total on the assignment and on each of
these problems? Which problems were useful or useless? How well
did the lectures and text book prepare you for this homework?
Which concepts needed better explanation?
Last modified: 18 Mar 96