CPSC 418 --- Advanced Computer Architecture --- Spring 1996
Solution 10
General Information for This Homework
/* A : array [0 to 1023] of int (ints are 32 bits) */
for i = 0 to 1023 do
A[i] = 3*A[i+1];
Problem 1 (2 points)
Write the assembly code for both the Speed Demon and Brainiac processors.
Be sure to use your registers effectively and schedule the code for
optimum performance. Utilize the added features of the Brainiac when
they will increase performance.
Speed Demon
There are many different solutions. Two are shown here.
First possibility: direct translation.
; R1 : loop count
; R2 : &A (base address)
; R3 : &A[i]
; R6 : tmp
1000 LORx 0 NULL R1 ; initialize loop counter
loop : 1001 DLOAD 1(R3) R6 ; A[i+1] -> R6
1002 ADDx R1 R2 R3 ; &A[i] -> R3
1003 LMULx R6 3 R6 ; 3*A[i+1] -> R6
1004 DSTORE R6 (R3) ; 3*A[i+1] -> &R3
1005 LADDx R1 1 R1 ; increment loop count
1006 LSUBx R1 1023 NULL ; test termination
1007 JUMPLE loop
Second possiblity: rewrite the loop to decrement the counter:
cur = A[1024];
for i = 1023 to 0 do {
next = A[i];
A[i] = 3*cur;
cur = next;
}
which gives us the assembly code:
; R1 : loop count
; R2 : &A (base address)
; R3 : &A[i]
; R6 : cur
; R7 : next
1000 LORx 1023 NULL R1 ; initialize loop counter
1001 DLOAD 1024(R2) R6 ; A[1024] -> R6
loop : 1002 LMULx R6 3 R6 ; 3*A[i+1] -> R6
1003 ADDx R1 R2 R3 ; &A[i] -> R3
1004 DLOAD (R3) R7 ; A[i] -> R7
1005 DSTORE R6 (R3) ; 3*A[i+1] -> &R3
1006 ORx R7 NULL R6 ; R7 (next) -> R5 (cur)
1007 LSUBc R1 1 R1 ; test termination
1008 JUMPGE loop
The loop bodies of each possibility have 7 instructions. If we
increment the loop count (first possibility), we save one register,
so we choose that one.
Brainiac
For the brainiac, we can use indexed addressing mode and/or the JUMPCT
instruction. If we use JUMPCT, we will have to decrement the loop
counter:
; RBR : loop count
; R2 : &A (base address)
; R6 : cur
; R7 : next
1000 DLOAD 1024(R2) R6 ; A[1024] -> R6
1001 LOADRBR 1023 ; initialize loop counter
loop : 1002 LMULx R6 3 R6 ; 3*A[i+1] -> R6
1003 ILOAD (R2,RBR) R7 ; A[i] -> R7
1004 ISTORE R6 (R2,RBR) ; 3*A[i+1] -> A[i]
1005 ORx R7 NULL R6 ; R7 (next) -> R5 (cur)
1006 JUMPCT loop
This implementation of the loop body only has 5 instructions and uses
four registers, so we have saved two instructions and one register.
However, the ILOAD and ISTORE instructions spend two cycles in
Execute, rather than one. Thus the loop body will most likely require
7 cycles, just like for the speed demon. We do save one register
though, which is a good thing.
Problem 2 (8 points)
Draw execution tables for the initiation of the loop (setting up
register values), one iteration of the loop, and the conclusion of the
loop (testing loop count when it is done with the array) for each of
the two code fragments. Assume that all memory accesses in these
iterations get level-one cache hits.
Speed Demon
Brainiac
Problem 3 (4 points)
Calculate the average CPI for this loop (the full 1024 iterations) for
each microprocessor.
Formula
CPI =
= (T_Init+(#Iterations-1)*(T_Body)+T_Last) /
(#Inst_Init+(#Iterations-1)*(#Inst_Body)+#Inst_Last)
Speed Demon
CPI =
= (1 + (1024-1)*(9) + (11)) / (1 + (1024-1)*(7) + 7)
= 1.3
Brainiac
CPI =
= (2 + (1024-1)*(8) + (7)) / (2 + (1024-1)*(5) + 5)
= 1.6
Problem 4 (3 points)
What are the costs (e.g. where would extra hardware be needed, what
would increase the cycle time and why) of each of the Brainiac's
features. Be specific.
- Indexed addressing mode
-
- Performance cost: extra cycle in Execute stage
- Hardware cost: more complex control circuitry to allow
some Load/Store instructions to remain in Execute for two cycles
- RBR and JUMPCT
-
- Hardware cost: register for RBR, adder to decrement RBR, WAR
circuitry to detect JUMPCT or LOADRBR following an
instruction that reads from RBR
- Performance cost: the decrementer for RBR might not
increase cycle time, because it is much simpler than the
full ALU in Execute. The WAR circuitry might be complex
enough and the wires connecting RBR to EX might be long
enough that this would increase the cycle time. So, there
might not be any performance cost, or we could see a slight increase
in cycle time.
Problem 5 (2 points)
Is this example a good benchmark for comparing the Brainiac and Speed
Demon? Justify your answer.
No, this is not a good benchmark.
- The benchmark program is too short and the instruction mix does
not match that of most real programs.
- In the benchmark we assumed that everything we needed was in the
level-1 caches. Thus, we did not test the memory system of
either machine.
- Problem 3 only asked us to compare CPI. The only accurate
comparison is total execution time, which takes into account CPI,
instruction count, and cycle time.
Problem 6 (3 Points)
Can different branch-prediction techniques be combined?
If yes, which techniques can be used together and why?
If not, why not?
There are three broad categories of branch-prediction techniques:
- Static: always predict-taken or not-taken (or predict
based on sign of offset, as in Problem 7).
- Software: compiler calculates predict-taken/not-taken bits
in instruction
- History: branch-target-buffer maintains history of
behaviour of recently encountered branches
Conditional branches will occasionally miss in the branch-target
buffer (just like a cache miss). When this occurs, we need a backup
plan for how to predict whether to take the branch. Thus,
a history mechanism must be combined with either static prediction or
software-prediction.
All three of these techniques are typically implemented so that the
prediction is done in the Decode stage. Thus the only way to combine
different techniques is to give one technique priority over another.
Branch-history mechanisms are the most accurate prediction technique,
followed by software (assuming an intelligent compiler), followed by
static. Branch-history prediction must be combined with another
technique, and so we only need to decide if we can combine software
with static prediction. A software implementation could provide three
different flags: predict-taken, predict-not-taken, and no-prediction.
But it is not clear that this would be any gain to including the third
option, because static prediction could be emulated by just making the
software hint be predict-not-taken.
Conclusion: history mechanism must be combined with another technique.
Other combinations do not provide any increased performance or functionality.
Problem 7 (3 Points)
| Forward | Backward
|
---|
%Jump | 75% | 25%
|
---|
%Taken | 50% | 90%
|
---|
Problem 7(a) (1 Points)
Why does predicting backwards branches to be taken, as
opposed to predicting all branches as not-taken, result in a lower
total branch penalty? (where total branch penalty = branch penalty +
mispredict penalty) Justify your answer in terms of the behaviour
of typical programs.
Most backwards branches are caused by loops, and loops are typically
iterated through many times. Thus most backwards branches will be
taken.
Predicting a branch to be taken incurs a penalty for the
instructions that have entered the pipeline before we recognize the
branch and compute its destination. However, this penalty is
typically much less than the penalty for mispredicting a branch. So,
if we can predict branches sufficiently accurately, it is better to
predict them as taken and incur the predict-taken penalty on every
branch and the mispredict penalty on only a small number of branches.
Problem 7(b) (2 Points)
How much smaller is the total branch
penalty for predicting a branch based on the sign of the offset as
compared to predicting all branches as not-taken?
From Handout 14:
Pen_Tot = Pen_Bra + (%MisPred)(Pen_MisPred)
Also from the handout:
Penalty for predicting not-taken (Pen_NT) = 1.2
Penalty for predicting based on sign (Pen_SGN)
Pen_SGN =
= (%Fwd)*(%MisPred_Fwd)*(Pen_MisPred) +
(%Bwd)*(Pen_Bra + (%MisPred_Bwd)*(Pen_MisPred))
= (0.75)*(0.50)*(2) + (0.25)*(1 + (0.10)*(2))
= 1.05
Pen_SGN is n% less than Pen_NT =
= (Pen_NT - Pen_SGN) / Pen_SGN
= (1.2 - 1.05)/1.05
= 14%
Last modified: 27 Mar 96