CPSC 418: Solution 10

Home

Suggestion Box

CPSC 418 --- Advanced Computer Architecture --- Spring 1996
Solution 10

General Information for This Homework

/* A : array [0 to 1023] of int (ints are 32 bits) */ for i = 0 to 1023 do A[i] = 3*A[i+1];

Problem 1 (2 points)

Write the assembly code for both the Speed Demon and Brainiac processors. Be sure to use your registers effectively and schedule the code for optimum performance. Utilize the added features of the Brainiac when they will increase performance.

Speed Demon

There are many different solutions. Two are shown here.

First possibility: direct translation.

; R1 : loop count ; R2 : &A (base address) ; R3 : &A[i] ; R6 : tmp 1000 LORx 0 NULL R1 ; initialize loop counter loop : 1001 DLOAD 1(R3) R6 ; A[i+1] -> R6 1002 ADDx R1 R2 R3 ; &A[i] -> R3 1003 LMULx R6 3 R6 ; 3*A[i+1] -> R6 1004 DSTORE R6 (R3) ; 3*A[i+1] -> &R3 1005 LADDx R1 1 R1 ; increment loop count 1006 LSUBx R1 1023 NULL ; test termination 1007 JUMPLE loop Second possiblity: rewrite the loop to decrement the counter: cur = A[1024]; for i = 1023 to 0 do { next = A[i]; A[i] = 3*cur; cur = next; } which gives us the assembly code: ; R1 : loop count ; R2 : &A (base address) ; R3 : &A[i] ; R6 : cur ; R7 : next 1000 LORx 1023 NULL R1 ; initialize loop counter
1001 DLOAD 1024(R2) R6 ; A[1024] -> R6 loop : 1002 LMULx R6 3 R6 ; 3*A[i+1] -> R6 1003 ADDx R1 R2 R3 ; &A[i] -> R3 1004 DLOAD (R3) R7 ; A[i] -> R7 1005 DSTORE R6 (R3) ; 3*A[i+1] -> &R3 1006 ORx R7 NULL R6 ; R7 (next) -> R5 (cur) 1007 LSUBc R1 1 R1 ; test termination 1008 JUMPGE loop The loop bodies of each possibility have 7 instructions. If we increment the loop count (first possibility), we save one register, so we choose that one.

Brainiac

For the brainiac, we can use indexed addressing mode and/or the JUMPCT instruction. If we use JUMPCT, we will have to decrement the loop counter: ; RBR : loop count ; R2 : &A (base address) ; R6 : cur ; R7 : next 1000 DLOAD 1024(R2) R6 ; A[1024] -> R6 1001 LOADRBR 1023 ; initialize loop counter loop : 1002 LMULx R6 3 R6 ; 3*A[i+1] -> R6 1003 ILOAD (R2,RBR) R7 ; A[i] -> R7 1004 ISTORE R6 (R2,RBR) ; 3*A[i+1] -> A[i] 1005 ORx R7 NULL R6 ; R7 (next) -> R5 (cur) 1006 JUMPCT loop This implementation of the loop body only has 5 instructions and uses four registers, so we have saved two instructions and one register. However, the ILOAD and ISTORE instructions spend two cycles in Execute, rather than one. Thus the loop body will most likely require 7 cycles, just like for the speed demon. We do save one register though, which is a good thing.

Problem 2 (8 points)

Draw execution tables for the initiation of the loop (setting up register values), one iteration of the loop, and the conclusion of the loop (testing loop count when it is done with the array) for each of the two code fragments. Assume that all memory accesses in these iterations get level-one cache hits.

Speed Demon

Brainiac

Problem 3 (4 points)

Calculate the average CPI for this loop (the full 1024 iterations) for each microprocessor.

Formula

CPI =
= (T_Init+(#Iterations-1)*(T_Body)+T_Last) / (#Inst_Init+(#Iterations-1)*(#Inst_Body)+#Inst_Last)

Speed Demon

CPI =
= (1 + (1024-1)*(9) + (11)) / (1 + (1024-1)*(7) + 7)
= 1.3

Brainiac

CPI =
= (2 + (1024-1)*(8) + (7)) / (2 + (1024-1)*(5) + 5)
= 1.6

Problem 4 (3 points)

What are the costs (e.g. where would extra hardware be needed, what would increase the cycle time and why) of each of the Brainiac's features. Be specific.

Indexed addressing mode

Performance cost: extra cycle in Execute stage
Hardware cost: more complex control circuitry to allow some Load/Store instructions to remain in Execute for two cycles

RBR and JUMPCT

Hardware cost: register for RBR, adder to decrement RBR, WAR circuitry to detect JUMPCT or LOADRBR following an instruction that reads from RBR
Performance cost: the decrementer for RBR might not increase cycle time, because it is much simpler than the full ALU in Execute. The WAR circuitry might be complex enough and the wires connecting RBR to EX might be long enough that this would increase the cycle time. So, there might not be any performance cost, or we could see a slight increase in cycle time.

Problem 5 (2 points)

Is this example a good benchmark for comparing the Brainiac and Speed Demon? Justify your answer.

No, this is not a good benchmark.

The benchmark program is too short and the instruction mix does not match that of most real programs.
In the benchmark we assumed that everything we needed was in the level-1 caches. Thus, we did not test the memory system of either machine.
Problem 3 only asked us to compare CPI. The only accurate comparison is total execution time, which takes into account CPI, instruction count, and cycle time.

Problem 6 (3 Points)

Can different branch-prediction techniques be combined? If yes, which techniques can be used together and why? If not, why not?

There are three broad categories of branch-prediction techniques:

Static: always predict-taken or not-taken (or predict based on sign of offset, as in Problem 7).
Software: compiler calculates predict-taken/not-taken bits in instruction
History: branch-target-buffer maintains history of behaviour of recently encountered branches

Conditional branches will occasionally miss in the branch-target buffer (just like a cache miss). When this occurs, we need a backup plan for how to predict whether to take the branch. Thus, a history mechanism must be combined with either static prediction or software-prediction.

All three of these techniques are typically implemented so that the prediction is done in the Decode stage. Thus the only way to combine different techniques is to give one technique priority over another. Branch-history mechanisms are the most accurate prediction technique, followed by software (assuming an intelligent compiler), followed by static. Branch-history prediction must be combined with another technique, and so we only need to decide if we can combine software with static prediction. A software implementation could provide three different flags: predict-taken, predict-not-taken, and no-prediction. But it is not clear that this would be any gain to including the third option, because static prediction could be emulated by just making the software hint be predict-not-taken.

Conclusion: history mechanism must be combined with another technique. Other combinations do not provide any increased performance or functionality.

Problem 7 (3 Points)

	Forward	Backward
%Jump	75%	25%
%Taken	50%	90%

Problem 7(a) (1 Points)

Why does predicting backwards branches to be taken, as opposed to predicting all branches as not-taken, result in a lower total branch penalty? (where total branch penalty = branch penalty + mispredict penalty) Justify your answer in terms of the behaviour of typical programs.

Most backwards branches are caused by loops, and loops are typically iterated through many times. Thus most backwards branches will be taken.

Predicting a branch to be taken incurs a penalty for the instructions that have entered the pipeline before we recognize the branch and compute its destination. However, this penalty is typically much less than the penalty for mispredicting a branch. So, if we can predict branches sufficiently accurately, it is better to predict them as taken and incur the predict-taken penalty on every branch and the mispredict penalty on only a small number of branches.

Problem 7(b) (2 Points)

How much smaller is the total branch penalty for predicting a branch based on the sign of the offset as compared to predicting all branches as not-taken?

From Handout 14:
Pen_Tot = Pen_Bra + (%MisPred)(Pen_MisPred)

Also from the handout:
Penalty for predicting not-taken (Pen_NT) = 1.2

Penalty for predicting based on sign (Pen_SGN)

Pen_SGN =
= (%Fwd)*(%MisPred_Fwd)*(Pen_MisPred) + (%Bwd)*(Pen_Bra + (%MisPred_Bwd)*(Pen_MisPred))
= (0.75)*(0.50)*(2) + (0.25)*(1 + (0.10)*(2))
= 1.05

Pen_SGN is n% less than Pen_NT =
= (Pen_NT - Pen_SGN) / Pen_SGN
= (1.2 - 1.05)/1.05
= 14%

Prev
Home
Next
Suggestion Box
Last modified: 27 Mar 96