Prev | Home | Next |
LADDx 0x1000,NULL,R3 ; R3 = i (loop count) LOOP_I: ILOAD (R11,R3),R13 ; R13 = B[i] ILOAD (R14,R3),R16 ; R16 = C[i] NOP MULx R13,R16,R10 ; R10 = B[i] * C[i] ISTORE R10,(R8,R3) ; R10 = A[i] (result) LSUBc R3,1,R3 ; i = i - 1 JUMPGTE LOOP_I NOPDisplacement Addressing Mode:
LADDx 0x1000, NULL, R3 ; R3 = i (loop count) LOOP_D: ADD R11, R3, R12 ; R12 = &B[i] DLOAD 0(R12), R13 ; R13 = M[R12] ADD R14, R3, R15 ; R15 = &C[i] DLOAD 0(R15), R16 ; R16 = M[R15] NOP MULx R13, R16, R10 ; R10 = B[i] * C[i] ADD R8, R3, R9 ; R9 = A[i] DSTORE R10, 0(R9) ; M[R9] = R10 (result) LSUBc R3, 1, R3 ; i = i - 1 JUMPGTE LOOP_D NOP
| #Inst | Memory Access ------------------------------------------------- Static | 9 | 9+3=12 (see footnote1) ------------------------------------------------- Dynamic | [1+(4097)8] | [1+4097(8+3)] | =32777 | =45068 -------------------------------------------------
| #Inst | Memory Access ------------------------------------------------- Static | 12 | 12+3=15 (see footnote1) ------------------------------------------------- Dynamic | [1+(4097)11] | [1+(4097)(11+3)] | =49168 | =57359 ------------------------------------------------- footnote1: STORE and LOAD each have 2 memory accesses.
IA: Index Addressing Mode DA: Displacement Addressing ModeTotal Cycles for Index Addressing:
Total_cycles_IA = (#Arithmetic_Inst)(CPI_Arithmetic) + (#Load_Inst)(CPI_Load) + (#Store_Inst)(CPI_store) + (#Jump_Inst)(CPI_Jump) + (#NOP)(CPI_NOP) Total_cycles_IA = (1+2(4097))(1.7) + (2(4097))(2.4) + (4097)(2.6) + (4097)(2.2) + (2(4097))(1.5) Total_cycles_IA = 65553.7Total Cycles for Displacement Addressing:
Total_cycles_DA = (1+5(4097))(1.7) + (2(4097))(2.4) + (4097)(2.6) + (4097)(2.2) + (2(4097))(1.5) Total_cycles_DA = 86448.4Since it is given that with addressing mode the clock cycle increases by 10%, then
clock_IA = clock_DA + 0.10clock_DA clock_IA = 1.1clock_DA CPU_Time_IA = (Total_cycles_IA)(1.1clock_DA) CPU_Time_IA = (65553.3)(1.1clock_DA) CPU_Time_IA = 72108.63clock_DA CPU_Time_DA = (Total_cycles_DA)(clock_DA) CPU_Time_DA = (86448.4)(clock_DA) CPU_Time_DA = 86448.4clock_DAClearly, Indexed Addressing is faster than Displacement Adressing. But, by how much?
(CPU_Time_DA - CPU_Time_IA) / CPU_Time_IA = (86448.4-72108.63 )/72108.63 = 0.20The program will be approximately 20% faster with Indexed Addressing than with Displacement Addressing.
If we have a program trace with 100 instructions and the instruction mix from Handout 8, we can write an equation for the time for running the program on the indexed machine as:
Time_IA = ( (#Inst_L/S)*(CPI_L/S) + (#Inst_Other)*(CPI_Other) ) * 1.1*Clock_DA
When we go to the machine with just displacement addressing mode, we have to take 14% of #Inst_L/S and replace them with a L/S inst and an Add inst:
Time_DA = ( 0.86*(#Inst_L/S)*(CPI_L/S) + 0.14*(#Inst_L/S)*(CPI_L/S+CPI_Add) + (#Inst_Other)*(CPI_Other) ) * Clock_DArewriting to group common terms:
Time_DA = ( (#Inst_L/S)*(CPI_L/S) + 0.14*(#Inst_L/S)*(CPI_Add) + (#Inst_Other)*(CPI_Other) ) * Clock_DA
NOTE: At this point, we could calculate numerical values for Time_DA and Time_IA, but that hides the algebra of what's happening. It's important to always keep in mind exactly what we are solving for and what is the minimal amount of numerical data needed to answer the question.now calculate difference between Time_IA and Time_DA:
Time_IA - Time_DA = = ( (#Inst_L/S)*(CPI_L/S) + (#Inst_Other)*(CPI_Other) ) * 1.1*Clock_DA - ( (#Inst_L/S)*(CPI_L/S) + 0.14*(#Inst_L/S)*(CPI_Add) + (#Inst_Other)*(CPI_Other) ) * Clock_DA =rewriting to group common terms:
( (1.1 - 1.0)*(#Inst_L/S)*(CPI_L/S) - 0.14*(#Inst_L/S)*(CPI_Add) + (1.1 - 1.0)*(#Int_Other)(CPI_Other) ) * Clock_DA =simplify:
( 0.1*(#Inst_L/S)*(CPI_L/S) - 0.14*(#Inst_L/S)*(CPI_Add) + 0.1*(#Inst_Other)*(CPI_Other) ) * Clock_DA =
diversion to calculate values not in table:now plug numbers into:#Inst_Other = 100 - #Inst_L/S = 100 - 28.8 = 71.2 CPI_Other = ((#Inst_Arith)*(CPI_Arith) + (#Inst_Ctl)*(CPI_Ctl) + (#Inst_FP)*(CPI_FP) + (#Inst_other)*(CPI_other) ) / (#Inst_Other) = (35.0*1.7 + 9.8*2.3 + 7.5*3.7 + 18.9*1.5) / 71.2 = 1.9
( 0.1*(#Inst_L/S)*(CPI_L/S) - 0.14*(#Inst_L/S)*(CPI_Add) + 0.1*(#Inst_Other)*(CPI_Other) ) * Clock_DAto get:
( ( 0.1 * 28.8 * 2.5 ) - ( 0.14 * 28.8 * 1.7 ) + ( 0.1 * 71.2 * 1.9 ) ) * Clock_DA = Time_IA - Time_DA = 13.9Therefore Time_IA > Time_DA, so displacement addressing is faster.
How much faster?
( Time_Slow - Time_Fast ) / Time_Fast = ( Time_IA - Time_DA ) / Time_DA Time_DA = ( (#Inst_L/S)*(CPI_L/S) + 0.14*(#Inst_L/S)*(CPI_Add) + (#Inst_Other)*(CPI_Other) ) * Clock_DA = ( (28.8 * 2.5) + (0.14 * 28.8 * 1.7) + (71.2 * 1.9) ) * Clock_DA = 214 = 13.9 / 214 = 6.5%
____________ ; Original R2: |R6|R5|R4|R3| ------------ ; R3 SHIFTux 24(NUL),R2,R3 ; left shift R3 by 3 bytes SHIFTsx -24(NUL),R3,R3 ; isoloate R3 ; R4 SHIFTsx -8(NUL),R2,R2 ; left shift R2 by a byte so that R4 sits in the place of the original R3. SHIFTux 24(NUL),R2,R4 ; Same as R3 again SHIFTsx -24(NUL),R4,R4 ; Same as R3 again ; R5 SHIFTsx -8(NUL),R2,R2 ; Same as R4 SHIFTux 24(NUL),R2,R5 ; Same as R4 SHIFTsx -24(NUL),R4,R5 ; Same as R4 ; R6 ; R6 is at this point separated.Packing bytes:
; Construct R2 from separate registers R3,R4,R5,R6 ; R3 LANDx R3,0xFF,R2 ; Mask R3 and store result in R2 ; R4 LANDx R4,0xFF,R4 ; Mask R4 SHIFTux 8(NUL),R4,R4 ; Shift left R4 by a byte, i.e., isolate R4 LORx R4,R2,R2 ; Combine R4 with current R2 ; R5 LANDx R5,0xFF,R5 ; Mask R5 SHIFTux 16(NUL),R5,R5 ; Shift left R5 by a byte, i.e., isolate R5 LORx R5,R2,R2 ; Combine R5 with current R2 ; R6 SHIFTux 24(NUL),R6,R6 ; No need to mask the last one, simply shift as for the other ones. LORx R6,R2,R2 ; Combine R6 with R2 to contruct entire R2.
2 = (2^0.25)(1.07)(x) where x = improvement Solving for x, we get x=1.57Other changes (than the ones specified by the problem) must contribute a 57% performance improvement in order to ensure doubled performance increase every year. Note that this is not completely realistic. Making the clock twice as fast just makes memory a bottleneck. The programs do not run twice as fast.
Prev | Home | Next |