CPSC 418: Solution 3

Home

Problem 1

(Part 1a)

Indexed Addressing Mode:

         LADDx  0x1000,NULL,R3          ; R3 = i (loop count)
 LOOP_I: ILOAD  (R11,R3),R13            ; R13 = B[i]
         ILOAD  (R14,R3),R16            ; R16 = C[i]
         NOP
         MULx    R13,R16,R10            ; R10 = B[i] * C[i]
         ISTORE  R10,(R8,R3)            ; R10 = A[i] (result)
         LSUBc   R3,1,R3                ; i = i - 1
         JUMPGTE LOOP_I
         NOP

Displacement Addressing Mode:

         LADDx   0x1000, NULL, R3       ; R3  = i (loop count)
 LOOP_D: ADD     R11,    R3,   R12      ; R12 = &B[i]
         DLOAD   0(R12),  R13           ; R13 = M[R12]
         ADD     R14,    R3,   R15      ; R15 = &C[i]
         DLOAD   0(R15),  R16           ; R16 = M[R15]
         NOP
         MULx    R13,    R16,  R10      ; R10 = B[i] * C[i]
         ADD     R8,     R3,   R9       ; R9  = A[i]
         DSTORE  R10,    0(R9)          ; M[R9] = R10 (result)  
         LSUBc   R3,     1,    R3       ; i   = i - 1
         JUMPGTE LOOP_D
         NOP

(Part 1b)

Note: 0x1000 corresponds to 2^11 = 4096

Index Addressing Mode:


        |    #Inst       |   Memory Access
-------------------------------------------------
Static  |      9         |  9+3=12 (see footnote1)
-------------------------------------------------
Dynamic | [1+(4097)8]    |  [1+4097(8+3)]
        | =32777         |  =45068
-------------------------------------------------

Displacement Addressing Mode:


        |    #Inst       |   Memory Access
-------------------------------------------------
Static  |      12        |  12+3=15 (see footnote1)
-------------------------------------------------
Dynamic | [1+(4097)11]   |  [1+(4097)(11+3)]
        | =49168         |  =57359
-------------------------------------------------
footnote1: STORE and LOAD each have 2 memory accesses.

(Part 1c)

Index Addressing requires adding 32-bit numbers, whereas displacement addressing mode has a 32-bit + 16 offset. Therefore, adding indexed addressing increases the length of the clock cycle.

(Part 1d)

For this problem, the total number of cycles can be calculated. Also,

IA: Index Addressing Mode
DA: Displacement Addressing Mode

Total Cycles for Index Addressing:

Total_cycles_IA = (#Arithmetic_Inst)(CPI_Arithmetic) + (#Load_Inst)(CPI_Load) 
                  + (#Store_Inst)(CPI_store) + (#Jump_Inst)(CPI_Jump)
                  + (#NOP)(CPI_NOP)

Total_cycles_IA = (1+2(4097))(1.7) + (2(4097))(2.4) 
                  + (4097)(2.6) + (4097)(2.2) 
                  + (2(4097))(1.5)
Total_cycles_IA = 65553.7

Total Cycles for Displacement Addressing:

Total_cycles_DA = (1+5(4097))(1.7) + (2(4097))(2.4) 
                  + (4097)(2.6) + (4097)(2.2) 
                  + (2(4097))(1.5)
Total_cycles_DA = 86448.4

Since it is given that with addressing mode the clock cycle increases by 10%, then

      clock_IA = clock_DA + 0.10clock_DA
      clock_IA = 1.1clock_DA

CPU_Time_IA = (Total_cycles_IA)(1.1clock_DA)
CPU_Time_IA = (65553.3)(1.1clock_DA)
CPU_Time_IA = 72108.63clock_DA

CPU_Time_DA = (Total_cycles_DA)(clock_DA)
CPU_Time_DA = (86448.4)(clock_DA)
CPU_Time_DA = 86448.4clock_DA

Clearly, Indexed Addressing is faster than Displacement Adressing. But, by how much?

(CPU_Time_DA - CPU_Time_IA) / CPU_Time_IA = (86448.4-72108.63 )/72108.63 = 0.20

The program will be approximately 20% faster with Indexed Addressing than with Displacement Addressing.

(Part 1e)

This solution is not the shortest way to solve the problem, see the note below, but it illustrates algebraically where the mistake in the original solution is.

If we have a program trace with 100 instructions and the instruction mix from Handout 8, we can write an equation for the time for running the program on the indexed machine as:

Time_IA = ( (#Inst_L/S)*(CPI_L/S) + (#Inst_Other)*(CPI_Other) ) * 1.1*Clock_DA

When we go to the machine with just displacement addressing mode, we have to take 14% of #Inst_L/S and replace them with a L/S inst and an Add inst:

Time_DA = ( 0.86*(#Inst_L/S)*(CPI_L/S) + 0.14*(#Inst_L/S)*(CPI_L/S+CPI_Add) + (#Inst_Other)*(CPI_Other) ) * Clock_DA

rewriting to group common terms:

Time_DA = ( (#Inst_L/S)*(CPI_L/S) + 0.14*(#Inst_L/S)*(CPI_Add) + (#Inst_Other)*(CPI_Other) ) * Clock_DA

NOTE: At this point, we could calculate numerical values for Time_DA and Time_IA, but that hides the algebra of what's happening. It's important to always keep in mind exactly what we are solving for and what is the minimal amount of numerical data needed to answer the question.

now calculate difference between Time_IA and Time_DA:

  Time_IA - Time_DA = 
   = ( (#Inst_L/S)*(CPI_L/S) + (#Inst_Other)*(CPI_Other) ) * 1.1*Clock_DA
     -
     ( (#Inst_L/S)*(CPI_L/S) + 0.14*(#Inst_L/S)*(CPI_Add) + 
       (#Inst_Other)*(CPI_Other) 
     ) * Clock_DA

   =

rewriting to group common terms:

     ( (1.1 - 1.0)*(#Inst_L/S)*(CPI_L/S) - 0.14*(#Inst_L/S)*(CPI_Add) +
       (1.1 - 1.0)*(#Int_Other)(CPI_Other)
     ) * Clock_DA
   =

simplify:

     ( 0.1*(#Inst_L/S)*(CPI_L/S) - 0.14*(#Inst_L/S)*(CPI_Add) +
       0.1*(#Inst_Other)*(CPI_Other)
     ) * Clock_DA
   =

diversion to calculate values not in table:

      #Inst_Other = 100 - #Inst_L/S
                  = 100 - 28.8
                  = 71.2

      CPI_Other =
      ((#Inst_Arith)*(CPI_Arith) + (#Inst_Ctl)*(CPI_Ctl) +
       (#Inst_FP)*(CPI_FP) + (#Inst_other)*(CPI_other) ) / (#Inst_Other)
 
      = (35.0*1.7 + 9.8*2.3 + 7.5*3.7 + 18.9*1.5) / 71.2
      = 1.9

now plug numbers into:

 ( 0.1*(#Inst_L/S)*(CPI_L/S) - 0.14*(#Inst_L/S)*(CPI_Add) +
   0.1*(#Inst_Other)*(CPI_Other) ) * Clock_DA

to get:

 ( ( 0.1 * 28.8 * 2.5 ) - ( 0.14 * 28.8 * 1.7 ) +
   ( 0.1 * 71.2 * 1.9 ) ) * Clock_DA

 =
  Time_IA - Time_DA = 13.9

Therefore Time_IA > Time_DA, so displacement addressing is faster.

How much faster?

  ( Time_Slow - Time_Fast ) / Time_Fast 
  = ( Time_IA - Time_DA ) / Time_DA 

     Time_DA = ( (#Inst_L/S)*(CPI_L/S) + 0.14*(#Inst_L/S)*(CPI_Add) + 
                 (#Inst_Other)*(CPI_Other) 
               ) * Clock_DA
             = ( (28.8 * 2.5) + (0.14 * 28.8 * 1.7) + (71.2 * 1.9) ) * Clock_DA
             = 214

  = 13.9 / 214
  = 6.5%

Problem 2

Feldman & Retter Problem 2.5

Unpacking bytes:

                     ____________
     ; Original R2:  |R6|R5|R4|R3|
                     ------------
     ; R3
     SHIFTux  24(NUL),R2,R3     ; left shift R3 by 3 bytes
     SHIFTsx -24(NUL),R3,R3     ; isoloate R3

     ; R4
     SHIFTsx  -8(NUL),R2,R2     ; left shift R2 by a byte so that R4 sits in the place of the original R3.
     SHIFTux  24(NUL),R2,R4     ; Same as R3 again
     SHIFTsx -24(NUL),R4,R4     ; Same as R3 again

     ; R5
     SHIFTsx  -8(NUL),R2,R2     ; Same as R4
     SHIFTux  24(NUL),R2,R5     ; Same as R4
     SHIFTsx -24(NUL),R4,R5     ; Same as R4

     ; R6 
     ; R6 is at this point separated.

Packing bytes:

    ; Construct R2 from separate registers R3,R4,R5,R6
    ; R3
    LANDx  R3,0xFF,R2      ; Mask R3 and store result in R2

    ; R4
    LANDx  R4,0xFF,R4      ; Mask R4
    SHIFTux 8(NUL),R4,R4   ; Shift left R4 by a byte, i.e., isolate R4
    LORx   R4,R2,R2        ; Combine R4 with current R2

    ; R5
    LANDx  R5,0xFF,R5       ; Mask R5
    SHIFTux 16(NUL),R5,R5   ; Shift left R5 by a byte, i.e., isolate R5
    LORx   R5,R2,R2         ; Combine R5 with current R2

    ; R6
    SHIFTux 24(NUL),R6,R6   ; No need to mask the last one, simply shift as for the other ones.
    LORx   R6,R2,R2         ; Combine R6 with R2 to contruct entire R2.

Problem 3

Increase single-issue CPI of all instructions by 5%

More contention for functional units.
More memory conflicts & increase performance of CPU result in memory becoming a bottleneck.
These are not huge factors, so 5% is realistic and 50% is not.

No increase/decrease single-issue CPI of arithmetic instructions by 10 or 50%

In this case, there is no affect on arithmetic operations.

Increase single-issue CPI of loads and stores by 10%

Memory contention. Again, a 10% (and not 50%) increase occurs.

Increase design time

More complex CPU (i.e., hardware).

Slow down the clock rate by 5%

Slow down clock by 5%. Note that if clock slows down by 50% then we are only running half as fast and thus there will be no gain for dual issuing.

Problem 4

(Part 4a)

      2 = (2^0.25)(1.07)(x)   where x = improvement 
           Solving for x, we get x=1.57

Other changes (than the ones specified by the problem) must contribute a 57% performance improvement in order to ensure doubled performance increase every year. Note that this is not completely realistic. Making the clock twice as fast just makes memory a bottleneck. The programs do not run twice as fast.

(Part 4b)

Changes that contribute the most to performance improvements:

Microarchitecture changes (e.g. superscalar: dual issuing instead of single issuing)
Compiler technology (#inst/program)
Increase size and levels of caches

Home

Last modified: 8 Feb 96