Homework 3 Solutions

1. For architecture, remember that all of the Intex x86 microprocessors are in the same architecture. The 80486 and the Pentium have different numbers of pins, etc, but support the same assembly language instructions. Also, features defined by the architecture affect the correctness of a program, while features defined by an implementation affect the performance of a program. The following features are architecture defined:

Number of bits per instruction
Addressing modes
Number of registers
Size of virtual memory
The maximum size of virtual memory is usually the total amount of memory that a program can access. But operating systems may limit an individual program so that it can only address a portion of memory and the memory manager may also affect how a program accesses memory.
Big-endian -vs- little endian
Some architectures are either big-endian or little endian, others define methods for selecting which mode the microprocessor is in.

Leaving the following as implementation defined:

Number of cache levels
Size of caches
Size of main memory
Latency for a floating-point divide
Average CPI for all instructions
Processor speed
Pipelined -vs- non-pipelined
Superscalar -vs- non-superscalar
Number of pins on the chip

2. This question may have caused a bit of confusion, depending upon the interpretation of indexed addressing mode and accessing the array. Each interpretation affects the solution for part (a) and the answers for the rest of the parts. Therefore, there are three solutions listed for each section of the problem and you should follow the same solution number all of the way through.

(a)

Solution #1

Indexed addressing as (R1)(R2) = M[R1+R2] and i as already shifted over two bits to the left so that it is indexing on 8 byte increments.

Method 1

 
                   ; comment       | cycles  
ADD R3, (R1)(R2)   ; R3 = x + A[i] | 5
                            total  = 5

Method 2

 
                   ; comment       | cycles  
ADD R2, R1         ; R2 = A + i    | 1 
ADD R3, (R2)       ; R3 = x + A[i] | 4
                            total  = 5

Solution #2

Indexed addressing as (R1)(R2) = M[R1+R2] and i as the array index into A, so that it needs to be shifted two bits to the left in order to index on 8 byte increments.

Method 1

SHL R2, #2        1 cycle
ADD R3, (R1)(R2)  5 cycles for a total of 6 cycles

Method 2

SHL R2, #4        1 cycle
ADD R2, R1        1 cycle
ADD R3, (R2)      4 cycles for a total of 6 cycles

Solution #3

Indexed addressing as I(R1)[R2] = M[ I + R1 + d*R2] and i as the array index into A. Method 1

ADD R3, 0(R1)[R2]    5 cycles

Method 2

SHL R2, #2           1 cycle
ADD R2, R1           1 cycle
ADD R3, (R2)         4 cycles for a total of 6 cycles

(b) Solution #1 -- 2 references

Method 1 uses 5 cycles per reference for a total of 10 cycles for 2 references.

Method 2 uses 5 cycles for the first reference and 4 for each subsequent reference for a total of 9 cycles for 2 references.

Solution #2 -- 2 references

Method 1 uses 6 cycles for the first reference and 5 for each subsequent reference for a total of 11 cycles for 2 references.

Method 2 uses 6 cycles for the first reference and 4 for each subsequent reference for a total of 10 cycles for 2 references.

Solution #3 -- 3 references

Method 1 uses 5 cycles for each reference for a total of 15 cycles for 3 references.

Method 2 uses 6 cycles for the first reference and 4 for each subsequent reference for a total of 14 cycles for 3 references.

(c) A single save/restore of a register to the stack costs 10 cycles. This is independent of the solution to part (a).

Solution #1 -- 12 references. The first reference for both methods costs the same number of cycles. After that, each reference by method 2 saves 1 cycle. Therefore, after 10 more references, the cycle count will be the same, leaving one more reference to make method 2 faster.

Method 1 cycles = 5 * n therefore 12 refs => 60 cycles

Method 2 cycles = 5 + 4 * (n-1) + 10 therefore 12 refs => 59 cycles

Solution #2 -- 12 references. Reasoning is the same as solution #1.

Method 1 cycles = 6 + 5 * (n-1) therefore 12 refs => 61 cycles

Method 2 cycles = 6 + 4 * (n-1) + 10 therefore 12 refs => 60 cycles

Solution #3 -- 13 references. Reasoning is the same as solution #1 except that the first reference with method 2 costs an extra cycle, so one more reference is needed to catch up.

Method 1 cycles = 5 * n therefore 13 refs => 65 cycles

Method 2 cycles = 6 + 4 * (n-1) + 10 therefore 13 refs => 64 cycles

(d) For all three solutions, replace the main ADD instruction with a LD instruction and then calculate A[i]+x using register direct mode. With both methods, all future accesses take the same number of cycles (ie, 1) so neither method is faster than the other.

3. Refer to appendix C.4 in the text for the numbers used in these calculations.

(a) 26% of instructions transfer data and 1% (MOVEFP2I and MOVEI2FP ) transfer data between registers.

Therefore, 26-1 or 25% of instructions transfer data to/from memory.

100% of instructions load instructions from memory.

Therefore, 25/125 or 20% of all accesses are for data.

(b) 25% of all instruction transfer data to/from memory (from part a).

LW , LBU and LD are reading from memory and they account for 13%, 1%, and 4% (total of 18%) of the instructions.

Therefore, 18/25 or 72% of all data accesses are reading from memory.

(c) 18% of instructions read data from memory (from part b).

100% of instructions read instructions from memory.

Therefore, (100+18)/(100+25) or 94.4% of all accesses are for reads.

4.(a) Let "n" be the number of instructions in a given program. From table 4.32 in the text, the following formulas represent the number of particular instructions executed by the 8086:

POP 5% * n
PUSH 7% * n
LES 3% * n
MOV 27% * n

The sum of these formulas is (5+7+3+27%) * n = 42% * n

The following formulas represent the number of instructions executed by the modified machine:

POP (75% * 5%) * n * 2
PUSH (75% * 7%) * n * 2
LES (75% * 3%) * n
MOV (75% * 27%) * n

The sum of these formulas is 75% * (5*2+7*2+3+27%) * n = 40.5% * n

All other instruction counts remain the same, and, therefore, may be ignored for the calculations.

Consequently, the new 8086 will execute 42.0% - 40.5% = 1.5% fewer instructions than the original.

(b) To calculate the new CPI, we will compute the number of cycles used for the same program on the new machine and we will calculate the number of instructions used on the new machine. The new CPI will be the new number of cycles divided by the new number of instructions.

Relevant data from table D.5 (time/frequency ratio)

POP 0.6
PUSH 1.0
LES 1.7
MOV 0.8
ADD 0.4

Relevant data from table 4.32 (frequency distribution):

POP 5%
PUSH 7%
LES 3%
MOV 27%
ADD 3%

Using the data from table D.5, calculate CPI for individual instructions as the average CPI multiplied by the time/frequency ratio.

POP 8.46
PUSH 14.1
LES 24.0
MOV 11.3
ADD 5.64

A program with n instructions will use CPI*n or 14.1*n cycles on the original machine.

The new machine will use the same number of cycles minus 1/4 of the LES and MOV cycles, minus all of the PUSH and POP cycles and plus 3/4 of the number of PUSH and POP instructions as the cycles used by a MOV and ADD instruction.

Breaking that down into pieces, 1/4 of the LES and MOV instructions use the following number of cycles:

(n*3%*24.0+n*27%*11.3)/4=n*0.943 cycles

All of the PUSH and POP cycles are:

n*5%*8.46+n*7%*14.1=n*1.41 cycles

The cost of a MOV and an ADD instruction are:

11.3+5.64=16.9 cycles

Multiply that by 3/4 of the number of PUSH and POP instruction to get the cycles used by the PUSH/POP replacement instructions:

(n*5%*75%+n*7%*75%)*16.9=(n*0.09)*16.9=n*1.52

Now, putting it all together, the new number of cycles is:

n*14.1-n*0.943-n*1.41+n*1.52=n*13.3

To compute the new number of instructions, we use the results of part (a) which tell us how many instructions the new machine uses. That is, the new machine uses 98.5%*n instructions. This gives is a new CPI as follows:

(n*13.3)/(n*98.5%)=13.3/0.985=13.5 CPI

Therefore, the original machine has a higher CPI by 0.6 cycles per instruction.

(c) The speedup for the new (faster) CPU is

14.1/13.5=1.04 (or a 4.44% increase)

(d) In this case, the only metric of performance that differs between the two machines is CPI. Performance is inversely proportional to CPI, so the speedup for the new (faster) CPU is

 1/CPI_old - 1/CPI_new 
 ---------------------
       1/CPI_old

multiplying through by CPI_old/CPI_old gives:

 1 - CPI_old/CPI_new

which evaluates to:

1 - 14.1/13.5=0.04 (or a 4.44% increase)