CPSC 418: Assignment #1 Solutions
1. Accumulator
LOAD A, AddrE ; Acc = E
AND A, 0x0001 ; Acc = Acc & 0x0001
BEQZ A, ELSE ; IF (Acc == 0) THEN BRANCH TO ELSE
LOAD A, AddrA ; Acc = A
ADD A, AddrD ; Acc = A + D
STORE AddrA, A ; M[AddrA] = Acc
ELSE LOAD A, AddrD ; Acc = D
SLL A, 1 ; Acc = Acc << 1
STORE AddrD, A ; M[AddrD] = Acc
LOAD A, AddrE ; Acc = E
SRL A, 1 ; Acc = Acc >> 1
STORE AddrE, A ; M[AddrE] = Acc
Inst. bytes = (# inst. w/o immediate * 1 byte) +
(# inst. w/ immediate * 3 byte)
= 0*1 + 12*3
= 36 bytes
Memory bytes = 36 + 8*4 = 68 bytes
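Since the THEN path falls through into the code at ELSE, the doubling of D and the halving of E happen on every execution, resembling one step of a shift-and-add multiply. A minimal Python sketch of what the listing computes (variable names are assumed from the address labels):

```python
def step(a, d, e):
    """One pass through the accumulator listing: add D into A when the
    low bit of E is set, then double D and halve E unconditionally
    (the THEN path falls through to the ELSE code)."""
    if e & 0x0001:      # LOAD/AND/BEQZ: test the low bit of E
        a = a + d       # ADD/STORE: A = A + D
    d = d << 1          # SLL: D = D << 1
    e = e >> 1          # SRL: E = E >> 1
    return a, d, e
```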
Stack
PUSH AddrE
PUSH 0x0001
AND
BEQZ ELSE
PUSH AddrA
PUSH AddrD
ADD
POP AddrA
ELSE PUSH AddrD
DUP
ADD
POP AddrD
PUSH AddrE
SRL
POP AddrE
Inst. bytes = (# inst. w/o immediate * 1 byte) +
(# inst. w/ immediate * 3 byte)
= 5*1 + 10*3
= 35 bytes
Memory bytes = 35 + 8*4 = 67 bytes
Memory/Memory
MOVL D1, AddrE(D0) ; D1 = E
AND D2, D1, #1 ; D2 = E & 0x0001
TSTL D2 ; set condition codes from D2
BEQL ELSE ; IF (D2 == 0) THEN GOTO ELSE
ADDL3 AddrA(D0), AddrA(D0), AddrD(D0) ; A = A + D
ELSE ADDL3 AddrD(D0), AddrD(D0), AddrD(D0) ; D = D + D
ASHL AddrE(D0), AddrE(D0), #-1 ; E = E >> 1
Instruction bytes = 4 + 4 + 2 + 3 + 9 + 9 + 9
= 40 bytes
Memory bytes = 40 + 9*4 = 76 bytes
Load/Store
LOAD R1, AddrE(R0) ; R1 = E
ANDI R2, R1, 0x0001 ; R2 = E & 0x0001
BEQ R2, R0, ELSE ; IF (R2 == 0) THEN GOTO ELSE
LOAD R4, AddrA(R0) ; R4 = A
LOAD R5, AddrD(R0) ; R5 = D
ADD R4, R4, R5 ; R4 = A + D
STORE AddrA(R0), R4 ; M[AddrA] = R4
ELSE LOAD R5, AddrD(R0) ; R5 = D
SLL R5, R5, 1 ; R5 = D << 1
STORE AddrD(R0), R5 ; M[AddrD] = R5
SRL R1, R1, 1 ; R1 = E >> 1
STORE AddrE(R0), R1 ; M[AddrE] = R1
Instruction bytes = 12 * 4 = 48 bytes
Memory bytes = 48 + 7*4 = 76 bytes
Code Size Rankings:
Stack < Accumulator < Memory/Memory < Load/Store
Memory Bandwidth Rankings:
Stack < Accumulator < Memory/Memory = Load/Store
Yes, the rankings for code size and memory bandwidth would change,
because architectures with fewer registers must load, store, and
reload data frequently as the number of variables in the code
fragment grows.
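The byte counts above can be re-derived mechanically; a short Python check (the instruction counts and per-instruction sizes are copied from the listings above):

```python
def code_bytes(no_imm, with_imm):
    """1 byte per plain opcode, 3 bytes per instruction carrying an
    address or immediate."""
    return no_imm * 1 + with_imm * 3

acc_code   = code_bytes(0, 12)           # accumulator
stack_code = code_bytes(5, 10)           # stack
mm_code    = 4 + 4 + 2 + 3 + 9 + 9 + 9   # memory/memory, per-instruction sizes
ls_code    = 12 * 4                      # load/store: fixed 4-byte instructions

# total memory traffic = instruction bytes + 4 bytes per data access
acc_total   = acc_code + 8 * 4
stack_total = stack_code + 8 * 4
mm_total    = mm_code + 9 * 4
ls_total    = ls_code + 7 * 4
```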
----------------------------------------------------------------------
2. Cycle count for an average instruction:
M0 = 4 + 3*6 = 22 cycles
M8 = 5 + 0.5(6 + 1) + 0.5(6 + 1) + 0.5(6 + 1) = 15.5 cycles
M16 = 6 + 0.2(6) + 0.8(1) + 0.2(6) + 0.8(1) + 0.2(6) + 0.8(1)
= 12 cycles
Average # of cycles to access memory operand with cache
= 1 + 1 + 0.1(5)
= 2.5 cycles
Cycle count for an average instruction on each machine with cache:
M0 = 4 + 3(2.5) = 11.5 cycles
M8 = 5 + 0.5(2.5 + 1) + 0.5(2.5 + 1) + 0.5(2.5 + 1) = 10.25 cycles
M16 = 6 + 0.2(2.5) + 0.8(1) + 0.2(2.5) + 0.8(1) + 0.2(2.5) + 0.8(1)
= 9.9 cycles
Chip price/performance ratio
W/O Cache
M0 = 500 * 22 = 11000
M8 = 500 * 15.5 = 7750
M16 = 500 * 12 = 6000
W/ Cache
M0 = 600 * 11.5 = 6900
M8 = 600 * 10.25 = 6150
M16 = 600 * 9.9 = 5940
The M16 chip with cache has the best price/performance ratio,
by a tiny margin.
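These figures can be reproduced with a short script (the chip prices of 500 and 600 and the operand access times are taken from the problem as used above):

```python
def m0_cpi(op):   # per the M0 formula above: 4 + 3 * op
    return 4 + 3 * op

def m8_cpi(op):   # per the M8 formula above: 5 + 3 * 0.5 * (op + 1)
    return 5 + 3 * (0.5 * (op + 1))

def m16_cpi(op):  # per the M16 formula above: 6 + 3 * (0.2*op + 0.8*1)
    return 6 + 3 * (0.2 * op + 0.8 * 1)

mem    = 6                  # memory operand access without cache (cycles)
cached = 1 + 1 + 0.1 * 5    # average access with cache = 2.5 cycles

no_cache   = [m0_cpi(mem), m8_cpi(mem), m16_cpi(mem)]
with_cache = [m0_cpi(cached), m8_cpi(cached), m16_cpi(cached)]

price_perf = ([500 * c for c in no_cache] +
              [600 * c for c in with_cache])
```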
-----------------------------------------------------------------------
3. For DLX machine:
Branches = (14 + 7 + 18) / 3 = 13%
L/S = (36 + 28 + 36) / 3 = 33.3%
ALU = (50 + 65 + 46) / 3 = 53.7%
a) InstructionsDLX = 13 + 33.3 + 53.7 = 100
InstructionsEDLX = 13 + 33.3 + 53.7 - 0.10(33.3) = 96.67
b) Total_cycles_EDLX = 0.9667 * InstructionsDLX * CPI
Total_cycles_DLX = InstructionsDLX * CPI
Clock_EDLX = clock_DLX + 0.05clock_DLX
= 1.05clock_DLX
CPU_Time_EDLX = (Total_cycles_EDLX)(1.05clock_DLX)
= (0.9667 * InstructionsDLX * CPI)(1.05clock_DLX)
= 1.015(InstructionsDLX)(CPI)(clock_DLX)
CPU_Time_DLX = (Total_cycles_DLX)(clock_DLX)
= (InstructionsDLX * CPI)(clock_DLX)
= 1.0(InstructionsDLX)(CPI)(clock_DLX)
DLX is faster than the enhanced DLX.
(CPU_Time_EDLX - CPU_Time_DLX) / CPU_Time_DLX
= (1.015 - 1.0) / 1.0
= 0.015
DLX is 1.50% faster than the Enhanced DLX.
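The 1.5% figure follows from a short calculation:

```python
inst_dlx  = 100.0                       # normalized DLX instruction count
ls_frac   = (36 + 28 + 36) / 3          # loads/stores: 33.3% of instructions
inst_edlx = inst_dlx - 0.10 * ls_frac   # enhancement removes 10% of the L/S

# CPI is unchanged; the enhanced machine's clock period is 5% longer
rel_time = (inst_edlx / inst_dlx) * 1.05   # EDLX CPU time relative to DLX
```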
----------------------------------------------------------------------
4. Branches = (14 + 7 + 18) / 3 = 13%
L/S = (36 + 28 + 36) / 3 = 33.3%
ALU = (50 + 65 + 46) / 3 = 53.7%
a) Avg_LS_Inst_Len = 16(0.16) + 24(0.62 - 0.16) + 32(1.0 - 0.62)
= 25.76 bits
Avg_Branch_Inst_Len = 16(0.0) + 24(0.89) + 32(1.0 - 0.89)
= 24.88 bits
Avg_ExecutedInst_Len = 24.88(0.13) + 25.76(0.333) + 16(0.537)
= 20.40 bits
b) Avg_LS_Inst_Len = 24(0.62) + 48(0.38) = 33.12 bits
Avg_Branch_Inst_Len = 24(0.89) + 48(0.11) = 26.64 bits
Avg_ExecutedInst_Len = 26.64(0.13) + 33.12(0.333) + 24(0.537)
= 27.38 bits
# of bytes fetched = 1000 * (27.38 / 20.40) = 1342 bytes
c) Avg_ExecutedInst_Len = 32 bits
# of bytes fetched = 1000 * (32 / 20.40) = 1569 bytes
d) Avg_LS_Inst_Len = 16(0.16) + 32(0.84) = 29.44 bits
Avg_Branch_Inst_Len = 32 bits
Avg_ExecutedInst_Len = 32(0.13) + 29.44(0.333) + 16(0.537)
= 22.56 bits
# of bytes fetched = 1000 * (22.56 / 20.40) = 1106 bytes
e) Machine A
= (20.40 bits/inst * 1/8 * 1 cycle/byte) + 2 cycles/inst
= 4.55 cycles/inst
Machine B
= 27.38 bits/inst * 1/8 * 1 cycle/byte
= 3.42 cycles/inst
Machine C
= 32 bits/inst * 1/8 * 1 cycle/byte
= 4 cycles/inst
Machine D
= (22.56 bits/inst * 1/8 * 1 cycle/byte) + 1 cycle/inst
= 3.82 cycles/inst
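The averages and fetch counts in (a)-(e) can be re-derived in a few lines (intermediate lengths are rounded to two decimals, as in the working above, so the byte counts come out the same):

```python
def avg(pairs):
    """Weighted average of (length, fraction) pairs."""
    return sum(length * frac for length, frac in pairs)

BR, LS, ALU = 0.13, 0.333, 0.537     # instruction mix from above

# a) three encoding lengths (16/24/32 bits)
ls_a   = avg([(16, 0.16), (24, 0.62 - 0.16), (32, 1.0 - 0.62)])
br_a   = avg([(24, 0.89), (32, 0.11)])
inst_a = round(avg([(br_a, BR), (ls_a, LS), (16, ALU)]), 2)   # 20.40

# b) two encoding lengths (24/48 bits)
ls_b   = avg([(24, 0.62), (48, 0.38)])
br_b   = avg([(24, 0.89), (48, 0.11)])
inst_b = round(avg([(br_b, BR), (ls_b, LS), (24, ALU)]), 2)   # 27.38

# d) two encoding lengths (16/32 bits)
ls_d   = avg([(16, 0.16), (32, 0.84)])
inst_d = round(avg([(32, BR), (ls_d, LS), (16, ALU)]), 2)     # 22.56

fetched_b = round(1000 * inst_b / inst_a)   # bytes fetched by machine B
fetched_c = round(1000 * 32 / inst_a)       # machine C, fixed 32-bit
fetched_d = round(1000 * inst_d / inst_a)   # machine D

# e) cycles/inst on a 1 byte/cycle bus, plus the fixed decode penalties
cyc = [inst_a / 8 + 2, inst_b / 8, 32 / 8, inst_d / 8 + 1]
```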
--------------------------------------------------------------------------
5. With partial word access
= 1(0.14) + 1(0.36) + 1(0.50) = 1 cycle/instruction
Without partial word access
= 1(0.14) + 0.11(3*0.12 + 2*0.24) + 0.89(1*0.36) + 1(0.50)
= 1.0528 cycles/instruction
 1 cycle/instruction     1.0528 cycles/instruction
--------------------  =  -------------------------
    clock_rate_A                clock_rate_B
clock_rate_B = 1.0528 * clock_rate_A
The clock rate must be increased by a factor of at least 1.0528
to make eliminating partial word accesses a good tradeoff.
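A quick check of the break-even arithmetic (the frequencies and the grouping of terms are taken exactly from the expression above):

```python
# average cycles per instruction for the memory-access portion
with_partial = 1 * 0.14 + 1 * 0.36 + 1 * 0.50   # every access is 1 cycle

without_partial = (1 * 0.14
                   + 0.11 * (3 * 0.12 + 2 * 0.24)
                   + 0.89 * (1 * 0.36)
                   + 1 * 0.50)   # terms exactly as in the working above

clock_factor = without_partial / with_partial   # minimum speedup needed
```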
-------------------------------------------------------------------------
6.
a) The four steps the microcode must perform are:
1) align the access to a word boundary
2) load the word
3) mask out unwanted bits
4) shift the wanted bits to the bottom of the register
Below is a fairly efficient algorithm:
(1) is accomplished by shifting the address right two bits and then
back left two bits, thus setting the bottom two bits to zero and word
aligning the address.
(2) is a simple lw instruction
(3) and (4) are accomplished together in two steps. First, the
interesting data is shifted to the top of m0. To figure out how far
this shift should be, subtract m2 from 31 (which takes two
instructions, since you need to negate m2 first); then shift left by
that amount to put the data at the top of the register.
Now you need to check 4 cases based on m1 and m3. If the data is
unsigned, logical shift will be used to move the data to the bottom of
the register (since it fills bits on the left as zeroes). If the data
is signed, arithmetic shift will be used to move the data back to the
bottom (since it fills bits on the left to preserve sign). The other
two cases are whether the data is a byte (shift right 24 bits) or a
halfword (shift right 16 bits).
; word align address
SRLI m0, m0, #2 ; shift bottom two bits out of addr
SLLI m0, m0, #2 ; shift addr back into position
; load word
LW m0, 0(m0) ; load the word
; shift data to the top of register
SUB m2, r0, m2 ; -m2 (a negative number)
ADDI m2, m2, #31 ; m2 = 31 - (original m2)
SLLV m0, m0, m2 ; shift interesting data to top of register
; case analysis
BEQZ m3, ELSE1 ; IF unsigned THEN goto ELSE1
BNEZ m1, ELSE2 ; IF halfword THEN goto ELSE2
SRAI m0, m0, #24 ; m0 = m0 >> 24 (Arithmetic shift)
RFE ; finished
ELSE2 SRAI m0, m0, #16 ; m0 = m0 >> 16 (Arithmetic shift)
RFE ; finished
ELSE1 BNEZ m1, ELSE3 ; IF halfword THEN goto ELSE3
SRLI m0, m0, #24 ; m0 = m0 >> 24 (Logical shift)
RFE ; finished
ELSE3 SRLI m0, m0, #16 ; m0 = m0 >> 16 (Logical shift)
RFE ; finished
8 instructions will always be executed (once you include the final
right shift and the RFE). The only question is how many branches are
taken. If all four access types are equally likely, the unsigned
branch is taken 50% of the time and a halfword branch 50% of the
time, so the expected number of taken branches is
0(0.25) + 1(0.25 + 0.25) + 2(0.25) = 1.
  pipeline penalty     5
  necessary code       8
  taken branches     + 1
                    ----
  total               14 cycles (approx.)
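The extraction the microcode performs can be modeled in a few lines of Python (a sketch, not the actual microengine: it assumes m2 holds the bit index of the field's most significant bit, and that nonzero m1/m3 mean halfword/signed, matching the branch comments above):

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

def extract(word, m2, halfword, signed):
    """Shift the field whose MSB is bit m2 up to bit 31 (the SLLV by
    31 - m2), then shift back down by 16 for a halfword or 24 for a
    byte: SRAI if signed, SRLI if unsigned."""
    top = (word << (31 - m2)) & MASK32
    shamt = 16 if halfword else 24
    if signed:
        return to_signed32(top) >> shamt   # arithmetic shift extends the sign
    return top >> shamt                    # logical shift fills with zeroes
```

For example, the signed byte 0xFF sitting in bits 8-15 of the loaded word comes out as -1, while the unsigned extraction yields 255.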
b)
type of access     select1    select2    select3    select4
full word
  signed           M[0-7]     M[8-15]    M[16-23]   M[24-31]
  unsigned         M[0-7]     M[8-15]    M[16-23]   M[24-31]
half word in bottom half of memory bus (bits 0-15)
  signed           M[0-7]     M[8-15]    bp1 x 8    bp2 x 8
  unsigned         M[0-7]     M[8-15]    0 x 8      0 x 8
half word in top half of memory bus (bits 16-31)
  signed           M[16-23]   M[24-31]   bp1 x 8    bp2 x 8
  unsigned         M[16-23]   M[24-31]   0 x 8      0 x 8
byte in bottom byte of memory bus (bits 0-7)
  signed           M[0-7]     bp0 x 8    bp1 x 8    bp2 x 8
  unsigned         M[0-7]     0 x 8      0 x 8      0 x 8
byte in second byte of memory bus (bits 8-15)
  signed           M[8-15]    bp0 x 8    bp1 x 8    bp2 x 8
  unsigned         M[8-15]    0 x 8      0 x 8      0 x 8
byte in third byte of memory bus (bits 16-23)
  signed           M[16-23]   bp0 x 8    bp1 x 8    bp2 x 8
  unsigned         M[16-23]   0 x 8      0 x 8      0 x 8
byte in top byte of memory bus (bits 24-31)
  signed           M[24-31]   bp0 x 8    bp1 x 8    bp2 x 8
  unsigned         M[24-31]   0 x 8      0 x 8      0 x 8
c) No, the four MUXes do not operate in parallel: three of the MUXes
depend on the output of the previous MUX, so there is a serialized
order in which the MUXes operate. With only enough time to access the
cache and get the signal through a single MUX, the memory access time
must be increased beyond one cycle.
d) Here are just a few (the first two are the key issues):
- increased cycle time using the new hardware vs. the total number of
cycles needed using the old microcode for unaligned accesses
- frequency of unaligned accesses in programs (e.g., word processors
perform many byte operations)
- additional MUX hardware required for the second version vs. the
microcode store for the "software" solution
- design time to create the new hardware
- testing time to make sure the design engineer didn't mess up
Last modified: Feb 97