CPSC 418: Assignment #1 Solutions
1. Accumulator
LOAD A, AddrE ; Acc = E
AND A, 0x0001 ; Acc = Acc & 0x0001
BEQZ A, ELSE ; IF (Acc == 0) THEN BRANCH TO ELSE
LOAD A, AddrA ; Acc = A
ADD A, AddrD ; Acc = A + D
STORE AddrA, A ; M[AddrA] = Acc
ELSE LOAD A, AddrD ; Acc = D
SLL A, 1 ; Acc = Acc << 1
STORE AddrD, A ; M[AddrD] = Acc
LOAD A, AddrE ; Acc = E
SRL A, 1 ; Acc = Acc >> 1
STORE AddrE, A ; M[AddrE] = Acc
Inst. bytes = (# inst. w/o immediate * 1 byte) +
(# inst. w/ immediate * 3 byte)
= 0*1 + 12*3
= 36 bytes
Memory bytes = 36 + 8*4 = 68 bytes
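Since the THEN path falls through into the code at ELSE, the doubling of D and the halving of E happen on every execution, resembling one step of a shift-and-add multiply. A minimal Python sketch of what the listing computes (variable names are assumed from the address labels):

```python
def step(a, d, e):
    """One pass through the accumulator listing: add D into A when the
    low bit of E is set, then double D and halve E unconditionally
    (the THEN path falls through to the ELSE code)."""
    if e & 0x0001:      # LOAD/AND/BEQZ: test the low bit of E
        a = a + d       # ADD/STORE: A = A + D
    d = d << 1          # SLL: D = D << 1
    e = e >> 1          # SRL: E = E >> 1
    return a, d, e
```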
Stack
PUSH AddrE
PUSH 0x0001
AND
BEQZ ELSE
PUSH AddrA
PUSH AddrD
ADD
POP AddrA
ELSE PUSH AddrD
DUP
ADD
POP AddrD
PUSH AddrE
SRL
POP AddrE
Inst. bytes = (# inst. w/o immediate * 1 byte) +
(# inst. w/ immediate * 3 byte)
= 5*1 + 10*3
= 35 bytes
Memory bytes = 35 + 8*4 = 67 bytes
Memory/Memory
MOVL D1, AddrE(D0) ; D1 = E
AND D2, D1, #1 ; D2 = E & 0x0001
TSTL D2 ; set condition codes from D2
BEQL ELSE ; IF (D2 == 0) THEN GOTO ELSE
ADDL3 AddrA(D0), AddrA(D0), AddrD(D0) ; A = A + D
ELSE ADDL3 AddrD(D0), AddrD(D0), AddrD(D0) ; D = D + D
ASHL AddrE(D0), AddrE(D0), #-1 ; E = E >> 1
Instruction bytes = 4 + 4 + 2 + 3 + 9 + 9 + 9
= 40 bytes
Memory bytes = 40 + 9*4 = 76 bytes
Load/Store
LOAD R1, AddrE(R0) ; R1 = E
ANDI R2, R1, 0x0001 ; R2 = E & 0x0001
BEQ R2, R0, ELSE ; IF (R2 == 0) THEN GOTO ELSE
LOAD R4, AddrA(R0) ; R4 = A
LOAD R5, AddrD(R0) ; R5 = D
ADD R4, R4, R5 ; R4 = A + D
STORE AddrA(R0), R4 ; M[AddrA] = R4
ELSE LOAD R5, AddrD(R0) ; R5 = D
SLL R5, R5, 1 ; R5 = D << 1
STORE AddrD(R0), R5 ; M[AddrD] = R5
SRL R1, R1, 1 ; R1 = E >> 1
STORE AddrE(R0), R1 ; M[AddrE] = R1
Instruction bytes = 12 * 4 = 48 bytes
Memory bytes = 48 + 7*4 = 76 bytes
Code Size Rankings:
Stack < Accumulator < Memory/Memory < Load/Store
Memory Bandwidth Rankings:
Stack < Accumulator < Memory/Memory = Load/Store
Yes, the rankings for code size and memory bandwidth would change,
because architectures with fewer registers must load, store, and
reload data frequently as the number of variables in the code
fragment grows.
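The byte counts above can be re-derived mechanically; a short Python check (the instruction counts and per-instruction sizes are copied from the listings above):

```python
def code_bytes(no_imm, with_imm):
    """1 byte per plain opcode, 3 bytes per instruction carrying an
    address or immediate."""
    return no_imm * 1 + with_imm * 3

acc_code   = code_bytes(0, 12)           # accumulator
stack_code = code_bytes(5, 10)           # stack
mm_code    = 4 + 4 + 2 + 3 + 9 + 9 + 9   # memory/memory, per-instruction sizes
ls_code    = 12 * 4                      # load/store: fixed 4-byte instructions

# total memory traffic = instruction bytes + 4 bytes per data access
acc_total   = acc_code + 8 * 4
stack_total = stack_code + 8 * 4
mm_total    = mm_code + 9 * 4
ls_total    = ls_code + 7 * 4
```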
----------------------------------------------------------------------
2. Cycle count for an average instruction:
M0 = 4 + 3*6 = 22 cycles
M8 = 5 + 0.5(6 + 1) + 0.5(6 + 1) + 0.5(6 + 1) = 15.5 cycles
M16 = 6 + 0.2(6) + 0.8(1) + 0.2(6) + 0.8(1) + 0.2(6) + 0.8(1)
= 12 cycles
Average # of cycles to access memory operand with cache
= 1 + 1 + 0.1(5)
= 2.5 cycles
Cycle count for an average instruction on each machine with cache:
M0 = 4 + 3(2.5) = 11.5 cycles
M8 = 5 + 0.5(2.5 + 1) + 0.5(2.5 + 1) + 0.5(2.5 + 1) = 10.25 cycles
M16 = 6 + 0.2(2.5) + 0.8(1) + 0.2(2.5) + 0.8(1) + 0.2(2.5) + 0.8(1)
= 9.9 cycles
Chip price/performance ratio
W/O Cache
M0 = 500 * 22 = 11000
M8 = 500 * 15.5 = 7750
M16 = 500 * 12 = 6000
W/ Cache
M0 = 600 * 11.5 = 6900
M8 = 600 * 10.25 = 6150
M16 = 600 * 9.9 = 5940
The M16 chip with cache has the best price/performance ratio,
by a tiny margin.
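These figures can be reproduced with a short script (the chip prices of 500 and 600 and the operand access times are taken from the problem as used above):

```python
def m0_cpi(op):   # per the M0 formula above: 4 + 3 * op
    return 4 + 3 * op

def m8_cpi(op):   # per the M8 formula above: 5 + 3 * 0.5 * (op + 1)
    return 5 + 3 * (0.5 * (op + 1))

def m16_cpi(op):  # per the M16 formula above: 6 + 3 * (0.2*op + 0.8*1)
    return 6 + 3 * (0.2 * op + 0.8 * 1)

mem    = 6                  # memory operand access without cache (cycles)
cached = 1 + 1 + 0.1 * 5    # average access with cache = 2.5 cycles

no_cache   = [m0_cpi(mem), m8_cpi(mem), m16_cpi(mem)]
with_cache = [m0_cpi(cached), m8_cpi(cached), m16_cpi(cached)]

price_perf = ([500 * c for c in no_cache] +
              [600 * c for c in with_cache])
```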
-----------------------------------------------------------------------
3. For DLX machine:
Branches = (14 + 7 + 18) / 3 = 13%
L/S = (36 + 28 + 36) / 3 = 33.3%
ALU = (50 + 65 + 46) / 3 = 53.7%
a) InstructionsDLX = 13 + 33.3 + 53.7 = 100
InstructionsEDLX = 13 + 33.3 + 53.7 - 0.10(33.3) = 96.67
b) Total_cycles_EDLX = 0.9667 * InstructionsDLX * CPI
Total_cycles_DLX = InstructionsDLX * CPI
Clock_EDLX = clock_DLX + 0.05clock_DLX
= 1.05clock_DLX
CPU_Time_EDLX = (Total_cycles_EDLX)(1.05clock_DLX)
= (0.9667 * InstructionsDLX * CPI)(1.05clock_DLX)
= 1.015(InstructionsDLX)(CPI)(clock_DLX)
CPU_Time_DLX = (Total_cycles_DLX)(clock_DLX)
= (InstructionsDLX * CPI)(clock_DLX)
= 1.0(InstructionsDLX)(CPI)(clock_DLX)
DLX is faster than the enhanced DLX.
(CPU_Time_EDLX - CPU_Time_DLX) / CPU_Time_DLX
= (1.015 - 1.0) / 1.0
= 0.015
DLX is 1.50% faster than the Enhanced DLX.
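The 1.5% figure follows from a short calculation:

```python
inst_dlx  = 100.0                       # normalized DLX instruction count
ls_frac   = (36 + 28 + 36) / 3          # loads/stores: 33.3% of instructions
inst_edlx = inst_dlx - 0.10 * ls_frac   # enhancement removes 10% of the L/S

# CPI is unchanged; the enhanced machine's clock period is 5% longer
rel_time = (inst_edlx / inst_dlx) * 1.05   # EDLX CPU time relative to DLX
```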
----------------------------------------------------------------------
4. Branches = (14 + 7 + 18) / 3 = 13%
L/S = (36 + 28 + 36) / 3 = 33.3%
ALU = (50 + 65 + 46) / 3 = 53.7%
a) Avg_LS_Inst_Len = 16(0.16) + 24(0.62 - 0.16) + 32(1.0 - 0.62)
= 25.76 bits
Avg_Branch_Inst_Len = 16(0.0) + 24(0.89) + 32(1.0 - 0.89)
= 24.88 bits
Avg_ExecutedInst_Len = 24.88(0.13) + 25.76(0.333) + 16(0.537)
= 20.40 bits
b) Avg_LS_Inst_Len = 24(0.62) + 48(0.38) = 33.12 bits
Avg_Branch_Inst_Len = 24(0.89) + 48(0.11) = 26.64 bits
Avg_ExecutedInst_Len = 26.64(0.13) + 33.12(0.333) + 24(0.537)
= 27.38 bits
# of bytes fetched = 1000 * (27.38 / 20.40) = 1342 bytes
c) Avg_ExecutedInst_Len = 32 bits
# of bytes fetched = 1000 * (32 / 20.40) = 1569 bytes
d) Avg_LS_Inst_Len = 16(0.16) + 32(0.84) = 29.44 bits
Avg_Branch_Inst_Len = 32 bits
Avg_ExecutedInst_Len = 32(0.13) + 29.44(0.333) + 16(0.537)
= 22.56 bits
# of bytes fetched = 1000 * (22.56 / 20.40) = 1106 bytes
e) Machine A
= (20.40 bits/inst * 1/8 * 1 cycle/byte) + 2 cycles/inst
= 4.55 cycles/inst
Machine B
= 27.38 bits/inst * 1/8 * 1 cycle/byte
= 3.42 cycles/inst
Machine C
= 32 bits/inst * 1/8 * 1 cycle/byte
= 4 cycles/inst
Machine D
= (22.56 bits/inst * 1/8 * 1 cycle/byte) + 1 cycle/inst
= 3.82 cycles/inst
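The averages and fetch counts in (a)-(e) can be re-derived in a few lines (intermediate lengths are rounded to two decimals, as in the working above, so the byte counts come out the same):

```python
def avg(pairs):
    """Weighted average of (length, fraction) pairs."""
    return sum(length * frac for length, frac in pairs)

BR, LS, ALU = 0.13, 0.333, 0.537     # instruction mix from above

# a) three encoding lengths (16/24/32 bits)
ls_a   = avg([(16, 0.16), (24, 0.62 - 0.16), (32, 1.0 - 0.62)])
br_a   = avg([(24, 0.89), (32, 0.11)])
inst_a = round(avg([(br_a, BR), (ls_a, LS), (16, ALU)]), 2)   # 20.40

# b) two encoding lengths (24/48 bits)
ls_b   = avg([(24, 0.62), (48, 0.38)])
br_b   = avg([(24, 0.89), (48, 0.11)])
inst_b = round(avg([(br_b, BR), (ls_b, LS), (24, ALU)]), 2)   # 27.38

# d) two encoding lengths (16/32 bits)
ls_d   = avg([(16, 0.16), (32, 0.84)])
inst_d = round(avg([(32, BR), (ls_d, LS), (16, ALU)]), 2)     # 22.56

fetched_b = round(1000 * inst_b / inst_a)   # bytes fetched by machine B
fetched_c = round(1000 * 32 / inst_a)       # machine C, fixed 32-bit
fetched_d = round(1000 * inst_d / inst_a)   # machine D

# e) cycles/inst on a 1 byte/cycle bus, plus the fixed decode penalties
cyc = [inst_a / 8 + 2, inst_b / 8, 32 / 8, inst_d / 8 + 1]
```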
--------------------------------------------------------------------------
5. With partial word access
= 1(0.14) + 1(0.36) + 1(0.50) = 1 cycle/instruction
Without partial word access
= 1(0.14) + 0.11(3*0.12 + 2*0.24) + 0.89(1*0.36) + 1(0.50)
= 1.0528 cycles/instruction
 1 cycle/instruction     1.0528 cycles/instruction
--------------------  =  -------------------------
    clock_rate_A                clock_rate_B
clock_rate_B = 1.0528 * clock_rate_A
The clock rate must be increased by a factor of at least 1.0528
to make eliminating partial word accesses a good tradeoff.
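A quick check of the break-even arithmetic (the frequencies and the grouping of terms are taken exactly from the expression above):

```python
# average cycles per instruction for the memory-access portion
with_partial = 1 * 0.14 + 1 * 0.36 + 1 * 0.50   # every access is 1 cycle

without_partial = (1 * 0.14
                   + 0.11 * (3 * 0.12 + 2 * 0.24)
                   + 0.89 * (1 * 0.36)
                   + 1 * 0.50)   # terms exactly as in the working above

clock_factor = without_partial / with_partial   # minimum speedup needed
```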
-------------------------------------------------------------------------
6.
a) The four steps the microcode must perform are:
1) align the access to a word boundary
2) load the word
3) mask out unwanted bits
4) shift the wanted bits to the bottom of the register
Below is a fairly efficient algorithm:
(1) is accomplished by shifting the address right two bits and then
back left two bits, thus setting the bottom two bits to zero and word
aligning the address.
(2) is a simple lw instruction
(3) and (4) are accomplished together in two steps. First, the
interesting data is shifted to the top of m0. To figure out how far
this shift should be, subtract m2 from 31 (which takes two
instructions, since you need to negate m2 first); then shift left by
that amount to put the data at the top of the register.
Now you need to check 4 cases based on m1 and m3. If the data is
unsigned, logical shift will be used to move the data to the bottom of
the register (since it fills bits on the left as zeroes). If the data
is signed, arithmetic shift will be used to move the data back to the
bottom (since it fills bits on the left to preserve sign). The other
two cases are whether the data is a byte (shift right 24 bits) or a
halfword (shift right 16 bits).
; word align address
SRLI m0, m0, #2 ; shift bottom two bits out of addr
SLLI m0, m0, #2 ; shift addr back into position
; load word
LW m0, 0(m0) ; load the word
; shift data to the top of register
SUB m2, r0, m2 ; -m2 (a negative number)
ADDI m2, m2, #31 ; m2 = 31 - (original m2)
SLLV m0, m0, m2 ; shift interesting data to top of register
; case analysis
BEQZ m3, ELSE1 ; IF unsigned THEN goto ELSE1
BNEZ m1, ELSE2 ; IF halfword THEN goto ELSE2
SRAI m0, m0, #24 ; m0 = m0 >> 24 (Arithmetic shift)
RFE ; finished
ELSE2 SRAI m0, m0, #16 ; m0 = m0 >> 16 (Arithmetic shift)
RFE ; finished
ELSE1 BNEZ m1, ELSE3 ; IF halfword THEN goto ELSE3
SRLI m0, m0, #24 ; m0 = m0 >> 24 (Logical shift)
RFE ; finished
ELSE3 SRLI m0, m0, #16 ; m0 = m0 >> 16 (Logical shift)
RFE ; finished
8 instructions will always be executed (once you include the final
right shift and the RFE). The only question is how many branches are
taken. If all four access types are equally likely, the unsigned
branch is taken 50% of the time and a halfword branch 50% of the
time, so the expected number of taken branches is
0(0.25) + 1(0.25 + 0.25) + 2(0.25) = 1.
  pipeline penalty     5
  necessary code       8
  taken branches     + 1
                    ----
  total               14 cycles (approx.)
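The extraction the microcode performs can be modeled in a few lines of Python (a sketch, not the actual microengine: it assumes m2 holds the bit index of the field's most significant bit, and that nonzero m1/m3 mean halfword/signed, matching the branch comments above):

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

def extract(word, m2, halfword, signed):
    """Shift the field whose MSB is bit m2 up to bit 31 (the SLLV by
    31 - m2), then shift back down by 16 for a halfword or 24 for a
    byte: SRAI if signed, SRLI if unsigned."""
    top = (word << (31 - m2)) & MASK32
    shamt = 16 if halfword else 24
    if signed:
        return to_signed32(top) >> shamt   # arithmetic shift extends the sign
    return top >> shamt                    # logical shift fills with zeroes
```

For example, the signed byte 0xFF sitting in bits 8-15 of the loaded word comes out as -1, while the unsigned extraction yields 255.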
b)
type of access     select1    select2    select3    select4
full word
  signed           M[0-7]     M[8-15]    M[16-23]   M[24-31]
  unsigned         M[0-7]     M[8-15]    M[16-23]   M[24-31]
half word in bottom half of memory bus (bits 0-15)
  signed           M[0-7]     M[8-15]    bp1 x 8    bp2 x 8
  unsigned         M[0-7]     M[8-15]    0 x 8      0 x 8
half word in top half of memory bus (bits 16-31)
  signed           M[16-23]   M[24-31]   bp1 x 8    bp2 x 8
  unsigned         M[16-23]   M[24-31]   0 x 8      0 x 8
byte in bottom byte of memory bus (bits 0-7)
  signed           M[0-7]     bp0 x 8    bp1 x 8    bp2 x 8
  unsigned         M[0-7]     0 x 8      0 x 8      0 x 8
byte in second byte of memory bus (bits 8-15)
  signed           M[8-15]    bp0 x 8    bp1 x 8    bp2 x 8
  unsigned         M[8-15]    0 x 8      0 x 8      0 x 8
byte in third byte of memory bus (bits 16-23)
  signed           M[16-23]   bp0 x 8    bp1 x 8    bp2 x 8
  unsigned         M[16-23]   0 x 8      0 x 8      0 x 8
byte in top byte of memory bus (bits 24-31)
  signed           M[24-31]   bp0 x 8    bp1 x 8    bp2 x 8
  unsigned         M[24-31]   0 x 8      0 x 8      0 x 8
c) No, the four MUXes do not operate in parallel: three of the MUXes
depend on the output of the previous MUX, so there is a serialized
order in which the MUXes operate. With only enough time to access the
cache and get the signal through a single MUX, the memory access time
must be increased beyond one cycle.
d) Here are just a few (the first two are the key issues):
- increased cycle time using the new hardware vs. the total number of
cycles needed using the old microcode for unaligned accesses
- frequency of unaligned accesses in programs (e.g., word processors
perform many byte operations)
- additional MUX hardware required for the second version vs. the
microcode store for the "software" solution
- design time to create the new hardware
- testing time to make sure the design engineer didn't mess up
Last modified: Feb 97