
CPSC 418: Assignment #1

Questions 2-5 require figures 3.32 and 3.23 from HP (available from me or the reading room). No other material from HP is required. The question numbers from HP are just references -- there is no need to look up the question in HP.

Question 6 requires figures 1 and 2, available from me or in postscript format.


                        HOMEWORK ASSIGNMENT #1

Out: Tues Jan 28/97					Due: Tues Feb 11/97
CS 418							Spring Session 1997
Marks: 100
------------------------------------------------------------------------------

Question 1: Different basic architectures (20 marks)
(adapted from HP Q 3.5, p.134)

This question will compare the memory and instruction encoding
efficiency of four different styles of instruction sets for a short
code sequence.  The architecture styles are:

Accumulator -- one register available for ALU (A), one for addresses
(P).  Instructions can use 1 immediate of any length.  Only two
operand instructions are supported, and one operand must be A or P.
ALU results must always go in A; loads/stores must always use an
immediate address or P, and the data must be A or an immediate (note
that two immediates are not allowed).  Data can be moved from A to P
and back.  Instructions are 1 byte, plus the length of any immediates
needed.
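
For illustration only (this is not part of the code you must
translate), a statement like X = Y + Z might look as follows on the
accumulator machine.  The mnemonics and the labels AddrX, AddrY and
AddrZ are made up, and we assume an ALU instruction may name a memory
address as its non-A operand (as on a classic accumulator machine):

	load  AddrY	; A = M[AddrY]       (1 byte opcode + 2 byte address)
	add   AddrZ	; A = A + M[AddrZ]   (3 bytes)
	store AddrX	; M[AddrX] = A       (3 bytes)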

Stack -- All operations occur on the top of the stack.  Only push and
pop may have an immediate (either an address or a data value), and all
other instructions remove their operands from the stack and replace
them with the result.  The first four entries of the stack can be
stored in the processor; if the stack grows deeper, entries will be
implicitly moved to memory by the processor (no extra instructions are
needed, but the memory references will stall instructions in progress
and must be included in the number of memory references).  There is a
DUP operation that duplicates the top item on the stack.
Instructions take one byte, plus the length of any immediate.
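
For illustration only (again with made-up mnemonics and labels, and
taking "push AddrY" to push the value stored at AddrY), X = Y + Z
might look like this on the stack machine:

	push AddrY	; push M[AddrY] onto the stack     (3 bytes)
	push AddrZ	; push M[AddrZ]                    (3 bytes)
	add		; pop two entries, push their sum  (1 byte)
	pop  AddrX	; pop the result into M[AddrX]     (3 bytes)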

Memory/memory -- All three operands may be in memory.  There are 4
general purpose registers d0-d3 available for use (the others are
already being used by other parts of the program).  Instructions take
	1 byte for opcode
	4 bits for each register specifier
	immediate length for each immediate specifier 
You may use any VAX instruction listed in PH or HP.  All instructions
must occupy an integer number of bytes, and an instruction may take
only 2 operands.
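
For illustration only, X = Y + Z could be a single three-operand
instruction on the memory/memory machine (made-up labels again; see
the worked example in the addenda below for how the bits are counted):

	ADD AddrX, AddrY, AddrZ		; M[AddrX] = M[AddrY] + M[AddrZ]
					; 8 bits opcode + 3 x 16 bit addresses
					; = 56 bits = 7 bytes of instruction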

Load/store -- All operations occur in registers, and register-register
instructions have three operands.  There are 8 general purpose
registers R0-R7 available for use.  The immediate addresses used in
this example are short enough to fit in an instruction.  All
instructions are 4 bytes.
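
For illustration only (made-up mnemonics and labels), X = Y + Z might
look like this on the load/store machine, at 4 bytes per instruction:

	LW  R1, AddrY		; R1 = M[AddrY]
	LW  R2, AddrZ		; R2 = M[AddrZ]
	ADD R3, R1, R2		; R3 = R1 + R2
	SW  R3, AddrX		; M[AddrX] = R3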

The addresses AddrA, AddrD, and AddrE can be viewed as 2 byte
immediates for the accumulator, stack and memory/memory machines.  A
load or store instruction on the load/store machine can fit one of
these addresses within its 4 byte encoding.

All the data are 32 bit unsigned integers.

Using suitable assembly languages (destination registers always come
first) and the register identifiers shown above, translate the
following C code:

	if (E & 0x00000001)	// if bottom bit of E is set
	    A = A + D;
	D = D + D;		// double D
	E = E >> 1;		// shift E right 1 bit

All data is in memory to start with, and all registers/stacks are
empty.  Give the assembly for each architecture.

How much program storage is required by each architecture (instruction
bytes)?

How much memory bandwidth is required by each architecture
(instruction bytes + data bytes)?

Rank the architectures in terms of code size and memory bandwidth
requirements.  Now consider the following code fragment (DO NOT
translate into assembly):

   for(i = 0; i < 100; i++)
   {    A[i] = 0;
	D = B[i];
	E = C[i];
	while(E > 0)
	{   if (E & 0x00000001)	// if bottom bit of E is set
	        A[i] = A[i] + D;
	    D = D + D;
	    E = E >> 1;		// shift E right 1 bit
	}	
   }

This code produces A[] = B[] * C[] without using multiplication.  If
this code were compiled onto each of the architectures listed above,
would your rankings for code size and/or memory bandwidth change?
Justify your answer in a few sentences.

------------------------------------------------------------------------------

Question 2: Number of registers, adding caches, price/performance (15 marks)
(taken from HP Q 3.7, p.135)

We have a proposal for three different machines -- M0, M8 and M16 --
that differ in their register count.  All three machines have three
operand instructions, and any operand can be either a memory reference
or a register.  The cost of a memory operand on these machines is six
cycles and the cost of a register operand is one cycle.  Each of the
three operands has an equal probability of being in a register (this
is a somewhat kludgy memory-memory architecture).

The differences among the machines are described in the following
table.  The execution cycles per operation are in addition to the cost
of operand access.  The probability of an operand being in a register
applies to each operand individually and is based on figures 3.28
(p.117) and 3.29 (p.118).

				execution cycles	probability of
machine		registers	excluding operand	an operand being
				access time		in a register

M0		0		4 cycles		0.0
M8		8		5 cycles		0.5
M16		16		6 cycles		0.8

So, for example, if M8 executed an instruction where 2 of the 3
operands were in registers (probability 3 x 0.5^2 x 0.5 = 0.375, since
there are three ways to choose which operand is in memory), it would
take 13 cycles (5 + 1*6 + 2*1 = 13).

What is the cycle count for an average instruction in each machine?

Now we consider adding a cache to each machine.  On the old machines,
memory access costs were:
	1 cycle to calculate address ALWAYS
	5 cycles to access memory ALWAYS
for a total of 1 + 5 = 6.

The new cache will take an extra cycle to check on all references, but
memory will NOT have to be accessed after that extra cycle 90% of the
time:
	1 cycle to calculate address ALWAYS
	1 cycle to check cache ALWAYS
	5 cycles to access memory 10% of the time
With the cache added, what is the average number of cycles to access a
memory operand?

With the new cache, what is the cycle count for an average instruction
on each machine?

The chip for each machine, without a cache, costs $500.  Adding a
cache to the chip increases the cost of the chip by $100.  What are
the price/performance values for each of the six chips, and which is
the best chip?

------------------------------------------------------------------------------

Question 3: Adding Indexing Mode (15 marks)
(taken from HP Q 4.9 p.194)

Consider adding a new index addressing mode to DLX (remember that DLX
is basically identical to MIPS).  The addressing mode adds two
registers and an 11-bit signed offset to get the effective address.

Our compiler will be changed so that code sequences of the form
	ADD R1, R1, R2
	LW  Rd, 0(R1)		(or stores)

will be replaced with a load (or store) using the new addressing mode.
Use the overall average instruction frequencies from figure 3.32 in
evaluating this modification (the l/s machine in this figure is the
DLX).
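
For illustration (the assembler syntax here is made up), the pair of
instructions above would collapse into a single load such as

	LW  Rd, 0(R1+R2)	; Rd = M[R1 + R2 + 0]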

a) Assume that the addressing mode can be used for 10% of the loads
and stores (accounting for both the frequency of this type of address
calculation and the shorter offset).  What is the ratio of instruction
counts on the enhanced DLX (eDLX) compared to the original DLX (ie,
for 1 instruction executed on DLX, how many are executed on eDLX)?

b) If the new addressing mode lengthens the clock cycle by 5% (extra
time to access the register file twice occasionally), which machine is
faster and by how much?

------------------------------------------------------------------------------

Question 4: Instruction Length (20 marks)
(taken from HP Q 3.3 p.133)

We are designing instruction set formats for a l/s architecture and
are trying to decide whether it is worthwhile to have multiple offset
lengths for branches and memory references.  We will use average
measurements for the three benchmarks (TeX, Spice, GCC) to make the
decision.  We have decided that the offsets will be the same for these
two classes of instructions.  The length of an instruction would be
equal to 16 bits + offset length in bits.  ALU instructions will be 16
bits.  From averaging and accumulating figures 3.13 (p.100) and 3.19
(p.106), the following table has been created.  Note that the table
only considers the magnitude of the displacement -- an additional bit
will be needed for the sign bit.

We assume that instructions will always be an integral number of bytes
in length, and so only the appropriate offset sizes are shown below
(for example, a 7 bit magnitude plus a sign bit gives an 8 bit offset
field and a 24 bit instruction).

offset bits		cumulative data refs		cumulative branches
  0				 16%				  0%
  7				 62%				 89%
 15				100%				100%

(if you are interested, the rest of the table is in HP p.133)

For instruction set frequencies, average the data for the three
benchmarks in figure 3.32.

a) Using a variable length instruction format, instructions can be
either 16, 24 or 32 bits long.  What is the average length of an
executed instruction?

b) Suppose we chose a fixed-length format of 24 bits (for everything
including ALU ops).  Offsets longer than 8 bits will require an
additional 24 bits.  For every 1000 bytes fetched by the variable
length machine in (a), how many bytes will be fetched by this machine?

c) Suppose we chose a fixed-length format of 32 bits.  All
instructions will now be 4 bytes.  For every 1000 bytes fetched by
(a), how many will be fetched by this machine?

d) Suppose we went back to a variable length format, but this time all
instructions were either 16 bits (ALU & offset length 0 instructions)
or 32 bits (all instructions requiring longer offsets).  For every
1000 bytes fetched by (a), how many will be fetched by this machine?

e) Assume that all machines have the same cycle time and the same
instruction count per program.  All machines take 1 cycle/byte to process
instructions, plus 2 extra cycles/instruction to decode for machine
(a) and 1 extra cycle/instruction to decode for machine (d) -- the
fixed length machines need no extra decode time.  What are the
relative performances of the 4 machines (time/program)?

------------------------------------------------------------------------------

Question 5: Byte and Halfword accesses (10 marks)
(HP Q 3.6, p.135)

Supporting byte and halfword access requires an alignment network.
Some machines have only word accesses, so that a load of a byte or
halfword takes two instructions (a load and an extract), and a partial
word store takes three instructions (load, insert, store).  Use the
data for the TeX benchmark from figure 3.23 to determine what
percentage of the accesses are to byte or halfwords, and use the data
from TeX on the l/s machines from figure 3.32 to find the frequency of
data transfers.  Assume that loads are twice as frequent as stores
independent of data size.  If all instructions on the machine take one
cycle, what increase in the clock rate must we obtain to make
eliminating partial word accesses a good tradeoff?

------------------------------------------------------------------------------

Question 6: Halfword/byte accesses (20 marks)

We have designed an ISA in which halfword/byte accesses are allowed;
however, the program writer is warned that some implementations may
not handle these accesses directly in hardware (but rather with some
kind of multicycle microcode).  After building our first processor for
this ISA using the microcode solution, we are convinced by the PowerPC
argument and ask our designers to handle byte/halfword accesses in
hardware for our next generation chip.

This question ignores the problem of stores -- we are only dealing
with halfword/byte loads.

If you don't know what a multiplexer is, look at the end of the
question or ask a friend from EE.


a) the microcode design

The old bus design is shown in figure 1.  Halfword or byte accesses
trap into a microcode routine (the microcode is effectively hardwired
assembler instructions).  The microcode has to perform 4 steps:
	1) align the access to a word boundary
	2) load the word
	3) mask out any unwanted bits
	4) shift the wanted bits to the bottom of the register

At the beginning of the microcode routine, the address is loaded into
a microcode register m0 (which is not visible to a regular assembler
programmer).  Assuming that you have four microcode registers m0-m3 to
work in (but you are not allowed to change the general purpose
registers), write a short routine using standard MIPS assembler (no
loading or storing bytes or halfwords though!) to perform the four
tasks above.

You must handle both unsigned and signed values:
	- An unsigned value is extended to 32 bits by zeros on the left
	- A signed value is extended to 32 bits by the value of its most
	    significant bit on the left
Note that MIPS shift right logical puts zeros into the top of the
register, while shift right arithmetic copies the highest bit
currently in the register (ie the sign bit is preserved).  Shift left
logical puts zeros into the bottom of the register -- there is no
shift left arithmetic.
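
For example (this is only an illustration of the shift semantics, not
part of the routine you must write), if m1 held 0xFFFF8000 then

	srl  m2, m1, 16		# m2 = 0x0000FFFF  (zeros enter at the top)
	sra  m3, m1, 16		# m3 = 0xFFFFFFFF  (sign bit copied at the top)
	sll  m2, m1, 8		# m2 = 0xFF800000  (zeros enter at the bottom)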

When you are finished, leave the result in microcode register m0.

Assuming that the entire pipeline has to be flushed on this type of
trap (a 5 cycle penalty), what is the total penalty (in cycles) for such
a byte or halfword access (don't forget the return from interrupt (RFI)
statement at the end of your microcode)?


b) the hardware design

The new design is shown in figure 2.  All the lines are 8 bits wide except:
	1) select lines, which are 2 bits to select among the 4 inputs
	2) bp lines, which bypass the top bit from each byte to the
	   next higher byte, thus allowing sign extension for signed
	   integers.
The select lines are set according to the following table

type of access		select1		select2		select3		select4

full word
  signed		M[0-7]		M[8-15]		M[16-23]	M[24-31]
  unsigned		M[0-7]		M[8-15]		   ?		   ?

half word in bottom half of memory bus (bits 0-15)
  signed		M[0-7]		M[8-15]		bp1 x 8		bp2 x 8
  unsigned		M[0-7]		M[8-15]		  ?		   ?

half word in top half of memory bus (bits 16-31)
  signed		M[16-23]	M[24-31]	  ?		   ?
  unsigned		   ?		   ?		  ?		   ?

byte in bottom byte of memory bus (bits 0-7)
  signed		M[0-7]		bp0 x 8		bp1 x 8		bp2 x 8
  unsigned		   ?		   ?		  ?		   ?

byte in second byte of memory bus (bits 8-15)
  signed		   ?		   ?		  ?		   ?
  unsigned		   ?		   ?		  ?		   ?

byte in third byte of memory bus (bits 16-23)
  signed		   ?		   ?		  ?		   ?
  unsigned		   ?		   ?		  ?		   ?

byte in top byte of memory bus (bits 24-31)
  signed		   ?		   ?		  ?		   ?
  unsigned		   ?		   ?		  ?		   ?

Fill in the table with the missing information.


c) timing the new design

Our designers tell us that we can generate an appropriate address for
the cache by ignoring the bottom two bits of the address (thus
generating a word aligned address).  While the cache access is taking
place, the select lines can be set according to the lookup table
above.  Once the cache access is complete, all that remains is to gate
the data through the multiplexers and onto the register bus.
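
In C terms, the word-aligned address is just the byte address with its
bottom two bits masked off (a sketch, not anything you need to write):

	/* form the word-aligned cache address from a byte address */
	unsigned word_align(unsigned byte_address)
	{
	    return byte_address & ~0x3u;
	}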

Our designers figure that as long as the select lines are set faster
than the cache access, they won't have to increase the cycle time --
there is just barely enough time to access the cache and get the
signal through a single MUX.  They conclude that adding the logic will
not increase the memory access time (on a cache hit) beyond one cycle.
Do you agree with them?  If not, why?


d) decision time

What factors should you examine to determine which method (old
microcode or new hardware) is better?  Don't forget to list the
factors already considered in parts (a) - (c).  You shouldn't have to
explain the items -- an answer less than a paragraph in total will do.


What is a Multiplexer (MUX)?

A fairly simple piece of hardware.  It provides 1 output, 2^n inputs,
and n select lines.  Based on the digital value of the select lines,
one of the inputs is fed to the output (n bits have 2^n different
values).  The MUXes in this question are 1 x 4; they use 2 select
lines to choose among 4 possible inputs.

In practice, the inputs and outputs are interchangeable.
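
A rough C model of one of the 1 x 4 MUXes in figure 2 (a sketch only;
the real thing is combinational logic, not code):

	/* two select bits choose which of the four input bytes
	   drives the 8 bit output */
	unsigned char mux4(unsigned char in0, unsigned char in1,
	                   unsigned char in2, unsigned char in3,
	                   unsigned select)
	{
	    switch (select & 0x3) {
	    case 0:  return in0;
	    case 1:  return in1;
	    case 2:  return in2;
	    default: return in3;
	    }
	}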

------------------------------------------------------------------------------

Question 7: Evaluation

For each of the questions 1-6, how long did it take to answer?

Do you have any other comments about the assignment?

------------------------------------------------------------------------------

Addenda -- fixes and additions

-----------------------------------------

In order to answer (6a), you need to know
	(1) whether the access is byte or halfword
	(2) where the data lies in the word
	(3) whether to treat it as signed or unsigned
I didn't really mean for you to write an instruction decoding routine.

Therefore, when the trap routine is entered, the microcode registers contain
	m0 -	the address of the data (as before)
	m1 -	a 0 if the access is byte
		a 1 if the access is halfword
	m2 -	location of the most significant bit of data (7, 15, 23 or 31) 
	m3 -	a 0 if the data is unsigned
		a 1 if the data is signed

For example, a signed byte at address 0xa96e2731 (ie the second byte
on the memory bus) would set the microcode registers as:
	m0	0xa96e2731
	m1	0
	m2	15
	m3	1

You may overwrite these registers during your routine.  Be careful
though -- there is no way to get the information back once you
overwrite them.

Note that this architecture is Little Endian.
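
If the little-endian layout is confusing, this C sketch (illustration
only -- you do not need it in your microcode, since m2 is handed to
you) shows how a byte's position in its word gives the m2 values above:

	/* the byte at offset (addr & 3) in its word occupies bits
	   8*(addr & 3) .. 8*(addr & 3)+7, so its most significant bit
	   is at 7, 15, 23 or 31; e.g. addr 0xa96e2731 gives 15 */
	unsigned msb_of_byte(unsigned addr)
	{
	    return 8 * (addr & 3) + 7;
	}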

-----------------------------------------

For question (1), here are some clarifications:

>1.  Does a branch or jump label count as an immediate?  If so, how long is it?
>    If not, then what is it?
>E.g.
>    jump Loop1 

The branch or jump label counts as an address; therefore, it is 2
bytes on the stack, accumulator and memory/memory architectures and
fits within the 4 byte instruction on the load/store machine.


>2.  For the accumulator architecture, it says "Only two operand instructions
>    are supported, and one operand must be A or P."  Is the A or P operand
>    implied?  I.e. do we write "Load AddrE" or "Load P, AddrE"?
>    What is the length of the register specifier for A or P?

Get rid of any references to the P register.  Load/store instructions
will take an immediate address, with the data to/from A.  The A
register is implied in the instruction (and no register specifier is
needed).

The P register was meant to simulate a data pointer register, for
register+offset addressing.  But we are already simulating that with
16 bit addresses.  Therefore, pretend there is only an accumulator.


>3.  I'm not sure what is meant by data bytes.  Isn't the length of the
>    data already included in the instruction length?
>    E.g.  "AND d0, AddrE, 0x00000001" has a length of 1 byte (opcode) +
>           4 bits (reg. specifier) + 32 bits (immediate)" in the
>           memory/memory architecture.

Immediate data is already included in the instruction length.  When
you specify an address though, you have a 16 bit immediate plus a 32
bit data reference to memory.  The instruction above has the following stats:

AND             8 bits
d0              4 bits
AddrE           16 bits
0x00000001      32 bits
        total   60 bits

60 is rounded up to 64 (8 bytes), since the instruction must be an
integer number of bytes.  So the instruction length is 8 bytes.


When the instruction is executed, however, the data at AddrE is read.
That data is 32 bits (4 bytes) long (as are all data values).  So the
total memory bandwidth is 8 + 4 = 12 bytes.

-----------------------------------------

Another clarification for question 1:

Another clarification on the assignment was requested in class today.
The assignment is not entirely clear on how branch statements are
implemented on the stack and accumulator architectures.  These
architectures are both old, so we will use an old style branch
statement.

        br*     label

where
        *       is one of eq, ne, lt, gt, le, ge
        label   is the label in your program to branch to
label can be considered a 16 bit immediate address.

The comparison is always done with 0:
        Stack           (top of stack) * 0
        Accumulator     A * 0
where * is a comparison.  The stack version also removes the top
element from the stack after comparing it to zero.

For example, on the accumulator, if you would like to branch to a
label LOOP if A is above zero, use the statement
        brgt    LOOP    ; if A > 0, branch to LOOP

If you want to compare with another constant, use a subtract statement
before the branch.  For example, to branch to R2 if the top of stack
is not equal to 5, use the statements (tos = top of stack)
        push    5       ; put five on top of stack
        swap            ; get the stack in the right order for (old tos) - 5
        sub             ; perform subtraction
        brne    R2      ; branch if tos != 0
                        ; note that tos is popped by this branch

In practice, these architectures all used condition codes.  But I
don't want to deal with the rules for setting condition codes, and
this branch technique is equivalent for the code sequence you are
writing.



-----------------------------------------

Last modified: Feb 97