Questions 2-5 require figures 3.32 and 3.23 from HP (available from me or the reading room). No other material from HP is required. The question numbers from HP are just references -- there is no need to look up the question in HP.
Question 6 requires figures 1 and 2, available from me or in postscript format.
HOMEWORK ASSIGNMENT #1          Out: Tues Jan 28/97    Due: Tues Feb 11/97
CS 418 Spring Session 1997      Marks: 100
------------------------------------------------------------------------------
Question 1: Different basic architectures (20 marks)
(adapted from HP Q 3.5, p.134)

This question compares the memory and instruction encoding efficiency of
four different styles of instruction set on a short code sequence. The
architecture styles are:

Accumulator -- One register is available for ALU operations (A) and one for
    addresses (P). Instructions can use 1 immediate of any length. Only
    two-operand instructions are supported, and one operand must be A or P.
    ALU results must always go in A; loads/stores must always use an
    immediate address or P, and the data must be A or an immediate (note
    that two immediates are not allowed). Data can be moved from A to P and
    back. Instructions are 1 byte, plus the length of any immediates needed.

Stack -- All operations occur on the top of the stack. Only push and pop
    may have an immediate (either an address or a data value); all other
    instructions remove their operands from the stack and replace them with
    the result. The first four entries of the stack can be stored in the
    processor; if the stack grows deeper, entries will be implicitly moved
    to memory by the processor (no extra instructions needed, but the
    memory references will stall instructions in progress and must be
    included in the number of memory references). There is a DUP operation
    that duplicates the top item on the stack. Instructions take one byte,
    plus the length of any immediate.

Memory/memory -- All three operands may be in memory. There are 4 general
    purpose registers d0-d3 available for use (the others are already being
    used by other parts of the program). Instructions take:
        1 byte for the opcode
        4 bits for each register specifier
        the immediate length for each immediate specifier
    You may use any VAX instruction listed in PH or HP.
    All instructions contain an integer number of bytes, but may require
    only 2 operands.

Load/store -- All operations occur in registers, and register-register
    instructions have three operands. There are 8 general purpose registers
    R0-R7 available for use. The immediate addresses used in this example
    are short enough to fit in an instruction. All instructions are 4 bytes.

The addresses AddrA, AddrD, and AddrE can be viewed as 2-byte immediates
for the accumulator, stack and memory/memory machines. A load or store
instruction on the load/store machine will fit one of these addresses. All
the data are 32 bit unsigned integers.

Using suitable assembly languages (destination registers always come
first) and the register identifiers shown above, translate the following
C code:

    if (E & 0x00000001)   // if bottom bit of E is set
        A = A + D;
    D = D + D;            // double D
    E = E >> 1;           // shift E right 1 bit

All data is in memory to start with, and all registers/stacks are empty.
Give the assembly for each architecture. How much program storage is
required by each architecture (instruction bytes)? How much memory
bandwidth is required by each architecture (instruction bytes + data
bytes)? Rank the architectures in terms of code size and memory bandwidth
requirements.

Now consider the following code fragment (DO NOT translate into assembly):

    for (i = 0; i < 100; i++) {
        A[i] = 0;
        D = B[i];
        E = C[i];
        while (E > 0) {
            if (E & 0x00000001)   // if bottom bit of E is set
                A[i] = A[i] + D;
            D = D + D;
            E = E >> 1;           // shift E right 1 bit
        }
    }

This code produces A[] = B[] * C[] without using multiplication. If this
code were compiled onto each of the architectures listed above, would your
rankings for code size and/or memory bandwidth change? Justify your answer
in a few sentences.
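As a quick cross-check of what the inner loop does (this is not part of the required answer, and `shift_add_mul` is a hypothetical name), the shift-add sequence can be run as a C sketch:

```c
#include <stdint.h>

/* sketch: the shift-add sequence above, looped while e > 0,
   computes a = d * e without using a multiply instruction */
uint32_t shift_add_mul(uint32_t d, uint32_t e) {
    uint32_t a = 0;
    while (e > 0) {
        if (e & 0x00000001u)   // if bottom bit of e is set
            a = a + d;
        d = d + d;             // double d
        e = e >> 1;            // shift e right 1 bit
    }
    return a;
}
```

Each iteration adds d shifted left by one more position whenever the corresponding bit of e is set, which is exactly longhand binary multiplication.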
------------------------------------------------------------------------------
Question 2: Number of registers, adding caches, price/performance (15 marks)
(taken from HP Q 3.7, p.135)

We have a proposal for three different machines -- M0, M8 and M16 -- that
differ in their register count. All three machines have three-operand
instructions, and any operand can be either a memory reference or a
register. The cost of a memory operand on these machines is six cycles
and the cost of a register operand is one cycle. Each of the three
operands has equal probability of being in a register (this is a somewhat
kludgy memory-memory architecture). The differences among the machines are
described in the following table. The execution cycles per operation are
in addition to the cost of operand access. The probability of an operand
being in a register applies to each operand individually and is based on
figures 3.28 (p.117) and 3.29 (p.118).

    machine   registers   execution cycles    probability of an operand
                          excluding operand   being in a register
                          access time
    M0         0          4 cycles            0.0
    M8         8          5 cycles            0.5
    M16       16          6 cycles            0.8

So, for example, if M8 executed an instruction where 2 of the 3 operands
were in registers (probability 0.375), it would take 13 cycles
(5 + 1*6 + 2*1 = 13). What is the cycle count for an average instruction
in each machine?

Now we consider adding a cache to each machine. On the old machines,
memory access costs were:

    1 cycle to calculate address    ALWAYS
    5 cycles to access memory       ALWAYS

for a total of 1 + 5 = 6. The new cache will take an extra cycle to check
on all references, but memory will NOT have to be accessed after that
extra cycle 90% of the time:

    1 cycle to calculate address    ALWAYS
    1 cycle to check cache          ALWAYS
    5 cycles to access memory       10% of the time

With the cache added, what is the average number of cycles to access a
memory operand? With the new cache, what is the cycle count for an average
instruction on each machine? The chip for each machine, without a cache,
costs $500.
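The worked M8 example above can be reproduced with a short sketch (the function names are hypothetical, and this is not part of the required answer):

```c
/* sketch: cost of an M8 instruction with 2 of the 3 operands in registers */
double m8_two_reg_cost(void) {
    /* 5 execution cycles + 2 register operands (1 cycle each)
       + 1 memory operand (6 cycles) */
    return 5 + 2*1 + 1*6;
}

/* probability of exactly 2 of 3 operands in registers when p = 0.5:
   C(3,2) * p^2 * (1-p) */
double m8_two_reg_prob(void) {
    return 3 * 0.5 * 0.5 * 0.5;
}
```

The average instruction cost asked for is the same kind of expectation, summed over all four operand placements (0, 1, 2 or 3 operands in registers).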
Adding a cache to the chip increases the cost of the chip by $100. What
are the price/performance values for each of the six chips, and which is
the best chip?
------------------------------------------------------------------------------
Question 3: Adding Indexing Mode (15 marks)
(taken from HP Q 4.9, p.194)

Consider adding a new index addressing mode to DLX (remember that DLX is
basically identical to MIPS). The addressing mode adds two registers and
an 11-bit signed offset to get the effective address. Our compiler will be
changed so that code sequences of the form

    ADD R1, R1, R2
    LW  Rd, 0(R1)

(or stores) will be replaced with a load (or store) using the new
addressing mode. Use the overall average instruction frequencies from
figure 3.32 in evaluating this modification (the l/s machine in this
figure is the DLX).

a) Assume that the addressing mode can be used for 10% of the loads and
   stores (accounting for both the frequency of this type of address
   calculation and the shorter offset). What is the ratio of instruction
   counts on the enhanced DLX (eDLX) compared to the original DLX (ie,
   for 1 instruction on DLX, how many are executed on eDLX)?

b) If the new addressing mode lengthens the clock cycle by 5% (extra time
   to access the register file twice occasionally), which machine is
   faster and by how much?
------------------------------------------------------------------------------
Question 4: Instruction Length (20 marks)
(taken from HP Q 3.3, p.133)

We are designing instruction set formats for a l/s architecture and are
trying to decide whether it is worthwhile to have multiple offset lengths
for branches and memory references. We will use average measurements for
the three benchmarks (TeX, Spice, GCC) to make the decision. We have
decided that the offsets will be the same for these two classes of
instructions. The length of an instruction would be equal to 16 bits +
offset length in bits. ALU instructions will be 16 bits.
From averaging and accumulating figures 3.13 (p.100) and 3.19 (p.106), the
following table has been created. Note that the table only considers the
magnitude of the displacement -- an additional bit will be needed for the
sign. We assume that instructions will always be an integral number of
bytes in length, so only the appropriate offsets are shown below.

    offset bits   cumulative data refs   cumulative branches
         0              16%                      0%
         7              62%                     89%
        15             100%                    100%

(If you are interested, the rest of the table is in HP p.133.)

For instruction set frequencies, average the data for the three benchmarks
in figure 3.32.

a) Using a variable-length instruction format, instructions can be either
   16, 24 or 32 bits long. What is the average length of an executed
   instruction?

b) Suppose we chose a fixed-length format of 24 bits (for everything,
   including ALU ops). Offsets longer than 8 bits will require an
   additional 24 bits. For every 1000 bytes fetched by the
   variable-length machine in (a), how many bytes will be fetched by this
   machine?

c) Suppose we chose a fixed-length format of 32 bits. All instructions
   will now be 4 bytes. For every 1000 bytes fetched by (a), how many
   will be fetched by this machine?

d) Suppose we went back to a variable-length format, but this time all
   instructions were either 16 bits (ALU and offset-length-0
   instructions) or 32 bits (all instructions requiring longer offsets).
   For every 1000 bytes fetched by (a), how many will be fetched by this
   machine?

e) Assume that all machines have a constant time/cycle and
   instructions/program. All machines take 1 cycle/byte to process
   instructions, plus 2 extra cycles/instruction to decode for machine
   (a) and 1 extra cycle/instruction to decode for machine (d) -- the
   fixed-length machines need no extra decode time. What are the relative
   performances of the 4 machines (time/program)?
------------------------------------------------------------------------------
Question 5: Byte and Halfword accesses (10 marks)
(HP Q 3.6, p.135)

Supporting byte and halfword access requires an alignment network. Some
machines have only word accesses, so that a load of a byte or halfword
takes two instructions (a load and an extract), and a partial-word store
takes three instructions (load, insert, store). Use the data for the TeX
benchmark from figure 3.23 to determine what percentage of the accesses
are to bytes or halfwords, and use the data from TeX on the l/s machine
from figure 3.32 to find the frequency of data transfers. Assume that
loads are twice as frequent as stores, independent of data size. If all
instructions on the machine take one cycle, what increase in the clock
rate must we obtain to make eliminating partial-word accesses a good
tradeoff?
------------------------------------------------------------------------------
Question 6: Halfword/byte accesses (20 marks)

We have designed an ISA in which halfword/byte accesses are allowed;
however, the program writer is warned that some implementations may not
handle these accesses directly in hardware (but rather with some kind of
multicycle microcode). After building our first processor for the ISA
using this microcode solution, we are convinced by the PowerPC argument
and ask our designers to handle byte/halfword accesses in hardware for our
next-generation chip. This question ignores the problem of stores -- we
are only dealing with halfword/byte loads. If you don't know what a
multiplexer is, look at the end of the question or ask a friend from EE.

a) the microcode design

The old bus design is shown in figure 1. Halfword or byte accesses trap
into a microcode routine (the microcode is effectively hardwired assembler
instructions).
The microcode has to perform 4 steps:
    1) align the access to a word boundary
    2) load the word
    3) mask out any unwanted bits
    4) shift the wanted bits to the bottom of the register

At the beginning of the microcode routine, the address is loaded into a
microcode register m0 (which is not visible to a regular assembler
programmer). Assuming that you have four microcode registers m0-m3 to work
in (but you are not allowed to change the general purpose registers),
write a short routine using standard MIPS assembler (no loading or storing
bytes or halfwords though!) to perform the four tasks above. You must
handle both unsigned and signed values:
    - An unsigned value is extended to 32 bits by zeros on the left.
    - A signed value is extended to 32 bits by the value of its most
      significant bit on the left.
Note that MIPS shift right logical puts zeros into the top of the
register, while shift right arithmetic copies the highest bit currently in
the register (ie the sign bit is preserved). Shift left logical puts zeros
into the bottom of the register -- there is no shift left arithmetic.
When you are finished, leave the result in microcode register m0.

Assuming that the entire pipeline has to be flushed on this type of trap
(a 5 cycle penalty), what is the total penalty (in cycles) for an
unaligned access (don't forget the return from interrupt (RFI) statement
at the end of your microcode)?

b) the hardware design

The new design is shown in figure 2. All the lines are 8 bits wide except:
    1) select lines, which are 2 bits to select among the 4 inputs
    2) bp lines, which bypass the top bit from each byte to the next
       higher byte, thus allowing sign extension for signed integers

The select lines are set according to the following table:

    type of access       select1    select2    select3    select4
    full word
      signed             M[0-7]     M[8-15]    M[16-23]   M[24-31]
      unsigned           M[0-7]     M[8-15]    ?          ?
    half word in bottom half of memory bus (bits 0-15)
      signed             M[0-7]     M[8-15]    bp1 x 8    bp2 x 8
      unsigned           M[0-7]     M[8-15]    ?          ?
    half word in top half of memory bus (bits 16-31)
      signed             M[16-23]   M[24-31]   ?          ?
      unsigned           ?          ?          ?          ?
    byte in bottom byte of memory bus (bits 0-7)
      signed             M[0-7]     bp0 x 8    bp1 x 8    bp2 x 8
      unsigned           ?          ?          ?          ?
    byte in second byte of memory bus (bits 8-15)
      signed             ?          ?          ?          ?
      unsigned           ?          ?          ?          ?
    byte in third byte of memory bus (bits 16-23)
      signed             ?          ?          ?          ?
      unsigned           ?          ?          ?          ?
    byte in top byte of memory bus (bits 24-31)
      signed             ?          ?          ?          ?
      unsigned           ?          ?          ?          ?

Fill in the table with the missing information.

c) timing the new design

Our designers tell us that we can generate an appropriate address for the
cache by ignoring the bottom two bits of the address (thus generating a
word-aligned address). While the cache access is taking place, the select
lines can be set according to the lookup table above. Once the cache
access is complete, all that remains is to gate the data through the
multiplexers and onto the register bus. Our designers figure that as long
as the select lines are set faster than the cache access completes, they
won't have to increase the cycle time -- there is just barely enough time
to access the cache and get the signal through a single MUX. They conclude
that adding the logic will not increase the memory access time (on a cache
hit) beyond one cycle. Do you agree with them? If not, why not?

d) decision time

What factors should you examine to determine which method (old microcode
or new hardware) is better? Don't forget to list the factors already
considered in parts (a) - (c). You shouldn't have to explain the items --
an answer of less than a paragraph in total will do.

What is a Multiplexer (MUX)?

A fairly simple piece of hardware. It provides 1 output, 2^n inputs, and
n select lines. Based on the digital value of the select lines, one of the
inputs is fed to the output (n bits have 2^n different values). The MUXes
in this question have 4 inputs; they use 2 select lines to choose among
the 4 possible inputs. In practice, the inputs and outputs are
interchangeable.
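As a sanity check (this is not the required MIPS routine, and `load_sub_word` and `mux4` are hypothetical names), the four microcode steps and a single 4-input MUX can be modelled in C, with memory simplified to an array of aligned words:

```c
#include <stdint.h>

/* sketch: the four microcode steps for a little-endian byte/halfword load --
   1) align, 2) load the word, 3) mask, 4) shift -- plus sign extension */
int32_t load_sub_word(const uint32_t *mem, uint32_t addr,
                      int is_halfword, int is_signed) {
    uint32_t word  = mem[(addr & ~3u) >> 2];   /* 1) align to a word, 2) load */
    unsigned shift = (addr & 3u) * 8;          /* little endian: low address = low byte */
    unsigned width = is_halfword ? 16 : 8;
    uint32_t mask  = (1u << width) - 1;
    uint32_t val   = (word >> shift) & mask;   /* 3) mask, 4) shift to bottom */
    if (is_signed && (val & (1u << (width - 1))))
        val |= ~mask;                          /* extend with the most significant bit */
    return (int32_t)val;
}

/* sketch: one of the figure-2 multiplexers -- 2 select bits pick 1 of 4
   8-bit inputs */
uint8_t mux4(const uint8_t in[4], unsigned select) {
    return in[select & 3u];
}
```

For example, a signed byte load from the second byte of the word 0x0000FF00 yields -1 after masking, shifting and sign extension.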
------------------------------------------------------------------------------
Question 7: Evaluation

For each of the questions 1-6, how long did it take to answer? Do you have
any other comments about the assignment?
------------------------------------------------------------------------------
Addenda -- fixes and additions
-----------------------------------------
In order to answer (6a), you need to know
    (1) whether the access is byte or halfword
    (2) where the data lies in the word
    (3) whether to treat it as signed or unsigned
I didn't really mean for you to write an instruction decoding routine.
Therefore, when the trap routine is entered, the microcode registers
contain:
    m0 -- the address of the data (as before)
    m1 -- 0 if the access is a byte, 1 if the access is a halfword
    m2 -- location of the most significant bit of the data (7, 15, 23 or 31)
    m3 -- 0 if the data is unsigned, 1 if the data is signed
For example, a signed byte at address 0xa96e2731 (ie the second byte on
the memory bus) would set the microcode registers as:
    m0   0xa96e2731
    m1   0
    m2   15
    m3   1
You may overwrite these registers during your routine. Be careful though
-- there is no way to get the information back once you overwrite it.
Note that this architecture is Little Endian.
-----------------------------------------
For question (1), there are some clarifications.

>1. Does a branch or jump label count as an immediate? If so, how long is it?
>   If not, then what is it?
>   E.g.
>       jump Loop1

The branch or jump label counts as an address; therefore, it is 2 bytes on
the stack, accumulator and memory/memory architectures and fits within the
4-byte instruction on the load/store machine.

>2. For the accumulator architecture, it says "Only two operand instructions
>   are supported, and one operand must be A or P." Is the A or P operand
>   implied? I.e. do we write "Load AddrE" or "Load P, AddrE"?
>   What is the length of the register specifier for A or P?

Get rid of any references to the P register.
Load/store instructions will take an immediate address, with the data
to/from A. The A register is implied in the instruction (and no register
specifier is needed). The P register was meant to simulate a data pointer
register, for register+offset addressing. But we are already simulating
that with 16 bit addresses. Therefore, pretend there is only an
accumulator.

>3. I'm not sure what is meant by data bytes. Isn't the length of the
>   data already included in the instruction length?
>   E.g. "AND d0, AddrE, 0x00000001" has a length of 1 byte (opcode) +
>   4 bits (reg. specifier) + 32 bits (immediate) in the
>   memory/memory architecture.

Immediate data is already included in the instruction length. When you
specify an address, though, you have a 16 bit immediate plus a 32 bit data
reference to memory. The instruction above has the following stats:

    AND          8 bits
    d0           4 bits
    AddrE       16 bits
    0x00000001  32 bits
    total       60 bits

60 is rounded up to 64 bits (8 bytes), since the instruction must be an
integer number of bytes. So the instruction length is 8 bytes. When the
instruction is executed, however, the data at AddrE is read. That data is
32 bits (4 bytes) long (as are all data values). So the total memory
bandwidth is 8 + 4 = 12 bytes.
-----------------------------------------
Another clarification for question 1, requested in class today:

The assignment is not entirely clear on how branch statements are
implemented on the stack and accumulator architectures. These
architectures are both old, so we will use an old-style branch statement:

    br* label

where * is one of eq, ne, lt, gt, le, ge, and label is the label in your
program to branch to. label can be considered a 16 bit immediate address.
The comparison is always done with 0:

    Stack          (top of stack) * 0
    Accumulator    A * 0

where * is a comparison. The stack version also removes the top element
from the stack after comparing it to zero.
For example, on the accumulator, if you would like to branch to a label
LOOP if A is above zero, use the statement

    brgt LOOP   ; if A > 0, branch to LOOP

If you want to compare with another constant, use a subtract statement
before the branch. For example, to branch to R2 if the top of stack is not
equal to 5, use the statements (tos = top of stack):

    push 5      ; put five on top of stack
    swap        ; get the stack in the right order for (old tos) - 5
    sub         ; perform subtraction
    brne R2     ; branch if tos != 0
                ; note that tos is popped by this branch

In practice, these architectures all used condition codes. But I don't
want to deal with the rules for setting condition codes, and this branch
technique is equivalent for the code sequence you are writing.
-----------------------------------------
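The bit-counting arithmetic in the question-1 clarification above (60 bits rounding up to 8 bytes) can be sketched as follows; `instr_bytes` is a hypothetical name:

```c
/* sketch: round an instruction's bit count up to a whole number of bytes */
unsigned instr_bytes(unsigned bits) {
    return (bits + 7) / 8;
}
```

For the AND example, instr_bytes(8 + 4 + 16 + 32) gives 8 bytes of instruction; executing it also reads the 4-byte datum at AddrE, for 8 + 4 = 12 bytes of memory bandwidth.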