112 possible marks Marked out of 100 --------------------------------------------------------------------------- 1) Factors Affecting Performance (18 pts) 4 pts for each of the 3 terms in part (a) 2 pts for each of the 3 terms in part (b) ---------------------------------------- a) ISA effects (machine instr)/(program): The more tasks an instruction can perform, the fewer instructions are needed in a program. For example: a memory-memory machine like the VAX can use a single instruction to add the values stored in two memory locations, and place the result in a third. A load-store ISA such as the MIPS would require 4 instructions to perform a similar feat (2 loads, an add, and a store). (cycles)/(machine instr) The more inherently sequential tasks an instruction can perform, the more cycles that instruction will take to complete. For example, the memory-memory VAX instruction listed above requires two accesses to memory, an addition, and another access to memory to be completed sequentially (even if the address calculations are carried out in parallel and behind the scenes). It will therefore take at least 4 cycles to complete. If average instructions of the ISA complete many tasks, and thus take many cycles, the average CPI of the ISA will increase. (time)/(cycle) This term is still affected by ISA choices, although less so than the previous two. The maximum time to execute a subtask of an instruction determines the cycle length. For example, in many older processors, the instruction decoding time was a determining factor in cycle time: CISC instructions took a long time to decode, decoding could not be spread over multiple cycles, and so it fixed the cycle time. Another ISA effect that could influence cycle length is the inclusion or exclusion of instructions requiring length execution. For example, long shifts are very slow. Some instruction sets chose to limit the length of shifts to reduce cycle time (a long shift was the slowest operation in the ALU). Some early RISC machines chose to implement multiply or divide as multiple instructions as well, to reduce the maximum ALU time. ---------------------------------------- b) Most important other effects (machine instr)/(program) a good compiler may be able to reduce the size of a program (or it may choose to increase the size to reduce execution time -- loop unrolling is an example) (cycles)/(machine instr) processor organization: pipelining, multiple execution units, out-of-order execution, etc. (time)/(cycle) process technology: smaller feature size gives faster machine, different process technologies (CMOS, BiCMOS, GaAs, etc.) run faster or slower. --------------------------------------------------------------------------- 2) Reducing the Instruction Set (24 pts -- 8 each) I will use a MIPS R2000 style processor as my example RISC (ie example pipeline, ISA, etc.). first two reasons up to 3 marks each third reason up to 2 marks ---------------------------------------- a) Autoincrement addressing mode - complicates interrupt handling if the address register is incremented before all interrupts have been detected -- the instruction may cause an exception after the processor state is modified. - relatively easy to synthesize out of two simple instructions: a load using the same address register, followed by an appropriate addition on that address register. - possible structural hazard writing both the loaded value and the incremented value back to the register file. - possible WAW and WAR hazards if the address register is written too early in the pipeline (RISC pipelines like the MIPS read early and write late to avoid these hazards). - possible structural hazard calculating the load address and incrementing the register in the ALU if autoincrement with displacement is allowed. ---------------------------------------- b) Absolute addressing for branch targets - very difficult to encode in a fixed instruction length: an absolute address which includes the entire address space will be 32 to 64 bits long, which is as long or longer than a RISC instruction (leaving no room for the opcode). - easy to synthesize out of three simple instructions: a load high to load the 16 high bits of the address into a temporary register, an addi to load the 16 low bits of the address into the same temporary register, and a jr to jump to the address pointed to by the register. - very infrequently needed: most branches use PC relative to address their target, so that the code can be easily relocated in memory. ---------------------------------------- c) String Move - interrupt very difficult to handle during the move: what if the source or target string overlap a page boundary? How can the current (partially finished) state of the instruction be saved? - easily synthesized with a short loop (load byte, store byte, increment pointers, branch if not done). The branch is not crippling because it can be accurately predicted or the loop can be unrolled. - very difficult to pipeline with other instructions, since each byte moved will require several passes through address calculation and memory access stages (structural hazards). The string move would have to grab the execution pipeline and use it over and over. - data hazards may develop if the source and destination strings overlap in memory. --------------------------------------------------------------------------- 3) RISC and CISC. What lousy names (12 pts) 8 for the differences (2 each) 4 for the PPC explanation (2 for each reason which supports your argument) RISC features: - fixed length instructions with simple formats - load-store architecture: ALU operates only on registers - simple addressing modes to memory - lots of registers RISC optional features: - no instruction side-effects: an instruction affects at most 1 register - aligned accesses - low latency branches CISC features: - variable length instructions, with many complex formats - memory-memory architecture: ALU operates or registers or memory - lots of addressing modes - instruction side-effects: autoincrement addressing, condition codes, etc. The key objective of RISC machines: easy pipelining The key objective of CISC machines: minimize instructions I would support my statement: the PPC has all of the RISC features (although none of the optional ones). In particular, it has fixed length instructions and all ALU operands must be in registers, which makes it decidedly different than the VAX. --------------------------------------------------------------------------- 4) Faster Processors (25 pts -- 5 each) 2 for the problem, 3 for the reason. ---------------------------------------- a) ROB: Reorder Buffer - allows instructions to be retired (comitted) in-order despite out-of-order execution, which in turn allows precise interrupts even if instructions complete out-of-order. When an instruction dispatches (in-order) it reserves the next slot in the ROB. When an instruction completes execution, its result is written to the ROB, not the register file. Only when all preceeding instructions are finished and written to the register file, may an instruction write its own result. When an interrupt occurs, the register file is in a state consistent with sequential execution up until (but not including) the instruction at the head of the ROB. - avoids WAW and WAR hazards. If instructions can pull their source operands from the ROB instead of the register file, and results are written to the ROB first, then instructions issuing out-of-order will be able to get/put the appropriate value from/to the ROB or register file without worrying about instructions which should have executed previously or subsequently. ---------------------------------------- b) multiple integer ALUs - avoids structural hazards caused by several different instructions, possibly in different pipeline stages, all needing integer arithmetic at once. In the MIPS R2000, for example, three separate ALUs allow a branch in ID to both determine its condition and its target at the same time another instruction is performing a general ALU operation in the ALU stage. More advanced processors may have many instructions in execution at once, each needing an ALU op. ---------------------------------------- c) Tomasulo's algorithm and its derivatives: reservation stations - allows out-of-order instruction issue. Multiple instructions, all unable to issue due to RAW hazards, may wait in reservation stations after dispatch but before issue. All the stations watch the results coming from the execution units; any instructions needing a result will load the result into a field of the reservation station. Any instruction in a reservation station with all its operands may issue, even it that issue will take it out of program order. ---------------------------------------- d) separate ALU and MEM pipeline stages - allows displacement addressing. If loads/stores do not have access to an ALU before making a memory request, the address must be an absolute address present in the instruction or an address pointed to directly by a register, since it is impossible to add an offset to the register's value without an ALU. ---------------------------------------- e) explicitly set instead of implicitly set condition codes - allows out-of-order execution without worrying about incorrect branching. When instructions set various condition codes as a side effect, and those condition codes are used to determine branch conditions, then the processor must ensure that the condiiion codes read by the branch are the codes that would have been set under sequential execution. With explicitly set condition codes it is very easy for the compiler to inform the processor which instruction sets the condition codes for a particular branch -- the nearest preceeding instruction which explicitly sets the condition codes. With implicitly set condition codes, the processor may be unable to determine which preceeding instruction set the condition code, and may have to resort to sequential execution. --------------------------------------------------------------------------- 5) Instruction Level Parallelism (15 pts -- 5 each) ---------------------------------------- a) An ISA in which the execution of each instruction is assumed to be completed before the next is started. Sequential execution ISAs give the illusion that no parallel execution is occurring. ---------------------------------------- b) They both assume that several instructions may be executed in parallel, and require the programmer or compiler to generate code which is correct despite parallel execution. The VLIW architecture requires an instruction to specify which other instruction(s) may be executed in parallel with it. The dataflow architecture requires an instruction to specify which other instruction(s) must execute before it. ---------------------------------------- c) The MIPS R2000 fails to support the sequential execution architecture assumption because it uses delayed branches -- the instruction directly after a branch will be executed regardless of the outcome of the branch. A delayed branch can be a problem if an exception occurs during the execution of the instruction in the delay slot. When an exception occurs, normally the excepting instruction's PC is saved, so that execution can resume at that location after handling the exception. Upon resumption, instructions are assumed to follow sequentially from the excepting instruction. If the excepting instruction was in the delay slot of a taken branch, however, then the correct instruction to place directly after the excepting instruction will not be the next sequential instruction, but rather the target of the branch. The MIPS R2000 solves this problem by saving the PCs of the next two instructions when an exception occurs. --------------------------------------------------------------------------- 6) All these funky units... (18 pts) lose 2 for an incorrect or missed stage, lose 1 if why is incorrect unit stage(s) why a 1 to fetch the instructions from the physically addressed cache based on the virtually addressed PC b 2 for register renaming to deal with WAW and WAR hazards 5 for keeping the appropriate mapping between logical and physical registers (based on which instructions have been retired) in case of an interrupt c 3 to keep track of which instructions are still waiting for operands and may not issue yet d 4 to execute the various possible instructions e 2 to provide sources for instructions 5 to provide destinations for instructions f 4 to provide the memory port for executing loads and stores 5 returning the result for loads and determining whether stores should proceed g 1 to provide for icache misses 4 to provide for dcache misses ---------------------------------------------------------------------------CPSC 418 Home Page
Last modified: Apr 97