CPSC 418, Problem Set #3 Solutions

Problem 1: ISA

We've learned in this course the goal of defining an ISA: it provides a clean interface for what the programmer or compiler can assume about the processor, thereby allowing changes to software and hardware implementation without breaking software compatibility. Unfortunately, it's not always clear exactly what's part of the ISA and what's part of the implementation. For example, AMD's K6 processor line is supposed to be completely compatible with Intel's processors. When the K6 first came out, there was a problem with a Windows device driver that required a patch to fix. It turns out that this device driver was using a busy loop in software to generate a fixed time delay. In the K6, the loop instruction had been optimized, so this delay loop ran too fast on the K6.

(a) Was the way Microsoft wrote the delay loop good programming practice? Were they obeying the ISA, or were they relying on implementation details? Justify your answer.

Good programming practice? No. Not for a desktop computer. The only time delay loops like that are justified is in tiny embedded applications where you have no other means of getting timing information.

Were they observing the ISA or exploiting implementation details? I prefer the answer that this was an implementation detail, but either answer is OK as long as you justify it properly. (Sort of like an arts class, eh?)

For ISA: The ISA is what's programmer visible. If you look at a manual for one of these processors, it will tell you how long each instruction takes. Therefore, these timings are part of the spec, are visible to the programmer, and can be used by the programmer. (This argument applies more for embedded processors, but one can make the argument.)

Against ISA: While the timings are in some sense visible, the whole point of an ISA is to allow various implementations. Obviously, we want the freedom for different implementations to run faster or slower, so it's against the spirit of the ISA to rely on the specific instruction timings of a specific implementation.

(b) Was it OK for AMD to change the timing of the loop instruction? Was this a violation of the ISA? Justify your answer.

OK to change? Sure. That's the whole point of making faster processors.

Violation of ISA? You should be consistent with your answer in part (a).

(c) Recently, AMD has introduced new 3D-Now instructions to perform faster single-precision floating point. Similarly, Intel has introduced Streaming SIMD Extensions, which are new numeric instructions for graphics. Are these changes to the ISA? If so, how can one write code that will run on both platforms?

Yes, these are changes to the ISA, as they add new instructions. Any code that uses the new instructions won't run on a machine that doesn't support them (unless you add special operating system support to trap the unimplemented instructions). If you want to be truly portable, you've got to avoid the added instructions. Otherwise, you could try relying on the operating system to encapsulate the new instructions in library routines (e.g. OpenGL) or provide software emulation.
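As an aside (not part of the assigned solution), one common way to get the best of both worlds is to detect processor features at run time and dispatch to an appropriate code path. Here is only a minimal Python sketch of the idea; it assumes a Linux-style /proc/cpuinfo, and the has_sse, fast_path, and portable_path names are made up purely for illustration:

    # Minimal sketch: run-time feature detection with a portable fallback.
    # Assumes Linux's /proc/cpuinfo; other systems expose CPUID differently.

    def has_sse():
        try:
            with open("/proc/cpuinfo") as f:
                for line in f:
                    if line.startswith("flags"):
                        return "sse" in line.split()
        except OSError:
            pass
        return False  # be conservative: assume the feature is absent

    def fast_path(data):
        # stand-in for code compiled to use the new SSE instructions
        return sum(data)

    def portable_path(data):
        # stand-in for code that avoids the new instructions entirely
        return sum(data)

    process = fast_path if has_sse() else portable_path
    print(process([1.0, 2.0, 3.0]))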
Problem 2: CISC vs. RISC

As senior architect at Foobar Computer Corp., you've been charged with updating the Foobium II processor by adding Intel's new Streaming SIMD Extension instructions. Your design teams have come up with two options. Option 1 is that they can add functional units for the new instructions. This will allow the new instructions to execute in 1 CPI. Unfortunately, the added complexity reduces the clock frequency by ten percent. Option 2 is to handle the new instructions in additional microcode, just as the complex x86 instructions are already handled. This will allow your new processor to run at full clock speed, but the average CPI for the new instructions is now 10. The CPI of the old instructions is unchanged.

OOPS!!! I threw this problem together quickly and forgot to give you all the required information. In particular, I neglected to tell you what the CPI of the non-Streaming-SIMD-Extension instructions was. If you assumed a reasonable CPI (like 1) for the other instructions and worked the problem correctly from there, you get full credit. If you solved the problem correctly in terms of the unknown quantity, then you get some extra credit. Note that it's hard to attack this problem using Amdahl's Law, as the 5 percent information I gave you is for *instructions*, whereas Amdahl's Law is based on fraction of *time*. So, again, you'd need CPI information to convert between them.

(a) If, in the future, you expect 5 percent of all instructions executed to be the new Streaming SIMD Extension instructions, which option gives you the faster processor?

Solution 1 (assuming CPI for other stuff is 1):

    CPUTime = IC * CPI * clock period

    Option 1: CPUTime1 = IC * (0.95*1 + 0.05*1) * (clock period)/0.9 = 1.11*IC*(clock period)

Note that it's wrong to multiply by 1.1. The clock *frequency* is 10% slower, or 90% of the original frequency, so the clock *period* is multiplied by 1/0.9.

    Option 2: CPUTime2 = IC * (0.95*1 + 0.05*10) * (clock period) = 1.45*IC*(clock period)

So, Option 1 is much faster.

Solution 2 (general solution, worth extra credit):

    CPUTime = IC * CPI * clock period

Let OldCPI be the CPI for the unaffected instructions.

    Option 1: CPUTime1 = IC * (0.95*OldCPI + 0.05*1) * (clock period)/0.9
    Option 2: CPUTime2 = IC * (0.95*OldCPI + 0.05*10) * (clock period)

If you work out the numbers, you'll find that these two are the same when OldCPI is about 4.2. So, if OldCPI > 4.2, then Option 2 is faster; otherwise, Option 1 is faster.

(b) What percentage of code being new instructions is the crossover point: the point where Options 1 and 2 have the same average performance?

Again, we have the choice of assuming some OldCPI value, like 1, or solving the general problem.

Solution 1 (assuming CPI for other stuff is 1):

    CPUTime = IC * CPI * clock period

    Option 1: CPUTime1 = IC * ((1-x)*1 + x*1) * (clock period)/0.9 = 1.11*IC*(clock period)
    Option 2: CPUTime2 = IC * ((1-x)*1 + x*10) * (clock period) = (1+9x)*IC*(clock period)

The crossover point is where these two are the same. Solving for x, we get x is about 1.2%. So, Option 2 is faster only if the new instructions are very uncommon (less than about 1.2% of the instruction mix).

Solution 2 (general solution, worth extra credit):

    Option 1: CPUTime1 = IC * ((1-x)*OldCPI + x*1) * (clock period)/0.9
    Option 2: CPUTime2 = IC * ((1-x)*OldCPI + x*10) * (clock period)

So, we have

    ((1-x)*OldCPI + x)/0.9 = (1-x)*OldCPI + 10x

and we just need to solve for x. I get:

    x = OldCPI / (OldCPI + 80) = 1 / (1 + 80/OldCPI)

With OldCPI = 1 this gives x = 1/81, about 1.2%, matching Solution 1.
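If you want to double-check the arithmetic, here is a small Python sketch (my own check, not part of the assigned solution) that recomputes the numbers above. IC and the clock period cancel out of every comparison, so times are per instruction in units of the original clock period:

    # Quick numeric check of the Problem 2 arithmetic (a sketch only).

    def time_option1(old_cpi, x):
        # new functional units: new instructions take 1 CPI, clock period / 0.9
        return ((1 - x) * old_cpi + x * 1) / 0.9

    def time_option2(old_cpi, x):
        # microcode: new instructions take 10 CPI, full clock speed
        return (1 - x) * old_cpi + x * 10

    # Part (a): 5% new instructions, OldCPI assumed to be 1.
    print(time_option1(1, 0.05), time_option2(1, 0.05))  # ~1.11 vs 1.45

    # Part (a), general: OldCPI where the two options tie at x = 5%.
    # From 0.095*OldCPI = 0.4  ->  OldCPI ~ 4.2
    print(0.4 / 0.095)

    # Part (b): crossover fraction x for a given OldCPI, x = OldCPI/(OldCPI+80).
    def crossover_x(old_cpi):
        return old_cpi / (old_cpi + 80.0)

    print(crossover_x(1))  # ~0.0123, i.e. about 1.2%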
(c) The product manager (the marketing leader of the team) is worried the project is going over budget. He has just read an article about picoJava (the same one you read), where they described implementing really complex instructions as traps to the OS software. He recommends scrapping the project and simply selling Foobium II processors as new "Foobium III with Streaming SIMD Extension" processors, and handling the new instructions in software. There are obvious performance problems with this approach, but the product manager argues that no one is using the new instructions yet, so no one will notice the performance problems for a while. Would the "new" processor (relabeled Foobium II) implement the new Pentium III ISA (with Streaming SIMD Extensions)? Assuming appropriate software to handle the new instructions, would your "new" processor be able to execute application programs that used the new instructions? Would the "new" processor be compatible at the system software level? Explain your answers.

No, the new processor would not implement the new Pentium III ISA. We can always emulate instruction sets in software, but that doesn't change the ISA of the underlying processor. For an extreme example, I could take an old, old processor like a Z-80 or a 6502 or something, and write an interpreter for it that emulates a Pentium III. That doesn't make the old processor a Pentium III. The ISA refers to what the processor actually implements in hardware.

With appropriate software support, the Foobium III could execute application software written for the Pentium III. Unimplemented instructions would get emulated in software by the operating system. (This is similar to how a Pentium can run Java byte code, or how a PowerPC-based Macintosh can run applications written for 68000-based Macintoshes.)

No, the new processor would not be compatible at the system software level, since it's the system software that has to provide the support for emulating the unimplemented instructions.

Problem 3: Predication

Assume integer registers r0..r31, and predicate registers p0..p31. You have predicate assignment statements like:

    peq p1, r1, r2    -- set p1 to true iff r1 == r2
    pne p2, r1, r2    -- set p2 to true iff r1 != r2
    pge p2, r1, r2    -- set p2 to true iff r1 >= r2
    etc.

You also have predicate logical operations like:

    pand p1, p2, p3   -- set p1 to be p2 AND p3
    por  p1, p2, p3   -- set p1 to be p2 OR p3
    pnot p1, p2       -- set p1 to be NOT p2

All instructions can be predicated, e.g.

    fadd f1, f2, f3 (p1)   -- if p1 then f1 = f2 + f3

Consider this code fragment:

    L1:  bge r1, 1900, done:   -- if r1 >= 1900 goto done:
    L2:  blt r1, 30, y2k:      -- if r1 <= 30 goto y2k:
    L3:  addi r1, r1, 1900     -- r1 += 1900
    L4:  br done               -- goto done:
    y2k:
    L5:  addi r1, r1, 2000     -- r1 += 2000
    done:

Convert the above code to use predication. (BTW, there's a typo in the code for line L2 -- the mnemonic says strictly less than, but the comment says less than or equal. You may assume either.)

Lots of possible solutions. The key is to get the same behavior, but to eliminate all branches. Here's one possibility:

    plt  p1, r1, 1900        -- p1 is r1 < 1900
    plt  p2, r1, 30          -- p2 is r1 < 30
    pnot p3, p2              -- p3 = !p2
    pand p4, p1, p2          -- p4 = p1 && p2
    pand p5, p1, p3          -- p5 = p1 && !p2
    addi r1, r1, 1900 (p5)
    addi r1, r1, 2000 (p4)
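A quick way to convince yourself the predicated version matches the original is to model both in software and compare them over a range of inputs. Here's a Python sketch of that check (not part of the assigned solution); it follows the mnemonic's strict "r1 < 30" reading of line L2, per the typo note above:

    # Sketch: model the branchy code and the predicated code, then compare.

    def branchy(r1):
        if r1 >= 1900:          # L1: bge r1, 1900, done
            return r1
        if r1 < 30:             # L2: blt r1, 30, y2k
            return r1 + 2000    # L5: addi r1, r1, 2000
        return r1 + 1900        # L3: addi r1, r1, 1900

    def predicated(r1):
        p1 = r1 < 1900          # plt  p1, r1, 1900
        p2 = r1 < 30            # plt  p2, r1, 30
        p3 = not p2             # pnot p3, p2
        p4 = p1 and p2          # pand p4, p1, p2
        p5 = p1 and p3          # pand p5, p1, p3
        if p5:
            r1 = r1 + 1900      # addi r1, r1, 1900 (p5)
        if p4:
            r1 = r1 + 2000      # addi r1, r1, 2000 (p4)
        return r1

    assert all(branchy(y) == predicated(y) for y in range(-50, 2100))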
Problem 4: Power

You are team leader for the new Furby2000 toy. One of the key design goals is full digital audio processing and speech recognition, performed by the single microprocessor in the toy. The primary performance design requirement is that the Furby2000 processor be able to process a minute of audio input within ten seconds. Your engineering team has greatly exceeded the performance goal: the prototype can process a minute of audio input in 1 second. Unfortunately, the prototype has a battery life of only 15 minutes.

(a) Your goal is to reduce processor power consumption. The current prototype uses a processor running at 5 volts and 100 MHz, but you can scale the voltage down to 3.3 or 2.5 volts (forcing a slower clock frequency of 66 or 50 MHz, respectively). There is also a lower-performance processor available (with the same ISA and CPI) running at 16 MHz and consuming 1/8 of the power at 5 volts. This processor can also be scaled down to 3.3 or 2.5 volts, with the expected slow-down in clock speed. For each processor and voltage combination, estimate the performance and power consumption relative to the original prototype. You may ignore leakage current and threshold voltage effects.

    P = V^2 * C * f

We can make a table:

    Processor   Voltage   Clock      Relative power   Relative speed
    Fast        5 V       100 MHz    1                1
    Fast        3.3 V     66 MHz     0.30             0.67
    Fast        2.5 V     50 MHz     0.125            0.50
    Slow        5 V       16 MHz     0.125            0.167
    Slow        3.3 V     10.7 MHz   0.037            0.107
    Slow        2.5 V     8 MHz      0.016            0.083

The Fast processor at 5 V is the baseline. Clock frequency scales linearly with voltage, so at 3.3 V the power gets reduced by a factor of (2/3)^2 because of the voltage times an additional factor of (2/3) because of the frequency, giving 0.30. The Slow processor at 5 V uses 1/8 of the power and runs 1/6 as fast; scaling its voltage reduces its power and clock by the same factors as for the Fast processor.

(b) If half of the total system power consumption is due to the microprocessor (the other half is mainly the satellite transponder that relays collected marketing data to company headquarters), what will be the battery life of the improved toy, given your best choice from part (a)? You may assume the battery holds a fixed total amount of energy, regardless of power drain (not true in reality).

By our performance requirement, we can slow down by a factor of 10 at most (from 1 second to 10 seconds). Therefore, we can't use the Slow processor at 2.5 V: its relative speed of 0.083 is below the required 0.1. The next lowest power is the Slow processor at 3.3 V, which just meets our performance needs and uses only about 1/27 (approx 0.037) as much power.

Now, if half of the power drain doesn't change, and the other half of the power drain gets 27 times smaller, how much does the total power drain improve? Original power drain = 1; new power drain = 0.5 + 0.5/27. The ratio is 1.93. (This is very similar to an Amdahl's Law calculation.) So the new battery life is 1.93 * 15 minutes, or a bit less than 29 minutes. (Comparable to a Windows CE device. :-)) Even if we cut the processor power to zero, we'd still only get to 30 minutes, since the satellite transponder is chewing up half the power.
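Here's a short Python sketch (again, not part of the assigned solution) that recomputes the table and the battery-life estimate. Like the table above, it treats 3.3 V as roughly 2/3 of 5 V and 2.5 V as 1/2, and it assumes clock frequency scales linearly with voltage:

    # Sketch: recompute the Problem 4 numbers.  Dynamic power ~ C * V^2 * f.

    volt_scale = {5.0: 1.0, 3.3: 2.0 / 3.0, 2.5: 0.5}
    procs = {"Fast": (100.0, 1.0), "Slow": (16.0, 1.0 / 8.0)}  # (clock MHz, rel. power at 5 V)

    for name, (clock5, power5) in procs.items():
        for volts, s in volt_scale.items():
            clock = clock5 * s
            power = power5 * s ** 3   # V^2 from the voltage, another factor of V from the frequency
            speed = clock / 100.0     # relative to the 100 MHz baseline
            print(f"{name}  {volts} V  {clock:6.1f} MHz  power {power:.3f}  speed {speed:.3f}")

    # Part (b): the slowest choice still meeting speed >= 0.1 is Slow at 3.3 V.
    cpu_power = (1.0 / 8.0) * (2.0 / 3.0) ** 3   # ~1/27, i.e. ~0.037
    new_total = 0.5 + 0.5 * cpu_power            # transponder half is unchanged
    print(15.0 / new_total)                      # new battery life: a bit less than 29 minutes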