CPSC 418, Problem Set #1, Due 10am, January 21, 1999 (Remember: No late homework!)

Problem 1: (15 points)

Something like Moore's Law also applies to the amount of computing power you can buy for a given amount of money. For example, $2000 today buys you a machine twice as fast as the one $2000 bought 18 months ago, or equivalently, a machine that cost $2000 18 months ago will only cost $1000 today.

(a) Assuming that price/performance drops smoothly at this rate, halving every 1.5 years, how much do prices drop every week?

Solution: There are 52 weeks in a year, so there are 78 weeks in 1.5 years. Therefore, we want the fraction that multiplied 78 times gives us 0.5; in other words, we want the 78th root of 0.5. This is approximately 0.991153. Therefore prices drop by 1-0.991153 or 0.008847 each week, or a bit under 1% per week.

(b) Note that anything related to price is sensitive to the choice of currency. If we assume that the rate in (a) is based on US dollars, and we assume that the Canadian dollar is losing 1 percent per week in value against the US dollar, when is the best price/performance available in Canadian dollars?

Solution: If the Canadian dollar is dropping 1% per week, and the prices in US dollars are dropping just under 1% per week, we actually get the best price/performance right now! Measured in Canadian dollars, each week multiplies the price by roughly 0.991153/0.99, which is slightly greater than 1, so prices in Canadian dollars actually creep upward over time.

Problem 2: (25 points)

Foobar Computer Company makes a microprocessor with a die size of 100 square millimeters. They are currently built in a fab on 20cm diameter wafers. The wafers cost $3000. Defect density in the fab is 1.0 defects per square centimeter. For the current process, Alpha=3.0. Testing costs $10 per die. Packaging costs $15 per die. Using the cost model presented in class, calculate the cost of each microprocessor. You may assume a final test yield of 1.0.

Foobar is considering switching to a more advanced process, which would shrink the die size to 90 square millimeters. The wafers are still 20cm diameter, but cost $4000 in the newer process. Defect density is the same, but the more complex process has Alpha=4.0. Furthermore, testing is more expensive at $15 per die. Packaging costs are the same. Would the microprocessor be more or less expensive in the new process?

Solution: It's important to make sure that all areas are in the same units. I'll convert to square centimeters, so the old die is 1 sq cm, and the new die is 0.9 sq cm.

The die yield for the old die is (1 + 1*1/3.0)^(-3.0) = 0.421875. Dies per wafer for the old die is pi*(10)^2/1 - pi*20/sqrt(2), which equals 269.730, or about 270. Therefore the cost of each die is 3000/(270*0.421875) or about $26. (The estimate of dies per wafer is fairly rough.) Adding testing and packaging gives us a total cost of about $51.

For the new die, the yield is (1 + 0.9*1/4.0)^(-4.0) = 0.444074. Dies per wafer is now pi*(10)^2/0.9 - pi*20/sqrt(1.8), which gives 302.233, or about 302. Therefore, the cost of each die is 4000/(302*0.444074) or about $30, giving a total cost of about $60.

Therefore, in this case, the older process was cheaper.
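As an optional sanity check (not part of the required solution), the arithmetic in Problems 1 and 2 can be recomputed with a few lines of Python; the helper names below are made up for illustration, not anything given in the problems:

  # Optional check of Problems 1 and 2; helper names are made up for illustration.
  import math

  # Problem 1(a): weekly factor if price/performance halves every 78 weeks.
  weekly_factor = 0.5 ** (1 / 78)
  print(f"weekly factor = {weekly_factor:.6f}, weekly drop = {1 - weekly_factor:.6f}")

  # Problem 2: cost model as used above:
  #   die yield      = (1 + defect_density * die_area / alpha) ** (-alpha)
  #   dies per wafer = pi * (diameter/2)**2 / die_area - pi * diameter / sqrt(2 * die_area)
  #   chip cost      = wafer_cost / (dies_per_wafer * die_yield) + testing + packaging
  def chip_cost(die_area_cm2, wafer_cost, alpha, test_cost, package_cost,
                wafer_diameter_cm=20.0, defect_density=1.0):
      die_yield = (1 + defect_density * die_area_cm2 / alpha) ** (-alpha)
      dies_per_wafer = (math.pi * (wafer_diameter_cm / 2) ** 2 / die_area_cm2
                        - math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2))
      return wafer_cost / (dies_per_wafer * die_yield) + test_cost + package_cost

  print(f"old process: ${chip_cost(1.0, 3000, 3.0, 10, 15):.2f}")  # roughly $51
  print(f"new process: ${chip_cost(0.9, 4000, 4.0, 15, 15):.2f}")  # roughly $60

The exact values differ from the rounded figures above by a few cents, because the sketch keeps the fractional dies-per-wafer count rather than rounding to 270 and 302.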
Problem 3: (15 points)

On a set of four benchmark programs, a certain computer has run times of 12 seconds, 9 seconds, 3 seconds, and 4 seconds.

(a) Compute the arithmetic mean and geometric mean of these times.

Solution: Arithmetic mean = (12+9+3+4)/4 = 7. Geometric mean = (12*9*3*4)^(0.25) = 6.

(b) Suppose that because of measurement error, the times above can be up to 1 second off from the true run time (e.g., on the first program, the true run time could have been as much as 13 seconds or as little as 11 seconds). In the worst case, how far off could the arithmetic mean be? The geometric mean?

Solution: The arithmetic mean can be up to 1 second off in either direction. You can either compute this directly, or note that the added/subtracted 1 can be distributed out. For the geometric mean, it's easiest just to compute directly. The real geometric mean could be as low as (11*8*2*3)^(0.25) = 4.8, or 1.2 seconds lower than 6. Or, it could be as high as (13*10*4*5)^(0.25) = 7.1, or up to 1.1 seconds higher.

Problem 4: (15 points)

The new Foobium II processor runs exactly like a Pentium II, except that floating point square root has been sped up by a factor of 10.

(a) If your applications spend 5% of the time doing floating point square root, how much speedup will you observe for this application on the Foobium II over a Pentium II?

Solution: This is a direct application of Amdahl's Law. Speedup is 1/((1-0.05)+0.05/10) = 1.047. So we're going less than 5% faster.

(b) Your buddy reports that on *his* programs, his new Foobium II is twice as fast as an old Pentium II. What fraction of his time did he use to spend on floating point square root?

Solution: This is using Amdahl's Law in reverse. We've got 2 = 1/((1-f)+f/10). Solving for f gives us f = 5/9. Your buddy must *really* like floating point square root!
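Again as an optional sanity check, the worst-case geometric mean in Problem 3(b) and the Amdahl's Law numbers in Problem 4 can be double-checked with a short Python sketch (the names here are just for illustration):

  # Optional check of Problems 3(b) and 4; names are made up for illustration.
  from itertools import product
  from math import prod

  times = [12, 9, 3, 4]
  print(f"geometric mean = {prod(times) ** 0.25:.1f}")   # 6.0

  # Problem 3(b): brute-force all +/-1 second perturbations of the four times.
  geo_means = [prod(t + d for t, d in zip(times, deltas)) ** 0.25
               for deltas in product((-1, 0, 1), repeat=4)]
  print(f"worst-case geometric mean: {min(geo_means):.2f} to {max(geo_means):.2f}")
  # -> about 4.79 to 7.14, i.e. roughly 1.2 below and 1.1 above 6

  # Problem 4: Amdahl's Law, forward and in reverse.
  def speedup(fraction, factor):
      return 1 / ((1 - fraction) + fraction / factor)

  print(f"(a) speedup = {speedup(0.05, 10):.3f}")        # about 1.047
  # (b) solve 2 = 1/((1-f) + f/10):  (1-f) + f/10 = 1/2  =>  f = (1/2)/(9/10) = 5/9
  print(f"(b) fraction = {(1 / 2) / (9 / 10):.3f}")      # about 0.556 = 5/9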
Problem 5: (30 points)

You have been hired as a benchmarketing consultant by Foobar Computer Company. They have designed the new Foo1000 processor, which is completely compatible with the competitor's product. However, thanks to Foobar's amazing computer designers, the new Foo1000 can execute NOP (no-op -- an instruction that does nothing) instructions in zero clock cycles, whereas the competitor requires one clock cycle to execute a NOP. On all other instructions, the two machines have the same CPI. Unfortunately, the Foo1000 runs at only 90 MHz, whereas the competitor runs at 100 MHz. Your task is to write a short "benchmark" program that takes at least 10 seconds on a Foo1000, and that shows that the Foo1000 is more than twice as fast as the competitor. You may use the following instructions:

  ADD Rdst,Rsrc,Rsrc        Add. The destination register gets the sum of the source registers. CPI=1.
  ADDI Rdst,Rsrc,#constant  Add Immediate. The destination register gets the source register plus the constant. CPI=1.
  BGT Rsrc, Label           If the source register is greater than zero, branch to Label. CPI=3. There are no branch delay slots.
  NOP                       No op. Does nothing.
  LDI Rdst, #constant       Load Immediate. The destination register gets loaded with the constant. CPI=1.

You have 32 registers named R0..R31. The constant in the immediate field can be any number between -16 and 15 inclusive for ADDI, and between -512 and 511 inclusive for LDI. You must explain why your solution achieves the desired goal.

Solution: Obviously, we need a program with a whole lot of NOPs in it. To get the required run time in a small program, we'll need to loop quite a bit -- 10 seconds on a 90 MHz machine means we need 900 million clock cycles! Because the immediate fields are fairly small, we'll need 3 or 4 nested loops, e.g.:

         LDI  R1, #360     ; 1 cycle
  Loop1: LDI  R2, #500     ; 1 cycle
  Loop2: LDI  R3, #500     ; 1 cycle
  Loop3:                   ; loop body: x cycles
         ADDI R3, R3, #-1  ; 1 cycle
         BGT  R3, Loop3    ; 3 cycles
         ADDI R2, R2, #-1  ; 1 cycle
         BGT  R2, Loop2    ; 3 cycles
         ADDI R1, R1, #-1  ; 1 cycle
         BGT  R1, Loop1    ; 3 cycles

This template gives me 90 million iterations of the inner loop body. More precisely, it takes 1 + 360(1 + 500(1 + 500(x+1+3) + 1+3) + 1+3) = 90000000x + 360901801 cycles to execute. So, if the loop body takes at least 6 clock cycles on the Foo1000, we'll pass the 10 second mark.

To get proper speedup, we'll need to pad the loop body with enough NOPs. Note that if we add 10 NOPs to a 6-cycle body, the competitor will execute almost twice as many clock cycles as the Foo1000 (the NOPs cost the Foo1000 nothing). However, our competitor has a more than 11% faster clock than we do, so we'll need to add a few more NOPs. 13 NOPs is enough, giving the proposed benchmark:

         LDI  R1, #360     ; 1 cycle
  Loop1: LDI  R2, #500     ; 1 cycle
  Loop2: LDI  R3, #500     ; 1 cycle
  Loop3: ADD  R4, R4, R4   ; 1 cycle
         ADD  R4, R4, R4   ; 1 cycle
         ADD  R4, R4, R4   ; 1 cycle
         ADD  R4, R4, R4   ; 1 cycle
         ADD  R4, R4, R4   ; 1 cycle
         ADD  R4, R4, R4   ; 1 cycle
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         ADDI R3, R3, #-1  ; 1 cycle
         BGT  R3, Loop3    ; 3 cycles
         ADDI R2, R2, #-1  ; 1 cycle
         BGT  R2, Loop2    ; 3 cycles
         ADDI R1, R1, #-1  ; 1 cycle
         BGT  R1, Loop1    ; 3 cycles

Run time on the Foo1000 is 90000000(6) + 360901801 = 900901801 cycles at 90 MHz, or about 10.01 seconds. Run time on the competition is 90000000(19) + 360901801 = 2070901801 cycles at 100 MHz, or about 20.71 seconds. The Foo1000 is MORE THAN TWICE AS FAST as the competition!
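Finally, as an optional check of the cycle arithmetic, the following Python sketch (names are illustrative only) evaluates the cycle formula for both machines and confirms the claimed run times and speedup:

  # Optional check of the cycle counts above; names are made up for illustration.
  def total_cycles(body_cycles):
      # 1 + 360*(1 + 500*(1 + 500*(x+4) + 4) + 4), the formula derived above
      return 1 + 360 * (1 + 500 * (1 + 500 * (body_cycles + 4) + 4) + 4)

  adds, nops = 6, 13
  foo_cycles = total_cycles(adds)           # the 13 NOPs are free on the Foo1000
  comp_cycles = total_cycles(adds + nops)   # each NOP costs the competitor 1 cycle

  foo_time = foo_cycles / 90e6              # Foo1000 runs at 90 MHz
  comp_time = comp_cycles / 100e6           # competitor runs at 100 MHz
  print(f"Foo1000:    {foo_time:.2f} s")    # about 10.01 s
  print(f"Competitor: {comp_time:.2f} s")   # about 20.71 s
  print(f"Speedup:    {comp_time / foo_time:.2f}x")   # about 2.07x, i.e. more than 2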