CPSC 418, Problem Set #1, Due 10am, January 21, 1999 (Remember: No late homework!)

Problem 1: (15 points)

Something like Moore's Law also applies to the amount of computing power you can buy for a given amount of money. For example, $2000 today buys you a machine twice as fast as the one $2000 bought 18 months ago, or equivalently, a machine that cost $2000 18 months ago will only cost $1000 today.

(a) Assuming that price/performance drops smoothly at this rate, halving every 1.5 years, how much do prices drop every week?

Solution: There are 52 weeks in a year, so there are 78 weeks in 1.5 years. Therefore, we want the fraction that multiplied 78 times gives us 0.5; in other words, we want the 78th root of 0.5. This is approximately 0.991153. Therefore prices drop by 1-0.991153 or 0.008847 each week, or a bit under 1% per week.

(b) Note that anything related to price is sensitive to the choice of currency. If we assume that the rate in (a) is based on US dollars, and we assume that the Canadian dollar is losing 1 percent per week in value against the US dollar, when is the best price/performance available in Canadian dollars?

Solution: If the Canadian dollar is dropping 1% per week, and the prices in US dollars are dropping just under 1% per week, we actually get the best price/performance right now! Measured in Canadian dollars, each week multiplies the price by roughly 0.991153/0.99, which is slightly greater than 1, so prices in Canadian dollars actually creep upward over time.

Problem 2: (25 points)

Foobar Computer Company makes a microprocessor with a die size of 100 square millimeters. They are currently built in a fab on 20cm diameter wafers. The wafers cost $3000. Defect density in the fab is 1.0 defects per square centimeter. For the current process, Alpha=3.0. Testing costs $10 per die. Packaging costs $15 per die. Using the cost model presented in class, calculate the cost of each microprocessor. You may assume a final test yield of 1.0.

Foobar is considering switching to a more advanced process, which would shrink the die size to 90 square millimeters. The wafers are still 20cm diameter, but cost $4000 in the newer process. Defect density is the same, but the more complex process has Alpha=4.0. Furthermore, testing is more expensive at $15 per die. Packaging costs are the same. Would the microprocessor be more or less expensive in the new process?

Solution: It's important to make sure that all areas are in the same units. I'll convert to square centimeters, so the old die is 1 sq cm, and the new die is 0.9 sq cm.

The die yield for the old die is (1 + 1*1/3.0)^(-3.0) = 0.421875. Dies per wafer for the old die is pi*(10)^2/1 - pi*20/sqrt(2), which equals 269.730, or about 270. Therefore the cost of each die is 3000/(270*0.421875) or about $26. (The estimate of dies per wafer is fairly rough.) Adding testing and packaging gives us a total cost of about $51.

For the new die, the yield is (1 + 0.9*1/4.0)^(-4.0) = 0.444074. Dies per wafer is now pi*(10)^2/0.9 - pi*20/sqrt(1.8), which gives 302.233, or about 302. Therefore, the cost of each die is 4000/(302*0.444074) or about $30, giving a total cost of about $60.

Therefore, in this case, the older process was cheaper.
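As an optional sanity check (not part of the required solution), the arithmetic in Problems 1 and 2 can be recomputed with a few lines of Python; the helper names below are made up for illustration, not anything given in the problems:

  # Optional check of Problems 1 and 2; helper names are made up for illustration.
  import math

  # Problem 1(a): weekly factor if price/performance halves every 78 weeks.
  weekly_factor = 0.5 ** (1 / 78)
  print(f"weekly factor = {weekly_factor:.6f}, weekly drop = {1 - weekly_factor:.6f}")

  # Problem 2: cost model as used above:
  #   die yield      = (1 + defect_density * die_area / alpha) ** (-alpha)
  #   dies per wafer = pi * (diameter/2)**2 / die_area - pi * diameter / sqrt(2 * die_area)
  #   chip cost      = wafer_cost / (dies_per_wafer * die_yield) + testing + packaging
  def chip_cost(die_area_cm2, wafer_cost, alpha, test_cost, package_cost,
                wafer_diameter_cm=20.0, defect_density=1.0):
      die_yield = (1 + defect_density * die_area_cm2 / alpha) ** (-alpha)
      dies_per_wafer = (math.pi * (wafer_diameter_cm / 2) ** 2 / die_area_cm2
                        - math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2))
      return wafer_cost / (dies_per_wafer * die_yield) + test_cost + package_cost

  print(f"old process: ${chip_cost(1.0, 3000, 3.0, 10, 15):.2f}")  # roughly $51
  print(f"new process: ${chip_cost(0.9, 4000, 4.0, 15, 15):.2f}")  # roughly $60

The exact values differ from the rounded figures above by a few cents, because the sketch keeps the fractional dies-per-wafer count rather than rounding to 270 and 302.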
Problem 3: (15 points)

On a set of four benchmark programs, a certain computer has run times of 12 seconds, 9 seconds, 3 seconds, and 4 seconds.

(a) Compute the arithmetic mean and geometric mean of these times.

Solution: Arithmetic mean = (12+9+3+4)/4 = 7. Geometric mean = (12*9*3*4)^(0.25) = 6.

(b) Suppose that because of measurement error, the times above can be up to 1 second off from the true run time (e.g., on the first program, the true run time could have been as much as 13 seconds or as little as 11 seconds). In the worst case, how far off could the arithmetic mean be? The geometric mean?

Solution: The arithmetic mean can be up to 1 second off in either direction. You can either compute this directly, or note that the added/subtracted 1 can be distributed out. For the geometric mean, it's easiest just to compute directly. The real geometric mean could be as low as (11*8*2*3)^(0.25) = 4.8, or 1.2 seconds lower than 6. Or, it could be as high as (13*10*4*5)^(0.25) = 7.1, or up to 1.1 seconds higher.

Problem 4: (15 points)

The new Foobium II processor runs exactly like a Pentium II, except that floating point square root has been sped up by a factor of 10.

(a) If your applications spend 5% of the time doing floating point square root, how much speedup will you observe for this application on the Foobium II over a Pentium II?

Solution: This is a direct application of Amdahl's Law. Speedup is 1/((1-0.05)+0.05/10) = 1.047. So we're going less than 5% faster.

(b) Your buddy reports that on *his* programs, his new Foobium II is twice as fast as an old Pentium II. What fraction of his time did he use to spend on floating point square root?

Solution: This is using Amdahl's Law in reverse. We've got 2 = 1/((1-f)+f/10). Solving for f gives us f = 5/9. Your buddy must *really* like floating point square root!
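Again as an optional sanity check, the worst-case geometric mean in Problem 3(b) and the Amdahl's Law numbers in Problem 4 can be double-checked with a short Python sketch (the names here are just for illustration):

  # Optional check of Problems 3(b) and 4; names are made up for illustration.
  from itertools import product
  from math import prod

  times = [12, 9, 3, 4]
  print(f"geometric mean = {prod(times) ** 0.25:.1f}")   # 6.0

  # Problem 3(b): brute-force all +/-1 second perturbations of the four times.
  geo_means = [prod(t + d for t, d in zip(times, deltas)) ** 0.25
               for deltas in product((-1, 0, 1), repeat=4)]
  print(f"worst-case geometric mean: {min(geo_means):.2f} to {max(geo_means):.2f}")
  # -> about 4.79 to 7.14, i.e. roughly 1.2 below and 1.1 above 6

  # Problem 4: Amdahl's Law, forward and in reverse.
  def speedup(fraction, factor):
      return 1 / ((1 - fraction) + fraction / factor)

  print(f"(a) speedup = {speedup(0.05, 10):.3f}")        # about 1.047
  # (b) solve 2 = 1/((1-f) + f/10):  (1-f) + f/10 = 1/2  =>  f = (1/2)/(9/10) = 5/9
  print(f"(b) fraction = {(1 / 2) / (9 / 10):.3f}")      # about 0.556 = 5/9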
Problem 5: (30 points)

You have been hired as a benchmarketing consultant by Foobar Computer Company. They have designed the new Foo1000 processor, which is completely compatible with the competitor's product. However, thanks to Foobar's amazing computer designers, the new Foo1000 can execute NOP (no-op -- an instruction that does nothing) instructions in zero clock cycles, whereas the competitor requires one clock cycle to execute a NOP. On all other instructions, the two machines have the same CPI. Unfortunately, the Foo1000 runs at only 90 MHz, whereas the competitor runs at 100 MHz. Your task is to write a short "benchmark" program that takes at least 10 seconds on a Foo1000, and that shows that the Foo1000 is more than twice as fast as the competitor. You may use the following instructions:

  ADD Rdst,Rsrc,Rsrc        Add. The destination register gets the sum of the source registers. CPI=1.
  ADDI Rdst,Rsrc,#constant  Add Immediate. The destination register gets the source register plus the constant. CPI=1.
  BGT Rsrc, Label           If the source register is greater than zero, branch to Label. CPI=3. There are no branch delay slots.
  NOP                       No op. Does nothing.
  LDI Rdst, #constant       Load Immediate. The destination register gets loaded with the constant. CPI=1.

You have 32 registers named R0..R31. The constant in the immediate field can be any number between -16 and 15 inclusive for ADDI, and between -512 and 511 inclusive for LDI. You must explain why your solution achieves the desired goal.

Solution: Obviously, we need a program with a whole lot of NOPs in it. To get the required run time in a small program, we'll need to loop quite a bit -- 10 seconds on a 90 MHz machine means we need 900 million clock cycles! Because the immediate fields are fairly small, we'll need 3 or 4 nested loops, e.g.:

         LDI  R1, #360     ; 1 cycle
  Loop1: LDI  R2, #500     ; 1 cycle
  Loop2: LDI  R3, #500     ; 1 cycle
  Loop3:                   ; loop body: x cycles
         ADDI R3, R3, #-1  ; 1 cycle
         BGT  R3, Loop3    ; 3 cycles
         ADDI R2, R2, #-1  ; 1 cycle
         BGT  R2, Loop2    ; 3 cycles
         ADDI R1, R1, #-1  ; 1 cycle
         BGT  R1, Loop1    ; 3 cycles

This template gives me 90 million iterations of the inner loop body. More precisely, it takes 1 + 360(1 + 500(1 + 500(x+1+3) + 1+3) + 1+3) = 90000000x + 360901801 cycles to execute. So, if the loop body takes at least 6 clock cycles on the Foo1000, we'll pass the 10 second mark.

To get proper speedup, we'll need to pad the loop body with enough NOPs. Note that if we add 10 NOPs to a 6-cycle body, the competitor will execute almost twice as many clock cycles as the Foo1000 (the NOPs cost the Foo1000 nothing). However, our competitor has a more than 11% faster clock than we do, so we'll need to add a few more NOPs. 13 NOPs is enough, giving the proposed benchmark:

         LDI  R1, #360     ; 1 cycle
  Loop1: LDI  R2, #500     ; 1 cycle
  Loop2: LDI  R3, #500     ; 1 cycle
  Loop3: ADD  R4, R4, R4   ; 1 cycle
         ADD  R4, R4, R4   ; 1 cycle
         ADD  R4, R4, R4   ; 1 cycle
         ADD  R4, R4, R4   ; 1 cycle
         ADD  R4, R4, R4   ; 1 cycle
         ADD  R4, R4, R4   ; 1 cycle
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         NOP               ; 0 cycles on Foo1000, 1 cycle on competitor
         ADDI R3, R3, #-1  ; 1 cycle
         BGT  R3, Loop3    ; 3 cycles
         ADDI R2, R2, #-1  ; 1 cycle
         BGT  R2, Loop2    ; 3 cycles
         ADDI R1, R1, #-1  ; 1 cycle
         BGT  R1, Loop1    ; 3 cycles

Run time on the Foo1000 is 90000000(6) + 360901801 = 900901801 cycles at 90 MHz, or about 10.01 seconds. Run time on the competition is 90000000(19) + 360901801 = 2070901801 cycles at 100 MHz, or about 20.71 seconds. The Foo1000 is MORE THAN TWICE AS FAST as the competition!
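Finally, as an optional check of the cycle arithmetic, the following Python sketch (names are illustrative only) evaluates the cycle formula for both machines and confirms the claimed run times and speedup:

  # Optional check of the cycle counts above; names are made up for illustration.
  def total_cycles(body_cycles):
      # 1 + 360*(1 + 500*(1 + 500*(x+4) + 4) + 4), the formula derived above
      return 1 + 360 * (1 + 500 * (1 + 500 * (body_cycles + 4) + 4) + 4)

  adds, nops = 6, 13
  foo_cycles = total_cycles(adds)           # the 13 NOPs are free on the Foo1000
  comp_cycles = total_cycles(adds + nops)   # each NOP costs the competitor 1 cycle

  foo_time = foo_cycles / 90e6              # Foo1000 runs at 90 MHz
  comp_time = comp_cycles / 100e6           # competitor runs at 100 MHz
  print(f"Foo1000:    {foo_time:.2f} s")    # about 10.01 s
  print(f"Competitor: {comp_time:.2f} s")   # about 20.71 s
  print(f"Speedup:    {comp_time / foo_time:.2f}x")   # about 2.07x, i.e. more than 2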