CPSC 418, Problem Set #3 Solutions

Problem 1: ISA

We've learned in this course the goal of defining an ISA: it provides a clean interface for what the programmer or compiler can assume about the processor, thereby allowing changes to software and hardware implementation without breaking software compatibility. Unfortunately, it's not always clear exactly what's part of the ISA and what's part of the implementation. For example, AMD's K6 processor line is supposed to be completely compatible with Intel's processors. When the K6 first came out, there was a problem with a Windows device driver that required a patch to fix. It turns out that this device driver was using a busy loop in software to generate a fixed time delay. In the K6, the loop instruction had been optimized, so this delay loop ran too fast on the K6.

(a) Was the way Microsoft wrote the delay loop good programming practice? Were they obeying the ISA, or were they relying on implementation details? Justify your answer.

Good programming practice? No. Not for a desktop computer. The only time delay loops like that are justified is in tiny embedded applications where you have no other means of getting timing information.

Were they observing the ISA or exploiting implementation details? I prefer the answer that this was an implementation detail, but either answer is OK as long as you justify it properly. (Sort of like an arts class, eh?)

For ISA: The ISA is what's programmer visible. If you look at a manual for one of these processors, it will tell you how long each instruction takes. Therefore, these timings are part of the spec, are visible to the programmer, and can be used by the programmer. (This argument applies more for embedded processors, but one can make the argument.)

Against ISA: While the timings are in some sense visible, the whole point of an ISA is to allow various implementations. Obviously, we want the freedom for different implementations to run faster or slower, so it's against the spirit of the ISA to rely on the specific instruction timings of a specific implementation.

(b) Was it OK for AMD to change the timing of the loop instruction? Was this a violation of the ISA? Justify your answer.

OK to change? Sure. That's the whole point of making faster processors.

Violation of ISA? You should be consistent with your answer in part (a).

(c) Recently, AMD has introduced new 3D-Now instructions to perform faster single-precision floating point. Similarly, Intel has introduced Streaming SIMD Extensions, which are new numeric instructions for graphics. Are these changes to the ISA? If so, how can one write code that will run on both platforms?

Yes, these are changes to the ISA, as they add new instructions. Any code that uses the new instructions won't run on a machine that doesn't support them (unless you add special operating system support to trap the unimplemented instructions). If you want to be truly portable, you've got to avoid the added instructions. Otherwise, you could try relying on the operating system to encapsulate the new instructions in library routines (e.g. OpenGL) or provide software emulation.
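As an aside (not part of the assigned solution), one common way to get the best of both worlds is to detect processor features at run time and dispatch to an appropriate code path. Here is only a minimal Python sketch of the idea; it assumes a Linux-style /proc/cpuinfo, and the has_sse, fast_path, and portable_path names are made up purely for illustration:

    # Minimal sketch: run-time feature detection with a portable fallback.
    # Assumes Linux's /proc/cpuinfo; other systems expose CPUID differently.

    def has_sse():
        try:
            with open("/proc/cpuinfo") as f:
                for line in f:
                    if line.startswith("flags"):
                        return "sse" in line.split()
        except OSError:
            pass
        return False  # be conservative: assume the feature is absent

    def fast_path(data):
        # stand-in for code compiled to use the new SSE instructions
        return sum(data)

    def portable_path(data):
        # stand-in for code that avoids the new instructions entirely
        return sum(data)

    process = fast_path if has_sse() else portable_path
    print(process([1.0, 2.0, 3.0]))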
Problem 2: CISC vs. RISC

As senior architect at Foobar Computer Corp., you've been charged with updating the Foobium II processor by adding Intel's new Streaming SIMD Extension instructions. Your design teams have come up with two options. Option 1 is that they can add functional units for the new instructions. This will allow the new instructions to execute in 1 CPI. Unfortunately, the added complexity reduces the clock frequency by ten percent. Option 2 is to handle the new instructions in additional microcode, just as the complex x86 instructions are already handled. This will allow your new processor to run at full clock speed, but the average CPI for the new instructions is now 10. The CPI of the old instructions is unchanged.

OOPS!!! I threw this problem together quickly and forgot to give you all the required information. In particular, I neglected to tell you what the CPI of the non-Streaming-SIMD-Extension instructions was. If you assumed a reasonable CPI (like 1) for the other instructions and worked the problem correctly from there, you get full credit. If you solved the problem correctly in terms of the unknown quantity, then you get some extra credit. Note that it's hard to attack this problem using Amdahl's Law, as the 5 percent information I gave you is for *instructions*, whereas Amdahl's Law is based on fraction of *time*. So, again, you'd need CPI information to convert between them.

(a) If, in the future, you expect 5 percent of all instructions executed to be the new Streaming SIMD Extension instructions, which option gives you the faster processor?

Solution 1 (assuming CPI for other stuff is 1):

    CPUTime = IC * CPI * clock period

    Option 1: CPUTime1 = IC * (0.95*1 + 0.05*1) * (clock period)/0.9 = 1.11*IC*(clock period)

Note that it's wrong to multiply by 1.1. The clock *frequency* is 10% slower, or 90% of the original frequency, so the clock *period* is multiplied by 1/0.9.

    Option 2: CPUTime2 = IC * (0.95*1 + 0.05*10) * (clock period) = 1.45*IC*(clock period)

So, Option 1 is much faster.

Solution 2 (general solution, worth extra credit):

    CPUTime = IC * CPI * clock period

Let OldCPI be the CPI for the unaffected instructions.

    Option 1: CPUTime1 = IC * (0.95*OldCPI + 0.05*1) * (clock period)/0.9
    Option 2: CPUTime2 = IC * (0.95*OldCPI + 0.05*10) * (clock period)

If you work out the numbers, you'll find that these two are the same when OldCPI is about 4.2. So, if OldCPI > 4.2, then Option 2 is faster; otherwise, Option 1 is faster.

(b) What percentage of code being new instructions is the crossover point: the point where Options 1 and 2 have the same average performance?

Again, we have the choice of assuming some OldCPI value, like 1, or solving the general problem.

Solution 1 (assuming CPI for other stuff is 1):

    CPUTime = IC * CPI * clock period

    Option 1: CPUTime1 = IC * ((1-x)*1 + x*1) * (clock period)/0.9 = 1.11*IC*(clock period)
    Option 2: CPUTime2 = IC * ((1-x)*1 + x*10) * (clock period) = (1+9x)*IC*(clock period)

The crossover point is where these two are the same. Solving for x, we get x is about 1.2%. So, Option 2 is faster only if the new instructions are very uncommon (less than about 1.2% of the instruction mix).

Solution 2 (general solution, worth extra credit):

    Option 1: CPUTime1 = IC * ((1-x)*OldCPI + x*1) * (clock period)/0.9
    Option 2: CPUTime2 = IC * ((1-x)*OldCPI + x*10) * (clock period)

So, we have

    ((1-x)*OldCPI + x)/0.9 = (1-x)*OldCPI + 10x

and we just need to solve for x. I get:

    x = OldCPI / (OldCPI + 80) = 1 / (1 + 80/OldCPI)

With OldCPI = 1 this gives x = 1/81, about 1.2%, matching Solution 1.
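If you want to double-check the arithmetic, here is a small Python sketch (my own check, not part of the assigned solution) that recomputes the numbers above. IC and the clock period cancel out of every comparison, so times are per instruction in units of the original clock period:

    # Quick numeric check of the Problem 2 arithmetic (a sketch only).

    def time_option1(old_cpi, x):
        # new functional units: new instructions take 1 CPI, clock period / 0.9
        return ((1 - x) * old_cpi + x * 1) / 0.9

    def time_option2(old_cpi, x):
        # microcode: new instructions take 10 CPI, full clock speed
        return (1 - x) * old_cpi + x * 10

    # Part (a): 5% new instructions, OldCPI assumed to be 1.
    print(time_option1(1, 0.05), time_option2(1, 0.05))  # ~1.11 vs 1.45

    # Part (a), general: OldCPI where the two options tie at x = 5%.
    # From 0.095*OldCPI = 0.4  ->  OldCPI ~ 4.2
    print(0.4 / 0.095)

    # Part (b): crossover fraction x for a given OldCPI, x = OldCPI/(OldCPI+80).
    def crossover_x(old_cpi):
        return old_cpi / (old_cpi + 80.0)

    print(crossover_x(1))  # ~0.0123, i.e. about 1.2%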
(c) The product manager (the marketing leader of the team) is worried the project is going over budget. He has just read an article about picoJava (the same one you read), where they described implementing really complex instructions as traps to the OS software. He recommends scrapping the project and simply selling Foobium II processors as new "Foobium III with Streaming SIMD Extension" processors, and handling the new instructions in software. There are obvious performance problems with this approach, but the product manager argues that no one is using the new instructions yet, so no one will notice the performance problems for a while. Would the "new" processor (relabeled Foobium II) implement the new Pentium III ISA (with Streaming SIMD Extensions)? Assuming appropriate software to handle the new instructions, would your "new" processor be able to execute application programs that used the new instructions? Would the "new" processor be compatible at the system software level? Explain your answers.

No, the new processor would not implement the new Pentium III ISA. We can always emulate instruction sets in software, but that doesn't change the ISA of the underlying processor. For an extreme example, I could take an old, old processor like a Z-80 or a 6502 or something, and write an interpreter for it that emulates a Pentium III. That doesn't make the old processor a Pentium III. The ISA refers to what the processor actually implements in hardware.

With appropriate software support, the Foobium III could execute application software written for the Pentium III. Unimplemented instructions would get emulated in software by the operating system. (This is similar to how a Pentium can run Java byte code, or how a PowerPC-based Macintosh can run applications written for 68000-based Macintoshes.)

No, the new processor would not be compatible at the system software level, since it's the system software that has to provide the support for emulating the unimplemented instructions.

Problem 3: Predication

Assume integer registers r0..r31, and predicate registers p0..p31. You have predicate assignment statements like:

    peq p1, r1, r2    -- set p1 to true iff r1 == r2
    pne p2, r1, r2    -- set p2 to true iff r1 != r2
    pge p2, r1, r2    -- set p2 to true iff r1 >= r2
    etc.

You also have predicate logical operations like:

    pand p1, p2, p3   -- set p1 to be p2 AND p3
    por  p1, p2, p3   -- set p1 to be p2 OR p3
    pnot p1, p2       -- set p1 to be NOT p2

All instructions can be predicated, e.g.

    fadd f1, f2, f3 (p1)   -- if p1 then f1 = f2 + f3

Consider this code fragment:

    L1:  bge r1, 1900, done:   -- if r1 >= 1900 goto done:
    L2:  blt r1, 30, y2k:      -- if r1 <= 30 goto y2k:
    L3:  addi r1, r1, 1900     -- r1 += 1900
    L4:  br done               -- goto done:
    y2k:
    L5:  addi r1, r1, 2000     -- r1 += 2000
    done:

Convert the above code to use predication. (BTW, there's a typo in the code for line L2 -- the mnemonic says strictly less than, but the comment says less than or equal. You may assume either.)

Lots of possible solutions. The key is to get the same behavior, but to eliminate all branches. Here's one possibility:

    plt  p1, r1, 1900        -- p1 is r1 < 1900
    plt  p2, r1, 30          -- p2 is r1 < 30
    pnot p3, p2              -- p3 = !p2
    pand p4, p1, p2          -- p4 = p1 && p2
    pand p5, p1, p3          -- p5 = p1 && !p2
    addi r1, r1, 1900 (p5)
    addi r1, r1, 2000 (p4)
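A quick way to convince yourself the predicated version matches the original is to model both in software and compare them over a range of inputs. Here's a Python sketch of that check (not part of the assigned solution); it follows the mnemonic's strict "r1 < 30" reading of line L2, per the typo note above:

    # Sketch: model the branchy code and the predicated code, then compare.

    def branchy(r1):
        if r1 >= 1900:          # L1: bge r1, 1900, done
            return r1
        if r1 < 30:             # L2: blt r1, 30, y2k
            return r1 + 2000    # L5: addi r1, r1, 2000
        return r1 + 1900        # L3: addi r1, r1, 1900

    def predicated(r1):
        p1 = r1 < 1900          # plt  p1, r1, 1900
        p2 = r1 < 30            # plt  p2, r1, 30
        p3 = not p2             # pnot p3, p2
        p4 = p1 and p2          # pand p4, p1, p2
        p5 = p1 and p3          # pand p5, p1, p3
        if p5:
            r1 = r1 + 1900      # addi r1, r1, 1900 (p5)
        if p4:
            r1 = r1 + 2000      # addi r1, r1, 2000 (p4)
        return r1

    assert all(branchy(y) == predicated(y) for y in range(-50, 2100))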
Problem 4: Power

You are team leader for the new Furby2000 toy. One of the key design goals is full digital audio processing and speech recognition, performed by the single microprocessor in the toy. The primary performance design requirement is that the Furby2000 processor be able to process a minute of audio input within ten seconds. Your engineering team has greatly exceeded the performance goal: the prototype can process a minute of audio input in 1 second. Unfortunately, the prototype has a battery life of only 15 minutes.

(a) Your goal is to reduce processor power consumption. The current prototype uses a processor running at 5 volts and 100 MHz, but you can scale the voltage down to 3.3 or 2.5 volts (forcing a slower clock frequency of 66 or 50 MHz, respectively). There is also a lower-performance processor available (with the same ISA and CPI) running at 16 MHz and consuming 1/8 of the power at 5 volts. This processor can also be scaled down to 3.3 or 2.5 volts, with the expected slow-down in clock speed. For each processor and voltage combination, estimate the performance and power consumption relative to the original prototype. You may ignore leakage current and threshold voltage effects.

    P = V^2 * C * f

We can make a table:

    Processor   Voltage   Clock      Relative power   Relative speed
    Fast        5 V       100 MHz    1                1
    Fast        3.3 V     66 MHz     0.30             0.67
    Fast        2.5 V     50 MHz     0.125            0.50
    Slow        5 V       16 MHz     0.125            0.167
    Slow        3.3 V     10.7 MHz   0.037            0.107
    Slow        2.5 V     8 MHz      0.016            0.083

The Fast processor at 5 V is the baseline. Clock frequency scales linearly with voltage, so at 3.3 V the power gets reduced by a factor of (2/3)^2 because of the voltage times an additional factor of (2/3) because of the frequency, giving 0.30. The Slow processor at 5 V uses 1/8 of the power and runs 1/6 as fast; scaling its voltage reduces its power and clock by the same factors as for the Fast processor.

(b) If half of the total system power consumption is due to the microprocessor (the other half is mainly the satellite transponder that relays collected marketing data to company headquarters), what will be the battery life of the improved toy, given your best choice from part (a)? You may assume the battery holds a fixed total amount of energy, regardless of power drain (not true in reality).

By our performance requirement, we can slow down by a factor of 10 at most (from 1 second to 10 seconds). Therefore, we can't use the Slow processor at 2.5 V: its relative speed of 0.083 is below the required 0.1. The next lowest power is the Slow processor at 3.3 V, which just meets our performance needs and uses only about 1/27 (approx 0.037) as much power.

Now, if half of the power drain doesn't change, and the other half of the power drain gets 27 times smaller, how much does the total power drain improve? Original power drain = 1; new power drain = 0.5 + 0.5/27. The ratio is 1.93. (This is very similar to an Amdahl's Law calculation.) So the new battery life is 1.93 * 15 minutes, or a bit less than 29 minutes. (Comparable to a Windows CE device. :-)) Even if we cut the processor power to zero, we'd still only get to 30 minutes, since the satellite transponder is chewing up half the power.
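Here's a short Python sketch (again, not part of the assigned solution) that recomputes the table and the battery-life estimate. Like the table above, it treats 3.3 V as roughly 2/3 of 5 V and 2.5 V as 1/2, and it assumes clock frequency scales linearly with voltage:

    # Sketch: recompute the Problem 4 numbers.  Dynamic power ~ C * V^2 * f.

    volt_scale = {5.0: 1.0, 3.3: 2.0 / 3.0, 2.5: 0.5}
    procs = {"Fast": (100.0, 1.0), "Slow": (16.0, 1.0 / 8.0)}  # (clock MHz, rel. power at 5 V)

    for name, (clock5, power5) in procs.items():
        for volts, s in volt_scale.items():
            clock = clock5 * s
            power = power5 * s ** 3   # V^2 from the voltage, another factor of V from the frequency
            speed = clock / 100.0     # relative to the 100 MHz baseline
            print(f"{name}  {volts} V  {clock:6.1f} MHz  power {power:.3f}  speed {speed:.3f}")

    # Part (b): the slowest choice still meeting speed >= 0.1 is Slow at 3.3 V.
    cpu_power = (1.0 / 8.0) * (2.0 / 3.0) ** 3   # ~1/27, i.e. ~0.037
    new_total = 0.5 + 0.5 * cpu_power            # transponder half is unchanged
    print(15.0 / new_total)                      # new battery life: a bit less than 29 minutes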