CPSC 418 --- Advanced Computer Architecture --- Spring 1996
Solution 12 (No Credit Review Assignment)
This homework is for your use in studying for the final (Tue
9 April 12:00pm, in CSCI (Old CS Building) room 201).
Problem 1
- Pipelines
- definition
Divide the processing of instructions into stages. Overlap the
processing of multiple instructions by sending the next
instruction into the first stage as soon as the previous
instruction has finished in the first stage.
- how does pipelining improve the performance
of a microprocessor?
By overlapping the processing of instructions, no single instruction
finishes in any less time, but instructions complete more
frequently (with less delay between successive issues); a worked
sketch at the end of this pipeline item makes the arithmetic concrete.
- give upper and lower bounds on the number of stages in
typical microprocessors' pipeline
5 min, 12 max
- why not have fewer or more stages in a pipeline?
Fewer stages would lengthen the clock cycle because there is more
circuitry within each stage. More stages would mean that too high a
percentage of the clock cycle would be spent in latches and
other control circuitry; also, this would increase the
latency of instructions, causing more data hazards and stalls.
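As a sanity check on the throughput argument, here is a minimal sketch
(the 5-stage depth and instruction count are illustrative assumptions;
it also assumes the pipelined clock period equals one stage delay, with
no hazards or stalls):

    # Cycles to run n instructions through a k-stage pipeline.
    def unpipelined_cycles(n, k):
        return n * k              # each instruction uses all k stages serially

    def pipelined_cycles(n, k):
        return k + (n - 1)        # fill the pipe once, then one finish per cycle

    n, k = 100, 5                 # illustrative values only
    print(unpipelined_cycles(n, k))   # 500
    print(pipelined_cycles(n, k))     # 104: latency per instruction is still
                                      # k cycles, but throughput is nearly 1/cycle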
- Superscalar
- definition
Issue multiple instructions per cycle.
- how does ``superscalar-ing'' improve the performance
of a microprocessor?
More instructions are processed in parallel.
- give upper and lower bounds on the ``superscalar-edness'' in
typical microprocessors
2 -- 4 instructions / cycle.
- why not have more or less ``superscalar-edness'' in a
microprocessor?
Fewer instructions per cycle would not fully utilize the
functional units that can be put on a microprocessor. More
instructions would require more functional units than can be
put on a microprocessor; also, there are fundamental limits
to the amount of instruction-level parallelism in programs
(the sketch below illustrates this limit).
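To make the ``fundamental limits'' point concrete, here is a minimal
sketch (the scheduling model, unit instruction latencies, and the two
example programs are assumptions for illustration) showing that a chain
of dependent instructions gains nothing from a wider issue width:

    # Best-case cycle count for unit-latency instructions at a given width.
    # deps[i] lists the instructions whose results instruction i needs.
    def min_cycles(deps, width):
        cycle_of = {}                       # cycle in which each inst issues
        issued_in = {}                      # number of issues per cycle
        for i, d in enumerate(deps):
            c = max((cycle_of[j] + 1 for j in d), default=0)
            while issued_in.get(c, 0) >= width:
                c += 1                      # all issue slots taken; wait
            cycle_of[i] = c
            issued_in[c] = issued_in.get(c, 0) + 1
        return max(cycle_of.values()) + 1

    chain = [[], [0], [1], [2]]             # each inst depends on the previous
    indep = [[], [], [], []]                # four independent insts
    print(min_cycles(chain, 4))             # 4: width 4 does not help at all
    print(min_cycles(indep, 4))             # 1: all four issue together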
- Caches
- definition
Store currently active code and data in small fast memory.
- how do caches improve the performance
of a microprocessor?
Instruction fetch and data accesses take less time.
- give upper and lower bounds on the number and sizes of caches in
typical microprocessors
1 or 2 levels of on-chip cache (typically about 8--32 KB each),
plus main memory, plus virtual memory
- what design decisions must be made in designing a cache system
Cache size, associativity, replacement policy, and write scheme
(e.g. write-through vs write-back, write-allocate vs no-write-allocate).
- why not have more or bigger caches in a microprocessor?
Given the current technology, increasing the cache size would
be very expensive, but not yield large performance gains.
(current caches give very good performance)
- does increasing the set-associativity of a cache increase or
decrease the hit-rate?
Increase. Increasing the set-associativity from 1-way
(direct-mapped) to 2-way gives almost as good performance as doubling
the cache size. Further increases in set-associativity do not improve
performance as significantly.
- does increasing the set-associativity of a cache increase or
decrease the complexity of the hardware
increase (a sketch of the lookup follows)
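A minimal sketch of the lookup in an n-way set-associative cache may
make the complexity point concrete (the block size, set count, and tag
values are illustrative assumptions); in hardware the membership test
is n tag comparators working in parallel, which is exactly the
circuitry a direct-mapped cache avoids:

    # Address decomposition and lookup for a set-associative cache.
    # 'sets' maps a set index to the list of tags currently stored there.
    def lookup(sets, addr, block_words, n_sets):
        block = addr // block_words        # strip the block offset
        index = block % n_sets             # which set to search
        tag = block // n_sets              # what must match
        return tag in sets.get(index, []) # n comparisons, in parallel in HW

    sets = {3: [17, 42]}                   # one 2-way set holding two tags
    addr = (42 * 16 + 3) * 8               # tag 42, index 3, offset 0
    print(lookup(sets, addr, 8, 16))       # True (hit)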
- Register renaming
- definition
Utilize more physical registers than architected registers.
- how does register renaming improve the performance
of a microprocessor?
Allows instructions to execute out of order while avoiding
WAR and WAW hazards (a sketch of the rename table follows this item).
- what design decisions must be made in designing a register
renaming system
Number of physical registers, and how to deal with interrupts
(including whether to support precise or imprecise interrupts).
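A minimal sketch of the rename table itself (the register counts and
instruction format are assumptions for illustration): every write to an
architected register gets a fresh physical register, so a later write
cannot clobber a value a not-yet-executed instruction still needs
(WAR), and two writes to the same register never collide (WAW):

    # Rename map from architected registers (0-7) to physical registers.
    free = list(range(8, 32))             # physical regs beyond the 8 architected
    rename = {r: r for r in range(8)}     # initial identity mapping

    def rename_inst(dst, srcs):
        phys_srcs = [rename[s] for s in srcs]  # read the current mappings
        new_dst = free.pop(0)                  # allocate a fresh physical reg
        rename[dst] = new_dst                  # later readers now see new_dst
        return new_dst, phys_srcs

    # r1 = r2 + r3 ; r1 = r4 + r5 -- a WAW pair on r1:
    print(rename_inst(1, [2, 3]))         # (8, [2, 3])
    print(rename_inst(1, [4, 5]))         # (9, [4, 5]): no collision on r1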
- Multiple / Duplicate functional units
- give an example of a typical configuration for a
microprocessor with multiple and/or duplicate functional units
2 integer units with overlapped functionality. 2
floating-point units with disjoint functionality. 1
load/store unit. 1 branch unit.
- how do multiple and/or duplicate functional units improve
the performance of a microprocessor?
Increases the number of instructions that can be processed
in parallel.
- what design decisions must be made in designing a
microprocessor with duplicate and/or multiple functional units?
Number of units and what each one does; plus design each unit.
- Interrupts and Exceptions
- list three different causes or uses of interrupts/exceptions
- External devices and events (e.g. keypress)
- O/S control (e.g. protect data-structures,
preserve mutual-exclusion using interrupt levels)
- Handle errors (e.g. divide-by-zero)
- why do interrupts and exceptions complicate microprocessor design?
Interrupts ``interrupt'' the normal flow of instructions, can
come from a very large number of different sources, and multiple
interrupts may occur at the same time. Because they are not
part of the normal flow of instructions, it is easy to
forget about them. Because there are so many different types
and sources of interrupts, it is extremely difficult to
comprehend everything required to handle them correctly.
Similarly, testing a design to make sure it works correctly is
very difficult, because there are so many possibilities to consider.
- briefly describe how a typical microprocessor implements interrupts
- Each interrupt level that can be signalled from an external
device has an associated interrupt request and acknowledge line.
- The interrupting device raises the interrupt line.
- CPU gets into a consistent state by flushing the pipeline,
restoring register values, etc
- CPU saves consistent state on stack or in hardware with
duplicated registers
- CPU sets corresponding interrupt level to true
- CPU acknowledges interrupt using interrupt-ack signal
- Interrupting device puts interrupt vector on address bus
- CPU jumps to address pointed to by interrupt vector in interrupt
vector table.
- when interrupt routine is done, it executes return-from-interrupt
instruction
- return-from-interrupt causes CPU to restore state, continue
execution
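The vectoring steps amount to an indexed jump through a table. A
minimal sketch of just that dispatch (the handler names and table
contents are made up for illustration):

    # Dispatch through an interrupt vector table.
    def keypress_handler():  return "handled keypress"
    def div_zero_handler():  return "handled divide-by-zero"

    vector_table = [keypress_handler, div_zero_handler]

    def take_interrupt(vector):
        saved = "PC, PSW, registers"      # stand-in for the saved context
        result = vector_table[vector]()   # jump via the table entry
        return result, saved              # return-from-interrupt restores state

    print(take_interrupt(0))   # ('handled keypress', 'PC, PSW, registers')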
Problem 2
We talked about the effects of the Speed Demon vs Brainiac decision on
the instruction set architecture. What effects might this decision
have on other aspects of a CPU? In particular: the memory system,
interrupt handling, branch prediction, and register renaming.
- Memory system
- A Speed Demon will fetch more instructions per second than a Brainiac;
hence the speed of the instruction-cache is more important for Speed
Demons than for Brainiacs. Similarly, the less complex Speed Demon
instructions mean that Speed Demons require a larger inst-cache. This is
also true for data-caches, but to a lesser degree, because most
operations will use values in the register file, and so the data-cache
will not be as heavily impacted as the inst-cache.
Because the Speed Demon executes more instructions per second, cache
misses have a larger penalty in clock cycles for Speed Demons than for
Brainiacs. Thus, reducing the length of time for a cache miss is just as
important as reducing the chances of getting a cache miss.
- Interrupts
- A Speed Demon is less likely to have register renaming or other
extremely complex control circuits. This may make interrupt handling in
a Speed Demon a little simpler than for a Brainiac, but this is only a
small difference.
- Branch Prediction
- Just as cache misses are more detrimental in a Speed Demon than in a
Brainiac, branch misprediction also takes a higher toll in lost
instructions. This is primarily because Speed Demons typically require
more pipe stages to decode instructions and resolve branches.
The high clock speeds in a Speed Demon mean that highly complex branch
prediction schemes are infeasible or, if used, will require extra
hardware to make them fast enough.
- Register Renaming
- The high clock speeds make register renaming more difficult to
implement for Speed Demons (however, the R10000 has register renaming
and is something of a Speed Demon).
Problem 3
Register renaming is critical for reducing stalls due to data hazards
in ``register starved'' architectures such as the Intel x86. Can
register renaming be used to reduce the memory traffic in a function
call caused by storing parameters on the stack?
If yes, explain how. If not, explain why not and give an example of
a way to decrease the amount of time spent storing and loading
function parameters to and from the stack.
No. Register renaming creates more physical registers, not more
architected registers. It is the programmer-visible (architected)
registers that a program saves on the stack with store instructions
in order to pass parameters to functions.
Programs will use store instructions to save registers on the
stack and then load instructions to get the values off the
stack. In examining a program trace, we will see a store
followed closely by a load accessing that same address. A fast
cache will reduce the time required for these instructions. In
addition, a queue that holds stores waiting for cache access,
and that can also be searched by loads, will help prevent stalls
caused by cache misses. This is done in the Intel P6 using the
Memory Order Buffer, which is similar to a set of Reservation
Stations. (A sketch of such a store queue follows.)
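A minimal sketch of the store-forwarding idea (a generic store queue,
not the actual P6 Memory Order Buffer): a load first searches the
pending stores, newest first, and only goes to the cache when nothing
matches:

    # A store queue with load forwarding.
    store_queue = []            # (address, value) pairs awaiting the cache
    memory = {}                 # stand-in for the cache / main memory

    def store(addr, value):
        store_queue.append((addr, value))

    def load(addr):
        for a, v in reversed(store_queue):   # newest matching store wins
            if a == addr:
                return v                     # forwarded; no cache access
        return memory.get(addr, 0)

    store(0x100, 7)             # push a parameter onto the "stack"
    print(load(0x100))          # 7, served from the queue, not the cache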
Problem 4
Two microprocessors (R, for ``rabbit'' and T for ``turtle'') from
different architectural families have the same memory hierarchy
(levels of cache, sizes, speed of memory, etc).
The clock speed for Micro-R is 1.5 times
faster than the clock in Micro-T.
For the SPEC benchmarks on Micro-T:
    average memory latency                       | 3 cycles
    average CPI for a load instruction
      (including memory access time)             | 2.4 cycles
Problem 4.a
Do you have enough information to calculate the average memory latency
in clock cycles for the SPEC benchmarks running on Microprocessor R?
If so, do so. If not, explain what additional information you need.
The memory latency is the time for a read from or write to memory to
complete. This latency, as measured in real time (wall time), is
determined solely by the memory hierarchy and the size and
distribution of code and data. Because roughly the same program runs
on both machines (i.e. the same source code, but different compiled
code, since the two machines come from different architectural
families), we estimate that the distribution and size of the code and
data will be fairly similar for the two machines. The memory latency
in real time, or wall time, will therefore be the same for the two
microprocessors. Thus:
Clock_R = Clock_T / 1.5
(where Clock denotes the clock period) because the clock for Rabbit is
faster, and therefore has a shorter period.
    Latency_T * Clock_T = Time
    Latency_R * Clock_R = Time

    Latency_T * Clock_T = Latency_R * Clock_R
    Latency_R = Latency_T * Clock_T / Clock_R
              = 3 * 1.5
              = 4.5 cycles
Problem 4.b
Do you have enough information to calculate the average CPI for a load
instruction (using the SPEC benchmarks) for Microprocessor R? If so,
do so. If not, explain what additional information you need.
No, we don't have enough information. We need to know the internal
configuration of the Micro-R pipeline, the different addressing
modes that are supported, and so on. The CPI for a load is
determined both by the memory hierarchy and by how the instruction
goes through the pipeline.
Problem 5
You are on the design team for a next-generation microprocessor and
are contemplating two different choices for data-caches.
- Information common to Option 1 and 2
  - 30% of cache accesses are writes
  - at any point in time, 25% of blocks in cache have been modified
  - Main-memory: 20 cycle access time
- Option 1
  - 95% hit rate
  - 8 words per block
  - 1 cycle access time
  - Transfer rate between registers and L1-cache: 1 word/cycle
  - Transfer rate between L1-cache and main memory: 8 words/cycle
  - not-last-used replacement scheme
- Option 2
  - 97% hit rate
  - 16 words per block
  - 1 cycle access time
  - Transfer rate between registers and L1-cache: 1 word/cycle
  - Transfer rate between L1-cache and main memory: 16 words/cycle
  - not-last-used replacement scheme
Problem 5.a
Calculate the average data access time for each cache if the caches
use write-through with no-write-allocate on write miss. How
much faster is the faster access time?
Option 1
From Handout 12:
T_Avg = (%Write)(T_Acc1) + (%Read)((%Hit1)(T_Acc1) + (%Miss1)(T_Fill1))
T_Fill1 = T_Acc1 + T_Acc2 + T_Xfr1_2
        = 1 + 20 + (8 words/block)/(8 words/cycle)
        = 22 cycles
T_Avg = 0.30*1 + 0.70*(0.95*1 + 0.05*22)
      = 1.74 cycles
Option 2
T_Fill1 = T_Acc1 + T_Acc2 + T_Xfr1_2
        = 1 + 20 + (16 words/block)/(16 words/cycle)
        = 22 cycles
T_Avg = 0.30*1 + 0.70*(0.97*1 + 0.03*22)
      = 1.44 cycles
Comparison
Option 2 is faster
Option 2 is (T_Avg1 - T_Avg2)/T_Avg2 faster:
(1.74 - 1.44)/1.44 = 21% faster
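The two averages above can be double-checked with a short script using
the handout formula quoted at the start of this answer (the percentages
and timings are exactly those given for the two options):

    # Check of Problem 5.a: write-through, no-write-allocate.
    def t_avg(hit, t_l1, t_mem, block_words, words_per_cycle, writes=0.30):
        t_fill = t_l1 + t_mem + block_words / words_per_cycle
        reads = 1 - writes
        return writes * t_l1 + reads * (hit * t_l1 + (1 - hit) * t_fill)

    print(t_avg(0.95, 1, 20, 8, 8))     # 1.735 -> Option 1, ~1.74 cycles
    print(t_avg(0.97, 1, 20, 16, 16))   # 1.441 -> Option 2, ~1.44 cycles
    print((1.735 - 1.441) / 1.441)      # ~0.20; ~21% with the rounded figures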
Problem 5.b
Assuming that the CPI for loads and stores is 1 + T_Avg (from Handout 12), and using the
instruction mix and CPI information from Handout 8, how much faster will the
total performance be for the faster option?
    Instruction    | Dynamic Count | CPI
    ---------------|---------------|-----------------
    Arith/Logical  | 35.0%         | 1.7
    Load/Store     | 28.8%         | 2.74 (Option 1)
    Load/Store     | 28.8%         | 2.44 (Option 2)
    Control        |  9.8%         | 2.3
    Floating Point |  7.5%         | 3.7
    Other          | 18.9%         | 1.5
Performance is total time to execute a program:
Time = #Inst * CPI * Clock
In this case, the only thing that changes is CPI.
CPI = %Inst_Arith*CPI_Arith + %Inst_LS*CPI_LS +
      %Inst_Ctl*CPI_Ctl + %Inst_FP*CPI_FP + %Inst_Other*CPI_Other
    = 0.35*1.7 + 0.288*CPI_LS + 0.098*2.3 + 0.075*3.7 + 0.189*1.5
    = 1.38 + 0.288*CPI_LS
Plugging in the value for Option 1:
CPI_1 = 1.38 + 0.288*2.74
      = 2.17
Plugging in the value for Option 2:
CPI_2 = 1.38 + 0.288*2.44
      = 2.08
CPI_2 is n% faster than CPI_1:
n = (CPI_1 - CPI_2)/CPI_2
  = (2.17 - 2.08)/2.08
  = 4.1% (carrying full precision through the intermediate values)
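The corrected arithmetic can be checked directly (the instruction mix
and CPIs are the Handout 8 values tabulated above):

    # Check of Problem 5.b: weighted CPI for each cache option.
    mix = [(0.35, 1.7), (0.098, 2.3), (0.075, 3.7), (0.189, 1.5)]
    base = sum(f * c for f, c in mix)     # 1.3814: the non-load/store part

    cpi_1 = base + 0.288 * 2.74           # ~2.17 (Option 1)
    cpi_2 = base + 0.288 * 2.44           # ~2.08 (Option 2)
    print(base, cpi_1, cpi_2)
    print((cpi_1 - cpi_2) / cpi_2)        # ~0.041 -> Option 2 is ~4.1% faster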
Problem 5.c
After the initial meeting, you realize that Option 2 will delay the
release date of the microprocessor by three weeks, will this change
your decision?
From Homework 02 (and its solution), on average
performance doubles every two years.
Thus, in order to keep up with the competition, we need to
increase performance by at least:
2^(3/(2*52)) = 1.02, i.e. about 2%
The percentage improvement is 4.1%, which is more than 2%, thus
we should still go with Option 2.
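The break-even figure is just the doubling rate compounded over a
three-week slip (assuming 52 weeks per year):

    # Check of Problem 5.c: required speedup for a 3-week delay.
    required = 2 ** (3 / (2 * 52))   # performance doubles every two years
    print(required)                  # ~1.020 -> about a 2% improvement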
Problem 6
Problem 6.a
Approximately how many transistors are on a current microprocessor?
5 -- 10 million
Problem 6.b
Approximately how many transistors do you expect to see on the next
generation of microprocessors (ie ones that come out within the
next year or two)?
10 -- 15 million
Problem 6.c
What factors allow the number of transistors per microprocessor
to increase, and how do the factors allow the increase?
- The most important factor is ``feature-size'', the smallest
dimension that can be patterned on a chip. The smaller
the feature size, the smaller the transistors, and hence
the more transistors that can be fit into a given area.
- Fabrication technology has also improved to allow fewer
fabrication defects per area, hence die sizes have
increased. The larger the die, the more transistors that
can fit on it.
- Better CAD tools allow designers to pack transistors
closer together. This increases the number of transistors that
can fit in a given area, and thereby increases the number of
transistors on a chip.
Problem 6.d
If you were a chip designer, what two things would you consider using
the additional transistors for, and how would you decide which one of the
two choices was preferable?
- Increase cache size
- More functional units
The additional transistors will undoubtedly go both to larger
caches (including an on-chip L2 cache) and to more functional units
(which will also probably increase the number of instructions
that can be issued in parallel).
The tradeoff for how many transistors should be used for what
will be decided by performance analysis using simulators to
predict the performance of the chip before it is fully designed.
Last modified: 29 Mar 96