CPSC 418 --- Advanced Computer Architecture --- Spring 1996
Solution 12 (No Credit Review Assignment)
This homework is for your use in studying for the final (Tue
9 April 12:00pm, in CSCI (Old CS Building) room 201).
Problem 1
- Pipelines
- definition
Divide the processing of instructions into stages. Overlap the
processing of multiple instructions by sending the next
instruction into the first stage as soon as the previous
instruction has finished in the first stage.
- how does pipelining improve the performance
of a microprocessor?
By overlapping the processing of instructions, no single instruction
finishes in any less time, but instructions complete more
frequently (with less delay between successive issues); a worked
sketch at the end of this pipeline item makes the arithmetic concrete.
- give upper and lower bounds on the number of stages in
typical microprocessors' pipeline
5 min, 12 max
- why not have fewer or more stages in a pipeline?
Fewer stages would lengthen the clock cycle because there is more
circuitry within each stage. More stages would mean that too high a
percentage of the clock cycle would be spent in latches and
other control circuitry; also, this would increase the
latency of instructions, causing more data hazards and stalls.
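As a sanity check on the throughput argument, here is a minimal sketch
(the 5-stage depth and instruction count are illustrative assumptions;
it also assumes the pipelined clock period equals one stage delay, with
no hazards or stalls):

    # Cycles to run n instructions through a k-stage pipeline.
    def unpipelined_cycles(n, k):
        return n * k              # each instruction uses all k stages serially

    def pipelined_cycles(n, k):
        return k + (n - 1)        # fill the pipe once, then one finish per cycle

    n, k = 100, 5                 # illustrative values only
    print(unpipelined_cycles(n, k))   # 500
    print(pipelined_cycles(n, k))     # 104: latency per instruction is still
                                      # k cycles, but throughput is nearly 1/cycle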
- Superscalar
- definition
Issue multiple instructions per cycle.
- how does ``superscalar-ing'' improve the performance
of a microprocessor?
More instructions are processed in parallel.
- give upper and lower bounds on the ``superscalar-edness'' in
typical microprocessors
2 -- 4 instructions / cycle.
- why not have more or less ``superscalar-edness'' in a
microprocessor?
Fewer instructions per cycle would not fully utilize the
functional units that can be put on a microprocessor. More
instructions would require more functional units than can be
put on a microprocessor; also, there are fundamental limits
to the amount of instruction-level parallelism in programs
(the sketch below illustrates this limit).
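To make the ``fundamental limits'' point concrete, here is a minimal
sketch (the scheduling model, unit instruction latencies, and the two
example programs are assumptions for illustration) showing that a chain
of dependent instructions gains nothing from a wider issue width:

    # Best-case cycle count for unit-latency instructions at a given width.
    # deps[i] lists the instructions whose results instruction i needs.
    def min_cycles(deps, width):
        cycle_of = {}                       # cycle in which each inst issues
        issued_in = {}                      # number of issues per cycle
        for i, d in enumerate(deps):
            c = max((cycle_of[j] + 1 for j in d), default=0)
            while issued_in.get(c, 0) >= width:
                c += 1                      # all issue slots taken; wait
            cycle_of[i] = c
            issued_in[c] = issued_in.get(c, 0) + 1
        return max(cycle_of.values()) + 1

    chain = [[], [0], [1], [2]]             # each inst depends on the previous
    indep = [[], [], [], []]                # four independent insts
    print(min_cycles(chain, 4))             # 4: width 4 does not help at all
    print(min_cycles(indep, 4))             # 1: all four issue together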
- Caches
- definition
Store currently active code and data in small fast memory.
- how do caches improve the performance
of a microprocessor?
Instruction fetch and data accesses take less time.
- give upper and lower bounds on the number and sizes of caches in
typical microprocessors
1 or 2 levels of on-chip cache (typically about 8--32 KB each),
plus main memory, plus virtual memory
- what design decisions must be made in designing a cache system
Cache size, associativity, replacement policy, and write scheme
(e.g. write-through vs write-back, write-allocate vs no-write-allocate).
- why not have more or bigger caches in a microprocessor?
Given the current technology, increasing the cache size would
be very expensive, but not yield large performance gains.
(current caches give very good performance)
- does increasing the set-associativity of a cache increase or
decrease the hit-rate?
Increase. Increasing the set-associativity from 1-way
(direct-mapped) to 2-way gives almost as good performance as doubling
the cache size. Further increases in set-associativity do not improve
performance as significantly.
- does increasing the set-associativity of a cache increase or
decrease the complexity of the hardware
increase (a sketch of the lookup follows)
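A minimal sketch of the lookup in an n-way set-associative cache may
make the complexity point concrete (the block size, set count, and tag
values are illustrative assumptions); in hardware the membership test
is n tag comparators working in parallel, which is exactly the
circuitry a direct-mapped cache avoids:

    # Address decomposition and lookup for a set-associative cache.
    # 'sets' maps a set index to the list of tags currently stored there.
    def lookup(sets, addr, block_words, n_sets):
        block = addr // block_words        # strip the block offset
        index = block % n_sets             # which set to search
        tag = block // n_sets              # what must match
        return tag in sets.get(index, []) # n comparisons, in parallel in HW

    sets = {3: [17, 42]}                   # one 2-way set holding two tags
    addr = (42 * 16 + 3) * 8               # tag 42, index 3, offset 0
    print(lookup(sets, addr, 8, 16))       # True (hit)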
- Register renaming
- definition
Utilize more physical registers than architected registers.
- how does register renaming improve the performance
of a microprocessor?
Allows instructions to execute out of order while avoiding
WAR and WAW hazards (a sketch of the rename table follows this item).
- what design decisions must be made in designing a register
renaming system
Number of physical registers, and how to deal with interrupts
(including whether to support precise or imprecise interrupts).
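A minimal sketch of the rename table itself (the register counts and
instruction format are assumptions for illustration): every write to an
architected register gets a fresh physical register, so a later write
cannot clobber a value a not-yet-executed instruction still needs
(WAR), and two writes to the same register never collide (WAW):

    # Rename map from architected registers (0-7) to physical registers.
    free = list(range(8, 32))             # physical regs beyond the 8 architected
    rename = {r: r for r in range(8)}     # initial identity mapping

    def rename_inst(dst, srcs):
        phys_srcs = [rename[s] for s in srcs]  # read the current mappings
        new_dst = free.pop(0)                  # allocate a fresh physical reg
        rename[dst] = new_dst                  # later readers now see new_dst
        return new_dst, phys_srcs

    # r1 = r2 + r3 ; r1 = r4 + r5 -- a WAW pair on r1:
    print(rename_inst(1, [2, 3]))         # (8, [2, 3])
    print(rename_inst(1, [4, 5]))         # (9, [4, 5]): no collision on r1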
- Multiple / Duplicate functional units
- give an example of a typical configuration for a
microprocessor with multiple and/or duplicate functional units
2 integer units with overlapped functionality. 2
floating-point units with disjoint functionality. 1
load/store unit. 1 branch unit.
- how do multiple and/or duplicate functional units improve
the performance of a microprocessor?
Increases the number of instructions that can be processed
in parallel.
- what design decisions must be made in designing a
microprocessor with duplicate and/or multiple functional units?
Number of units and what each one does; plus design each unit.
- Interrupts and Exceptions
- list three different causes or uses of interrupts/exceptions
- External devices and events (e.g. keypress)
- O/S control (e.g. protect data-structures,
preserve mutual-exclusion using interrupt levels)
- Handle errors (e.g. divide-by-zero)
- why do interrupts and exceptions complicate microprocessor design?
Interrupts ``interrupt'' the normal flow of instructions, can
come from a very large number of different sources, and multiple
interrupts may occur at the same time. Because they are not
part of the normal flow of instructions, it is easy to
forget about them. Because there are so many different types
and sources of interrupts, it is extremely difficult to
comprehend everything required to handle them correctly.
Similarly, testing a design to make sure it works correctly is
very difficult, because there are so many possibilities to consider.
- briefly describe how a typical microprocessor implements interrupts
- Each interrupt level that can be signalled from an external
device has an associated interrupt request and acknowledge line.
- The interrupting device raises the interrupt line.
- CPU gets into a consistent state by flushing the pipeline,
restoring register values, etc
- CPU saves consistent state on stack or in hardware with
duplicated registers
- CPU sets corresponding interrupt level to true
- CPU acknowledges interrupt using interrupt-ack signal
- Interrupting device puts interrupt vector on address bus
- CPU jumps to address pointed to by interrupt vector in interrupt
vector table.
- when interrupt routine is done, it executes return-from-interrupt
instruction
- return-from-interrupt causes CPU to restore state, continue
execution
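The vectoring steps amount to an indexed jump through a table. A
minimal sketch of just that dispatch (the handler names and table
contents are made up for illustration):

    # Dispatch through an interrupt vector table.
    def keypress_handler():  return "handled keypress"
    def div_zero_handler():  return "handled divide-by-zero"

    vector_table = [keypress_handler, div_zero_handler]

    def take_interrupt(vector):
        saved = "PC, PSW, registers"      # stand-in for the saved context
        result = vector_table[vector]()   # jump via the table entry
        return result, saved              # return-from-interrupt restores state

    print(take_interrupt(0))   # ('handled keypress', 'PC, PSW, registers')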
Problem 2
We talked about the effects of the Speed Demon vs Brainiac decision on
the instruction set architecture. What effects might this decision
have on other aspects of a CPU? In particular: the memory system,
interrupt handling, branch prediction, and register renaming.
- Memory system
- A Speed Demon will fetch more instructions per second than a Brainiac;
hence the speed of the instruction-cache is more important for Speed
Demons than for Brainiacs. Similarly, the less complex Speed Demon
instructions mean that Speed Demons require a larger inst-cache. This is
also true for data-caches, but to a lesser degree, because most
operations will use values in the register file, and so the data-cache
will not be as heavily impacted as the inst-cache.
Because the Speed Demon executes more instructions per second, cache
misses have a larger penalty in clock cycles for Speed Demons than for
Brainiacs. Thus, reducing the length of time for a cache miss is just as
important as reducing the chances of getting a cache miss.
- Interrupts
- A Speed Demon is less likely to have register renaming or other
extremely complex control circuits. This may make interrupt handling in
a Speed Demon a little simpler than for a Brainiac, but this is only a
small difference.
- Branch Prediction
- Just as cache misses are more detrimental in a Speed Demon than in a
Brainiac, branch misprediction also takes a higher toll in lost
instructions. This is primarily because Speed Demons typically require
more pipe stages to decode instructions and resolve branches.
The high clock speeds in a Speed Demon mean that highly complex branch
prediction schemes are infeasible or, if used, will require extra
hardware to make them fast enough.
- Register Renaming
- The high clock speeds make register renaming more difficult to
implement for Speed Demons (however, the R10000 has register renaming
and is something of a Speed Demon).
Problem 3
Register renaming is critical for reducing stalls due to data hazards
in ``register starved'' architectures such as the Intel x86. Can
register renaming be used to reduce the memory traffic in a function
call caused by storing parameters on the stack?
If yes, explain how. If not, explain why not and give an example of
a way to decrease the amount of time spent storing and loading
function parameters to and from the stack.
No. Register renaming creates more physical registers, not more
architected registers. It is the programmer-visible (architected)
registers that a program saves on the stack with store instructions
in order to pass parameters to functions.
Programs will use store instructions to save registers on the
stack and then load instructions to get the values off the
stack. In examining a program trace, we will see a store
followed closely by a load accessing that same address. A fast
cache will reduce the time required for these instructions. In
addition, a queue that holds stores waiting for cache access,
and that can also be searched by loads, will help prevent stalls
caused by cache misses. This is done in the Intel P6 using the
Memory Order Buffer, which is similar to a set of Reservation
Stations. (A sketch of such a store queue follows.)
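A minimal sketch of the store-forwarding idea (a generic store queue,
not the actual P6 Memory Order Buffer): a load first searches the
pending stores, newest first, and only goes to the cache when nothing
matches:

    # A store queue with load forwarding.
    store_queue = []            # (address, value) pairs awaiting the cache
    memory = {}                 # stand-in for the cache / main memory

    def store(addr, value):
        store_queue.append((addr, value))

    def load(addr):
        for a, v in reversed(store_queue):   # newest matching store wins
            if a == addr:
                return v                     # forwarded; no cache access
        return memory.get(addr, 0)

    store(0x100, 7)             # push a parameter onto the "stack"
    print(load(0x100))          # 7, served from the queue, not the cache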
Problem 4
Two microprocessors (R, for ``rabbit'' and T for ``turtle'') from
different architectural families have the same memory hierarchy
(levels of cache, sizes, speed of memory, etc).
The clock speed for Micro-R is 1.5 times
faster than the clock in Micro-T.
For the SPEC benchmarks on Micro-T:
    average memory latency                       | 3 cycles
    average CPI for a load instruction
      (including memory access time)             | 2.4 cycles
Problem 4.a
Do you have enough information to calculate the average memory latency
in clock cycles for the SPEC benchmarks running on Microprocessor R?
If so, do so. If not, explain what additional information you need.
The memory latency is the time for a read from or write to memory to
complete. This latency, as measured in real time (wall time), is
determined solely by the memory hierarchy and the size and
distribution of code and data. Because roughly the same program runs
on both machines (i.e. the same source code, but different compiled
code, since the two machines come from different architectural
families), we estimate that the distribution and size of the code and
data will be fairly similar for the two machines. The memory latency
in real time, or wall time, will therefore be the same for the two
microprocessors. Thus:
Clock_R = Clock_T / 1.5
(where Clock denotes the clock period) because the clock for Rabbit is
faster, and therefore has a shorter period.
    Latency_T * Clock_T = Time
    Latency_R * Clock_R = Time

    Latency_T * Clock_T = Latency_R * Clock_R
    Latency_R = Latency_T * Clock_T / Clock_R
              = 3 * 1.5
              = 4.5 cycles
Problem 4.b
Do you have enough information to calculate the average CPI for a load
instruction (using the SPEC benchmarks) for Microprocessor R? If so,
do so. If not, explain what additional information you need.
No, we don't have enough information. We need to know the internal
configuration of the Micro-R pipeline, the different addressing
modes that are supported, and so on. The CPI for a load is
determined both by the memory hierarchy and by how the instruction
goes through the pipeline.
Problem 5
You are on the design team for a next-generation microprocessor and
are contemplating two different choices for data-caches.
- Information common to Option 1 and 2
  - 30% of cache accesses are writes
  - at any point in time, 25% of blocks in cache have been modified
  - Main-memory: 20 cycle access time
- Option 1
  - 95% hit rate
  - 8 words per block
  - 1 cycle access time
  - Transfer rate between registers and L1-cache: 1 word/cycle
  - Transfer rate between L1-cache and main memory: 8 words/cycle
  - not-last-used replacement scheme
- Option 2
  - 97% hit rate
  - 16 words per block
  - 1 cycle access time
  - Transfer rate between registers and L1-cache: 1 word/cycle
  - Transfer rate between L1-cache and main memory: 16 words/cycle
  - not-last-used replacement scheme
Problem 5.a
Calculate the average data access time for each cache if the caches
use write-through with no-write-allocate on write miss. How
much faster is the faster access time?
Option 1
From Handout 12:
T_Avg = (%Write)(T_Acc1) + (%Read)((%Hit1)(T_Acc1) + (%Miss1)(T_Fill1))
T_Fill1 = T_Acc1 + T_Acc2 + T_Xfr1_2
        = 1 + 20 + (8 words/block)/(8 words/cycle)
        = 22 cycles
T_Avg = 0.30*1 + 0.70*(0.95*1 + 0.05*22)
      = 1.74 cycles
Option 2
T_Fill1 = T_Acc1 + T_Acc2 + T_Xfr1_2
        = 1 + 20 + (16 words/block)/(16 words/cycle)
        = 22 cycles
T_Avg = 0.30*1 + 0.70*(0.97*1 + 0.03*22)
      = 1.44 cycles
Comparison
Option 2 is faster
Option 2 is (T_Avg1 - T_Avg2)/T_Avg2 faster:
(1.74 - 1.44)/1.44 = 21% faster
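The two averages above can be double-checked with a short script using
the handout formula quoted at the start of this answer (the percentages
and timings are exactly those given for the two options):

    # Check of Problem 5.a: write-through, no-write-allocate.
    def t_avg(hit, t_l1, t_mem, block_words, words_per_cycle, writes=0.30):
        t_fill = t_l1 + t_mem + block_words / words_per_cycle
        reads = 1 - writes
        return writes * t_l1 + reads * (hit * t_l1 + (1 - hit) * t_fill)

    print(t_avg(0.95, 1, 20, 8, 8))     # 1.735 -> Option 1, ~1.74 cycles
    print(t_avg(0.97, 1, 20, 16, 16))   # 1.441 -> Option 2, ~1.44 cycles
    print((1.735 - 1.441) / 1.441)      # ~0.20; ~21% with the rounded figures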
Problem 5.b
Assuming that the CPI for loads and stores is 1 + T_Avg (from Handout 12), and using the
instruction mix and CPI information from Handout 8, how much faster will the
total performance be for the faster option?
    Instruction    | Dynamic Count | CPI
    ---------------|---------------|-----------------
    Arith/Logical  | 35.0%         | 1.7
    Load/Store     | 28.8%         | 2.74 (Option 1)
    Load/Store     | 28.8%         | 2.44 (Option 2)
    Control        |  9.8%         | 2.3
    Floating Point |  7.5%         | 3.7
    Other          | 18.9%         | 1.5
Performance is total time to execute a program:
Time = #Inst * CPI * Clock
In this case, the only thing that changes is CPI.
CPI = %Inst_Arith*CPI_Arith + %Inst_LS*CPI_LS +
      %Inst_Ctl*CPI_Ctl + %Inst_FP*CPI_FP + %Inst_Other*CPI_Other
    = 0.35*1.7 + 0.288*CPI_LS + 0.098*2.3 + 0.075*3.7 + 0.189*1.5
    = 1.38 + 0.288*CPI_LS
Plugging in the value for Option 1:
CPI_1 = 1.38 + 0.288*2.74
      = 2.17
Plugging in the value for Option 2:
CPI_2 = 1.38 + 0.288*2.44
      = 2.08
CPI_2 is n% faster than CPI_1:
n = (CPI_1 - CPI_2)/CPI_2
  = (2.17 - 2.08)/2.08
  = 4.1% (carrying full precision through the intermediate values)
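The corrected arithmetic can be checked directly (the instruction mix
and CPIs are the Handout 8 values tabulated above):

    # Check of Problem 5.b: weighted CPI for each cache option.
    mix = [(0.35, 1.7), (0.098, 2.3), (0.075, 3.7), (0.189, 1.5)]
    base = sum(f * c for f, c in mix)     # 1.3814: the non-load/store part

    cpi_1 = base + 0.288 * 2.74           # ~2.17 (Option 1)
    cpi_2 = base + 0.288 * 2.44           # ~2.08 (Option 2)
    print(base, cpi_1, cpi_2)
    print((cpi_1 - cpi_2) / cpi_2)        # ~0.041 -> Option 2 is ~4.1% faster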
Problem 5.c
After the initial meeting, you realize that Option 2 will delay the
release date of the microprocessor by three weeks, will this change
your decision?
From Homework 02 (and its solution), on average
performance doubles every two years.
Thus, in order to keep up with the competition, we need to
increase performance by at least:
2^(3/(2*52)) = 1.02, i.e. about 2%
The percentage improvement is 4.1%, which is more than 2%, thus
we should still go with Option 2.
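The break-even figure is just the doubling rate compounded over a
three-week slip (assuming 52 weeks per year):

    # Check of Problem 5.c: required speedup for a 3-week delay.
    required = 2 ** (3 / (2 * 52))   # performance doubles every two years
    print(required)                  # ~1.020 -> about a 2% improvement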
Problem 6
Problem 6.a
Approximately how many transistors are on a current microprocessor?
5 -- 10 million
Problem 6.b
Approximately how many transistors do you expect to see on the next
generation of microprocessors (ie ones that come out within the
next year or two)?
10 -- 15 million
Problem 6.c
What factors allow the number of transistors per microprocessor
to increase, and how do the factors allow the increase?
- The most important factor is ``feature-size'', the smallest
dimension that can be patterned on a chip. The smaller
the feature size, the smaller the transistors, and hence
the more transistors that can be fit into a given area.
- Fabrication technology has also improved to allow fewer
fabrication defects per area, hence die sizes have
increased. The larger the die, the more transistors that
can fit on it.
- Better CAD tools allow designers to pack transistors
closer together. This increases the number of transistors that
can fit in a given area, and thereby increases the number of
transistors on a chip.
Problem 6.d
If you were a chip designer, what two things would you consider using
the additional transistors for, and how would you decide which one of the
two choices was preferable?
- Increase cache size
- More functional units
The additional transistors will undoubtedly go both to larger
caches (including an on-chip L2 cache) and to more functional units
(which will also probably increase the number of instructions
that can be issued in parallel).
The tradeoff for how many transistors should be used for what
will be decided by performance analysis using simulators to
predict the performance of the chip before it is fully designed.
Last modified: 29 Mar 96