Report Number: CSL-TR-96-693
Institution: Stanford University, Computer Systems Laboratory
Title: High-Performance CMOS System Design Using Wave Pipelining
Author: Nowka, Kevin J.
Date: January 1996
Abstract: Wave pipelining, or maximum rate pipelining, is a circuit
design technique that allows digital synchronous systems to
be clocked at rates higher than can be achieved with
conventional pipelining techniques. It relies on the
predictable finite signal propagation delay through
combinational logic for virtual data storage. Wave pipelining
of combinational circuits has been shown to achieve clock
rates 2 to 7 times those possible for the same circuits with
conventional pipelining.
Conventional pipelined systems allow data to propagate from a
register through the combinational network to another
register prior to initiating the subsequent data transfer.
Thus, the maximum operating frequency is determined by the
maximum propagation delay through the longest pipeline stage.
Wave pipeline systems apply the subsequent data to the network
as soon as it can be guaranteed that it will not interfere with
the current data wave. The maximum operating frequency of a
wave pipeline is therefore determined by the difference
between the maximum propagation delay and the minimum
propagation delay through the combinational logic.
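The timing relationship described above can be illustrated with a minimal sketch. It assumes an idealized model in which register setup time, clock skew, and other clocking overheads are ignored. The function names and delay values are hypothetical and are not taken from the report.

```python
def conventional_min_period(d_max):
    """Conventional pipelining: one data item occupies a stage at a time,
    so the clock period must cover the longest propagation delay."""
    return d_max


def wave_min_period(d_max, d_min):
    """Idealized wave pipelining: successive waves must be spaced by at least
    the spread between the slowest and fastest paths (overheads ignored)."""
    if d_min <= 0 or d_min > d_max:
        raise ValueError("require 0 < d_min <= d_max")
    return d_max - d_min


# Hypothetical delays in nanoseconds, chosen only for illustration.
d_max, d_min = 10.0, 8.0
t_conv = conventional_min_period(d_max)
t_wave = wave_min_period(d_max, d_min)
speedup = t_conv / t_wave
print(t_conv, t_wave, speedup)
```

In this simplified model the achievable speedup grows as the minimum delay approaches the maximum delay, which is why the delay spread is the quantity to minimize.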
Minimizing variations in delay maximizes the performance of
wave pipelining. Data wave interference in CMOS VLSI
circuits is the result of the variation in the propagation
delay due to path length differences, differences in the
state of the network inputs and intermediate nodes, and
differences in fabrication and environmental conditions.
To maximize the performance of wave pipelined circuits, the
path length variations through the combinational logic must
be minimized. A method of modifying the transistor geometries
of individual static CMOS gates so as to tune their delays
has been developed. This method is used by CAD tools that
minimize the path length variation. These tools are used to
equalize delays within a wave pipelined logic block and to
synchronize separate wave pipelined units which share a
common reference clock. This method has been demonstrated to
limit the variation in delay of CMOS circuits to less than
20%.
Delay models have demonstrated that temperature variation,
power supply variation, and noise limit the number of
concurrent waves in CMOS wave pipelined systems to three or
fewer.
Run-to-run process variation can have a significant impact on
CMOS VLSI signal propagation delay. The ratio of maximum to
minimum delay along the same path for seven different runs of
a 0.8-micron feature size fabrication process was found to be
1.35. Unless this variation is controlled, the speedup of
wave pipelining must be limited to a factor of two to three to
ensure that devices from any of these runs will operate. When
aggregated with variations due to environmental factors, the
maximum speedup of a wave pipeline is less than two.
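As a rough illustration of why delay variation caps the speedup: under an idealized model in which the clock period is bounded only by the difference between maximum and minimum delay, a max-to-min delay ratio r gives an upper bound of r / (r - 1). The sketch below is a simplification built on that assumption, not the report's delay model. It ignores clocking overhead and residual path-length imbalance, so it yields a looser limit than the figures quoted above. The function name is hypothetical.

```python
def ideal_speedup_bound(ratio):
    """Upper bound on wave pipelining speedup when d_max / d_min == ratio.

    speedup = d_max / (d_max - d_min) = ratio / (ratio - 1),
    assuming no clocking overhead (an idealization).
    """
    if ratio <= 1.0:
        raise ValueError("ratio must exceed 1 for a finite bound")
    return ratio / (ratio - 1.0)


# 1.35 is the run-to-run max/min delay ratio quoted in the abstract.
bound = ideal_speedup_bound(1.35)
print(round(bound, 2))
```

The bound shrinks quickly as the ratio grows, so any additional source of variation that compounds with the process spread lowers the attainable speedup further.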
To counteract the effects of process variation, an adaptive
supply voltage technique has been developed. An on-chip
detector circuit determines when delays are faster than the
nominal delays and the power supply is lowered accordingly.
In this manner, ICs fabricated with fast processes are run at
a lower supply voltage to ensure correct operation at the
design target frequency.
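The adaptive supply idea can be sketched in software as a simple feedback loop. This is only an illustrative model, not the report's circuit: the delay formula is a generic alpha-power-law style approximation, and the threshold voltage, exponent, supply range, step size, and function names are all hypothetical values chosen for the example.

```python
def gate_delay(vdd, speed_factor, vt=0.7, alpha=1.5):
    """Hypothetical delay model in arbitrary units (not from the report).
    speed_factor < 1 models a chip from a fast process run."""
    return speed_factor * vdd / (vdd - vt) ** alpha


def adapt_supply(speed_factor, v_nominal=5.0, v_min=2.5, step=0.05):
    """Lower the supply in small steps while the chip stays at least as
    fast as the nominal design point."""
    target = gate_delay(v_nominal, 1.0)  # nominal delay at nominal supply
    vdd = v_nominal
    # The comparison below stands in for the on-chip delay detector.
    while vdd - step >= v_min and gate_delay(vdd - step, speed_factor) <= target:
        vdd -= step
    return vdd


print(adapt_supply(1.0))  # nominal chip: supply is left unchanged
print(adapt_supply(0.8))  # fast chip: supply is reduced
```

In this model a nominal chip keeps the nominal supply, while a fast chip settles at the lowest stepped voltage whose delay still does not exceed the nominal delay.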
To demonstrate that wave pipeline technology can be applied
to VLSI system design, a CMOS wave pipelined vector unit has
been developed. Wave pipelining was used extensively
to achieve high clock rates in the functional units. The VLSI
processor consists of a wave pipelined vector register file,
a wave pipelined adder, a wave pipelined multiplier, load and
store units, an instruction buffer, a scoreboard, and control
logic. The VLSI vector unit contains approximately 47000
transistors and occupies an area of 43 sq mm. It has been
fabricated in a 0.8-micron CMOS technology. Tests indicate
wave pipelined operation at a maximum rate of 303 MHz.
An equivalent vector unit using traditional
latch-based pipelining was also designed and simulated. The
latch-based design occupied 2% more die area, operated with a
35% longer clock period, and had multiply latency 8% longer
and add latency 11% longer than the wave pipelined vector
unit.
http://i.stanford.edu/pub/cstr/reports/csl/tr/96/693/CSL-TR-96-693.pdf