Report Number: CSL-TR-91-483
Institution: Stanford University, Computer Systems Laboratory
Title: Suggestions for implementing a fast IEEE multiply-add-fused instruction
Author: Quach, Nhon
Author: Flynn, Michael
Date: July 1991
Abstract: We studied three possible strategies to overlap the
operations in a floating-point add (FPA) and a floating-point
multiply (FPM) for implementing an IEEE multiply-add-fused
(MAF) instruction. Under the three strategies, the FPM and
FPA operations are (a) non-overlapped, (b) fully overlapped,
and (c) partially overlapped. The first strategy corresponds to
multiply-add-chained (MAC) widely used in vector processors.
The second (Greedy) strategy uses a greedy algorithm,
yielding an implementation similar to the IBM RS/6000 one.
The third and final (SNAP) strategy uses a less aggressive
starting configuration and corresponds to the SNAP
implementation. Here, an IEEE MAF is one that delivers the
same result as a separate IEEE FPM and FPA. Two observations
have prompted this study. First, in the IBM RS/6000
implementation, the design tradeoffs have been made for high
internal data precision, which facilitates the execution of
elementary functions. These tradeoff decisions, however, may
not be valid for an IEEE MAF. Second, the RS/6000
implementation assumed a different critical path for FPA and
FPM, which does not reflect the current state of the art in
FP technology. Using latency and hardware cost as the
performance metrics, we show that: (1) MAC has the lowest FPA
latency and consumes the least hardware, but its MAF latency
is the highest; (2) Greedy has a medium MAF latency but the
highest FPA latency; and (3) SNAP has the lowest MAF latency
and a slightly higher FPA latency than MAC, while consuming
an area comparable to that of Greedy. Both Greedy
and SNAP have higher design complexity arising from rounding
for the IEEE standard. SNAP also has additional wiring
complexity, which Greedy avoids because of its simpler
datapath. If rounding for the IEEE standard is not an issue,
the Greedy strategy, and therefore the RS/6000, seems
reasonable for applications with a high MAF-to-FPA ratio.
http://i.stanford.edu/pub/cstr/reports/csl/tr/91/483/CSL-TR-91-483.pdf
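
Note: The abstract's distinction between an IEEE MAF (same result as a separate FPM and FPA, i.e. two roundings) and a design that keeps high internal precision can be illustrated numerically. The sketch below is not taken from the report: the function names are hypothetical, and the single-rounding variant is modeled with exact rational arithmetic rather than any hardware datapath. It assumes finite double-precision inputs and round-to-nearest-even.

```python
from fractions import Fraction


def chained_multiply_add(a: float, b: float, c: float) -> float:
    """Two roundings: the product is rounded, then the sum is rounded.

    This matches the result of a separate IEEE multiply followed by a
    separate IEEE add.
    """
    product = a * b       # first rounding (to double precision)
    return product + c    # second rounding


def single_rounding_multiply_add(a: float, b: float, c: float) -> float:
    """One rounding: a*b + c is formed exactly, then rounded once.

    Assumption: inputs are finite. Exact rationals stand in for a wide
    internal datapath; converting the Fraction to float rounds to nearest.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the exact product 1 - 2**-60 is not representable
# as a double and rounds to 1.0 when the multiply is rounded separately.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print("chained (two roundings):", chained_multiply_add(a, b, c))
print("single rounding:        ", single_rounding_multiply_add(a, b, c))
```

For these inputs the chained version loses the low-order term of the product and returns zero, while the single-rounding version returns the small nonzero residual. An IEEE MAF in the abstract's sense has to reproduce the chained result.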