Report Number: CSL-TR-91-483
Institution: Stanford University, Computer Systems Laboratory
Title: Suggestions for implementing a fast IEEE multiply-add-fused instruction
Author: Quach, Nhon
Author: Flynn, Michael
Date: July 1991
Abstract: We studied three possible strategies for overlapping the operations in a floating-point add (FPA) and a floating-point multiply (FPM) to implement an IEEE multiply-add-fused (MAF) instruction. In these strategies, the operations in FPM and FPA are (a) non-overlapped, (b) fully overlapped, and (c) partially overlapped. The first strategy corresponds to the multiply-add-chained (MAC) scheme widely used in vector processors. The second (Greedy) strategy uses a greedy algorithm, yielding an implementation similar to that of the IBM RS/6000. The third (SNAP) strategy uses a less aggressive starting configuration and corresponds to the SNAP implementation. An IEEE MAF delivers the same result as that obtained via a separate IEEE FPM and FPA. Two observations prompted this study. First, in the IBM RS/6000 implementation, the design tradeoffs were made for high internal data precision, which facilitates the execution of elementary functions. These tradeoff decisions, however, may not be valid for an IEEE MAF. Second, the RS/6000 implementation assumed a different critical path for FPA and FPM, which does not reflect the current state of the art in FP technology. Using latency and hardware cost as the performance metrics, we show that (1) MAC has the lowest FPA latency and consumes the least hardware, but its MAF latency is the highest; (2) Greedy has a medium MAF latency but the highest FPA latency; and (3) SNAP has the lowest MAF latency and a slightly higher FPA latency than MAC, consuming an area comparable to that of Greedy. Both Greedy and SNAP have higher design complexity arising from rounding for the IEEE standard. SNAP has additional wire complexity, which Greedy does not have because of its simpler datapath. If rounding for the IEEE standard is not an issue, the Greedy strategy, and therefore the RS/6000, seems reasonable for applications with a high MAF-to-FPA ratio.
http://i.stanford.edu/pub/cstr/reports/csl/tr/91/483/CSL-TR-91-483.pdf
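
Note: The abstract defines an IEEE MAF as delivering the same result as a separate IEEE FPM followed by an IEEE FPA, and it treats rounding for the IEEE standard as a source of design complexity. The sketch below is a software illustration only, not the report's hardware design. It contrasts a chained multiply-add, which rounds twice, with a multiply-add that rounds the exact value of a*b + c once, as a unit with high internal precision might. The function names and the operand values are assumptions chosen for illustration, and the model assumes IEEE double precision with round-to-nearest-even, which is what CPython floats provide on common platforms.

```python
from fractions import Fraction


def chained_multiply_add(a: float, b: float, c: float) -> float:
    """Round the product to double, then round the sum (two roundings)."""
    product = a * b          # first rounding
    return product + c       # second rounding


def single_rounding_multiply_add(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly with rationals, then round once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)      # the only rounding step


# Hypothetical operands: the exact product is 1 - 2**-54, which is not
# representable in double precision and rounds to 1.0.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

chained = chained_multiply_add(a, b, c)
single = single_rounding_multiply_add(a, b, c)

print("chained (FPM then FPA):", chained)
print("single rounding:       ", single)
print("results agree:         ", chained == single)
```

For these operands the two results differ: the chained version loses the low-order bits of the product before the add, while the single-rounding version keeps them. Under the abstract's definition, an IEEE MAF would have to return the chained result.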