Report Number: CSL-TR-96-711
Institution: Stanford University, Computer Systems Laboratory
Title: Design Issues in High Performance Floating Point Arithmetic Units
Author: Oberman, Stuart Franklin
Date: December 1996
Abstract: In recent years, computer applications have increased in computational complexity. The industry-wide use of performance benchmarks, such as SPECmarks, forces processor designers to pay particular attention to the implementation of the floating point unit, or FPU. Special purpose applications, such as high performance graphics rendering systems, have placed further demands on processors, and high speed floating point hardware is a requirement to meet these increasing demands. This work examines the state of the art in FPU design and proposes techniques for improving the performance and the performance/area ratio of future FPUs. In recent FPUs, emphasis has been placed on designing ever-faster adders and multipliers, with division receiving less attention. The design space of FP dividers is large, comprising five classes of division algorithms: digit recurrence, functional iteration, very high radix, table look-up, and variable latency. Although division is an infrequent operation even in floating point intensive applications, it is shown that ignoring its implementation can degrade overall system performance. A high performance FPU requires a fast and efficient adder, multiplier, and divider. The design question becomes how best to implement the FPU to maximize performance within the constraints of silicon die area. The system performance and area impact of functional unit latency is examined for varying instruction issue rates in the context of the SPECfp92 application suite. Performance implications are investigated for shared multiplication hardware, shared square root, on-the-fly rounding and conversion, and fused functional units. Because of the importance of low latency FP addition, a variable latency FP addition algorithm has been developed that improves average addition latency by 33% while maintaining single-cycle throughput.
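Of the division classes named above, functional iteration underlies the quadratically converging dividers discussed later. A minimal Python sketch of Newton-Raphson reciprocal iteration illustrates the idea; the seed polynomial and iteration count are illustrative assumptions, not parameters taken from the report:

```python
def newton_reciprocal(b, iterations=3):
    """Approximate 1/b for a normalized significand b in [0.5, 1).

    Functional iteration: each Newton-Raphson step
        x_{i+1} = x_i * (2 - b * x_i)
    roughly doubles the number of correct bits (quadratic convergence),
    so a handful of iterations from a crude seed suffices.
    """
    assert 0.5 <= b < 1.0
    # Classic linear seed 48/17 - (32/17)*b, accurate to about 4 bits on [0.5, 1).
    x = 48.0 / 17.0 - (32.0 / 17.0) * b
    for _ in range(iterations):
        x = x * (2.0 - b * x)  # one multiply-accumulate pair per step
    return x
```

Starting from a roughly 4-bit seed, three iterations yield about 32 correct bits and a fourth reaches double precision, which is why functional-iteration dividers need so few steps compared with digit recurrence.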
To improve the performance and area of linearly converging division algorithms, an automated process is proposed for minimizing the complexity of SRT tables. To reduce the average latency of quadratically converging division algorithms, the technique of reciprocal caching is proposed, along with a method for reducing the latency penalty of exact rounding. Together, the proposed techniques provide a basis for future high performance floating point units.
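Reciprocal caching can be sketched in Python as follows; the cache organization (direct indexing by leading significand bits) and the chosen tag width are assumptions for illustration, not the report's design:

```python
class ReciprocalCache:
    """Cache recently computed reciprocals, keyed by the divisor's
    leading significand bits, so a later division by the same (or a
    nearby) value can reuse the stored reciprocal as a high-accuracy
    Newton-Raphson seed and skip the early iterations."""

    def __init__(self, index_bits=8):
        self.index_bits = index_bits  # assumed tag width
        self.table = {}

    def _key(self, b):
        # Top index_bits of a normalized significand b in [0.5, 1).
        return int(b * (1 << (self.index_bits + 1)))

    def lookup(self, b):
        return self.table.get(self._key(b))

    def insert(self, b, recip):
        self.table[self._key(b)] = recip


def cached_reciprocal(cache, b):
    """Return 1/b, using a cached seed when one is available."""
    seed = cache.lookup(b)
    if seed is None:
        # Cold miss: crude linear seed, several Newton-Raphson iterations.
        x, iters = 48.0 / 17.0 - (32.0 / 17.0) * b, 4
    else:
        # Hit: start from the cached reciprocal; one step refines it.
        x, iters = seed, 1
    for _ in range(iters):
        x = x * (2.0 - b * x)  # quadratically converging refinement
    cache.insert(b, x)
    return x
```

On a hit, the accurate cached seed lets the divider run a single refinement step instead of the full iteration sequence, which is how reciprocal caching lowers the average (though not worst-case) division latency.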