Report Number: CSL-TR-96-699
Institution: Stanford University, Computer Systems Laboratory
Title: Efficient Multiprocessor Communications: Networks, Algorithms, Simulation, and Implementation
Author: Lu, Yen-Wen
Date: July 1996
Abstract: As technology and processing power continue to improve, inter-processor communication becomes a performance bottleneck in a multiprocessor network. In this dissertation, an enhanced 2-D torus with segmented reconfigurable bus (SRB) to overcome the delay due to long distance communications was proposed and analyzed. A procedure of selecting an optimal segment length and segment alignment based on minimizing the lifetime of a packet and reducing the interaction between segments was developed to design a SRB network. Simulation shows that a torus with SRB is more than twice as efficient as a traditional torus. Efficient use of channel bandwidth is an important issue in improving network performance. The communication links between two adjacent nodes can be organized as a pair of opposite uni-directional channels, or combined into a single bi-directional channel. A modified channel arbitration scheme with hidden delay, called ``token-exchange,'' was designed for the bi-directional channel configuration. In spite of the overhead of channel arbitration, simulation shows that bi-directional channels have significantly better latency-throughput performance and can sustain higher data bandwidth relative to uni-directional channels of the same channel width. For example, under 2% hot-spot traffic, bi-directional channels can support 80% more bandwidth without saturation compared with uni-directional channels. An efficient, low power, wormhole data router chip for 2-D mesh and torus networks with bi-directional channels and token-exchange arbitration was designed and implemented. The token-exchange delay is fully hidden and no latency penalty occurs when there is no traffic contention; the token-exchange delay is also negligible when the contention is high. Distributed decoders and arbiters are provided for each of four IO ports, and a fully-connected 5x6 crossbar switch increases parallelism of data routing. The router also provides special hardware such as flexible header decoding and switching to support path-based multicasting. From measured results, multicasting with two destinations used only 1/3 of the energy required for unicasting. The wormhole router was fabricated using MOSIS/HP 0.6um technology. It delivers 1.6Gb/s (50MHz) @ Vdd=2.1V, consuming an average power of 15mW.
http://i.stanford.edu/pub/cstr/reports/csl/tr/96/699/CSL-TR-96-699.pdf