BIB-VERSION:: CS-TR-v2.0 ID:: STAN//CSL-TR-96-699 ENTRY:: September 09, 1996 ORGANIZATION:: Stanford University, Computer Systems Laboratory TITLE:: Efficient Multiprocessor Communications: Networks, Algorithms, Simulation, and Implementation TYPE:: Thesis TYPE:: Technical Report AUTHOR:: Lu, Yen-Wen DATE:: July 1996 PAGES:: 183 ABSTRACT:: As technology and processing power continue to improve, inter-processor communication becomes a performance bottleneck in a multiprocessor network. In this dissertation, an enhanced 2-D torus with segmented reconfigurable bus (SRB) to overcome the delay due to long distance communications was proposed and analyzed. A procedure of selecting an optimal segment length and segment alignment based on minimizing the lifetime of a packet and reducing the interaction between segments was developed to design a SRB network. Simulation shows that a torus with SRB is more than twice as efficient as a traditional torus. Efficient use of channel bandwidth is an important issue in improving network performance. The communication links between two adjacent nodes can be organized as a pair of opposite uni-directional channels, or combined into a single bi-directional channel. A modified channel arbitration scheme with hidden delay, called ``token-exchange,'' was designed for the bi-directional channel configuration. In spite of the overhead of channel arbitration, simulation shows that bi-directional channels have significantly better latency-throughput performance and can sustain higher data bandwidth relative to uni-directional channels of the same channel width. For example, under 2% hot-spot traffic, bi-directional channels can support 80% more bandwidth without saturation compared with uni-directional channels. An efficient, low power, wormhole data router chip for 2-D mesh and torus networks with bi-directional channels and token-exchange arbitration was designed and implemented. The token-exchange delay is fully hidden and no latency penalty occurs when there is no traffic contention; the token-exchange delay is also negligible when the contention is high. Distributed decoders and arbiters are provided for each of four IO ports, and a fully-connected 5x6 crossbar switch increases parallelism of data routing. The router also provides special hardware such as flexible header decoding and switching to support path-based multicasting. From measured results, multicasting with two destinations used only 1/3 of the energy required for unicasting. The wormhole router was fabricated using MOSIS/HP 0.6um technology. It delivers 1.6Gb/s (50MHz) @ Vdd=2.1V, consuming an average power of 15mW. NOTES:: [Adminitrivia V1/Prg/19960909] END:: STAN//CSL-TR-96-699