Report Number: CSL-TR-96-699
Institution: Stanford University, Computer Systems Laboratory
Title: Efficient Multiprocessor Communications: Networks,
Algorithms, Simulation, and Implementation
Author: Lu, Yen-Wen
Date: July 1996
Abstract: As technology and processing power continue to improve,
inter-processor communication becomes a performance
bottleneck in a multiprocessor network. In this dissertation,
an enhanced 2-D torus with a segmented reconfigurable bus
(SRB) was proposed and analyzed to overcome the delay of
long-distance communication. A procedure for designing an SRB
network was developed; it selects the optimal segment length
and segment alignment by minimizing the lifetime of a packet
and reducing the interaction between segments. Simulation
shows that a torus with SRB is more than twice as efficient
as a traditional torus.
Efficient use of channel bandwidth is an important issue in
improving network performance. The communication links
between two adjacent nodes can be organized as a pair of
opposite uni-directional channels, or combined into a single
bi-directional channel. A modified channel arbitration scheme
with hidden delay, called "token-exchange," was designed
for the bi-directional channel configuration. In spite of the
overhead of channel arbitration, simulation shows that
bi-directional channels have significantly better
latency-throughput performance and can sustain higher data
bandwidth relative to uni-directional channels of the same
channel width. For example, under 2% hot-spot traffic,
bi-directional channels can support 80% more bandwidth
without saturation than uni-directional channels.
An efficient, low-power wormhole data router chip for 2-D
mesh and torus networks with bi-directional channels and
token-exchange arbitration was designed and implemented. The
token-exchange delay is fully hidden and no latency penalty
occurs when there is no traffic contention; the
token-exchange delay is also negligible when the contention
is high. Distributed decoders and arbiters are provided for
each of the four I/O ports, and a fully-connected 5x6 crossbar
switch increases the parallelism of data routing. The router also
provides special hardware such as flexible header decoding
and switching to support path-based multicasting. Measured
results show that multicasting with two destinations used
only 1/3 of the energy required for unicasting. The wormhole
router was fabricated using MOSIS/HP 0.6 um technology. It
delivers 1.6 Gb/s at 50 MHz with Vdd = 2.1 V, consuming an
average power of 15 mW.
http://i.stanford.edu/pub/cstr/reports/csl/tr/96/699/CSL-TR-96-699.pdf
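The router figures quoted in the abstract (1.6 Gb/s, 50 MHz, 15 mW)
can be combined into a few derived quantities. The sketch below is
only back-of-the-envelope arithmetic, not a result from the report.
It assumes that 1.6 Gb/s is the aggregate data rate at the stated
clock and that 15 mW is the average power at that same operating
point; the abstract does not say how the bits are split across
ports. The function and constant names are hypothetical.

```python
# Figures taken from the abstract.
THROUGHPUT_BPS = 1.6e9   # 1.6 Gb/s delivered data rate
CLOCK_HZ = 50e6          # 50 MHz clock
AVG_POWER_W = 15e-3      # 15 mW average power


def bits_per_cycle(throughput_bps: float, clock_hz: float) -> float:
    """Bits moved per clock cycle (assumes the rate is aggregate)."""
    return throughput_bps / clock_hz


def energy_per_bit_pj(power_w: float, throughput_bps: float) -> float:
    """Average energy per delivered bit, in picojoules."""
    return power_w / throughput_bps * 1e12


def energy_per_cycle_pj(power_w: float, clock_hz: float) -> float:
    """Average energy per clock cycle, in picojoules."""
    return power_w / clock_hz * 1e12


if __name__ == "__main__":
    print("bits per cycle:", bits_per_cycle(THROUGHPUT_BPS, CLOCK_HZ))
    print("energy per bit (pJ):",
          energy_per_bit_pj(AVG_POWER_W, THROUGHPUT_BPS))
    print("energy per cycle (pJ):",
          energy_per_cycle_pj(AVG_POWER_W, CLOCK_HZ))
```

Under these assumptions the router moves 32 bits per clock cycle
and spends roughly 9.4 pJ per delivered bit. Whether the 32 bits
correspond to one wide channel or to several narrower ports working
in parallel cannot be determined from the abstract alone.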