Report Number: CSL-TR-97-744
Institution: Stanford University, Computer Systems Laboratory
Title: The FLASH Multiprocessor: Designing a Flexible and Scalable
System
Author: Kuskin, Jeffrey Scott
Date: November 1997
Abstract: The choice of a communication paradigm, or protocol, is
central to the design of a large-scale multiprocessor system.
Unlike traditional multiprocessors, the FLASH machine uses a
programmable node controller, called MAGIC, to implement all
protocol processing. The architecture of the MAGIC chip
allows FLASH to support multiple communication paradigms - in
particular, cache-coherent shared memory and high-performance
message passing - while minimizing both hardware and software
overhead. Each node in FLASH contains a microprocessor, a
portion of the machine's global memory, a port to the
interconnection network, an I/O interface, and MAGIC, the
custom node controller. The MAGIC chip handles all
communication both within the node and among nodes, using
hardwired data paths for efficient data movement and a
programmable processor optimized for executing protocol
operations. The result is a system that is flexible and
scalable, yet competitive in performance with a traditional
multiprocessor that implements a single communication
paradigm completely in hardware.
The focus of this dissertation is the architecture, design,
and performance of FLASH. Much of the motivation behind the
FLASH system and the MAGIC node controller design stems from
an examination of the characteristics of protocol code and
the architecture of the DASH system, the predecessor to
FLASH. This examination led to two major design goals:
developing a node controller architecture that can attain
high protocol processing performance while still maintaining
flexibility, and reducing the logic and memory overheads
associated with cache coherence. The MAGIC design
achieves these goals by implementing on a single chip a
programmable protocol engine with an instruction set
optimized for the characteristics of protocol code, along
with dedicated support logic to alleviate the most serious
protocol processing performance bottlenecks - data movement,
message dispatch, and lack of close coupling to the node
board components. The design of the FLASH node complements
the MAGIC design, matching the close coupling and high
bandwidth support in MAGIC to provide a balanced node
architecture.
Next, the dissertation investigates the performance of
cache coherence on FLASH. Performance results are presented
from microbenchmarks run on the Verilog RTL of the MAGIC chip
and from complete applications run on FlashLite, the FLASH
system-level simulator. The microbenchmarks demonstrate that
the architectural extensions added to the MAGIC design -
particularly the instruction set optimizations to the
programmable protocol processor - yield significantly lower
latencies and protocol processor occupancies to service the
most common types of memory operations.
The application results are used to evaluate the performance
costs of flexibility by comparing the performance of FLASH to
that of a hardwired machine on representative parallel
applications and multiprogramming workloads. These results
show that poor application memory reference or load balancing
characteristics cause the performance of the FLASH system to
degrade more rapidly than the performance of the hardwired
system; that is, FLASH's performance is less robust. For
applications that incur a large number of remote misses or
exhibit substantial hot-spotting, the increased remote access
latencies or the occupancy of MAGIC leads to lower performance
for the flexible design.
Overall, however, the performance of FLASH can be competitive
with the performance of the hardwired machine. Specifically,
for a range of optimized parallel applications, the
performance differences between the hardwired machine and
FLASH are small, typically less than 10% at 32 processors and
less than 15% at 64 processors. For these programs, either
the processor cache miss rates are small or the latency of
the programmable protocol processing can be hidden behind the
memory access time.
http://i.stanford.edu/pub/cstr/reports/csl/tr/97/744/CSL-TR-97-744.pdf