Report Number: CSL-TR-95-685
Institution: Stanford University, Computer Systems Laboratory
Title: Memory Consistency Models for Shared-Memory Multiprocessors
Author: Gharachorloo, Kourosh
Date: December 1995
Abstract: The memory consistency model for a shared-memory
multiprocessor specifies the behavior of memory with respect
to read and write operations from multiple processors. As
such, the memory model influences many aspects of system
design, including the design of programming languages,
compilers, and the underlying hardware. Relaxed models that
impose fewer memory ordering constraints offer the potential
for higher performance by allowing hardware and software to
overlap and reorder memory operations. However, fewer
ordering guarantees can compromise programmability and
portability. Many of the previously proposed models either
fail to provide reasonable programming semantics or are
biased toward programming ease at the cost of sacrificing
performance. Furthermore, the lack of consensus on an
acceptable model hinders software portability across
different systems.
This dissertation focuses on providing a balanced solution
that directly addresses the trade-off between programming
ease and performance. To address programmability, we propose
an alternative method for specifying memory behavior that
presents a higher-level abstraction to the programmer. We
show that with only a few types of information supplied by
the programmer, an implementation can exploit the full range
of optimizations enabled by previous models. Furthermore, the
same information enables automatic and efficient portability
across a wide range of implementations.
To expose the optimizations enabled by a model, we have
developed a formal framework for specifying the low-level
ordering constraints that must be enforced by an
implementation. Based on these specifications, we present a
wide range of architecture and compiler implementation
techniques for efficiently supporting a given model. Finally,
we evaluate the performance benefits of exploiting relaxed
models based on detailed simulations of realistic parallel
applications. Our results show that the optimizations enabled
by relaxed models are extremely effective in hiding virtually
the full latency of writes in architectures with blocking
reads (i.e., processor stalls on reads), with gains as high
as 80%. Architectures with non-blocking reads can further
exploit relaxed models to hide a substantial fraction of the
read latency as well, leading to a larger overall performance
benefit. Furthermore, these optimizations complement gains
from other latency-hiding techniques such as prefetching and
multiple contexts.
We believe that the combined benefits in hardware and
software will make relaxed models universal in future
multiprocessors, as is already evidenced by their adoption in
several commercial systems.
http://i.stanford.edu/pub/cstr/reports/csl/tr/95/685/CSL-TR-95-685.pdf