BIB-VERSION:: CS-TR-v2.0
          ID:: STAN//CSL-TR-99-789
       ENTRY:: November 16, 1999
ORGANIZATION:: Stanford University, Computer Systems Laboratory
       TITLE:: Flexible Use of Memory for Replication/Migration in 
Cache-Coherent DSM Multiprocessors
        TYPE:: Thesis
        TYPE:: Technical Report
      AUTHOR:: Soundararajan, Vijayaraghavan
        DATE:: November 1999
       PAGES:: 143
    ABSTRACT:: Shared-memory multiprocessors are increasingly being used 
as compute servers. These systems enable efficient
usage of computing resources through the aggregation and tight coupling of 
CPUs, memory, and I/O. One popular design for such machines is a bus-based 
architecture. However, as processors get faster, the shared bus becomes a
bandwidth bottleneck. CC-NUMA (Cache-Coherent with Non-Uniform Memory Access time) 
machines remove this architectural limitation and provide a scalable shared-memory 
architecture. One significant characteristic of the CC-NUMA architecture is that 
the latency to access remote data is considerably higher than the latency to access
local data. On such machines, good data locality can reduce memory stall time and 
is therefore critical for high performance.

In this thesis we study the various options available to system designers to 
transparently decrease the fraction of data misses serviced remotely. This work 
is done in the context of the Stanford FLASH multiprocessor. We use the
programmability of the FLASH memory controller to explore a number of techniques 
for improving data locality: base cache coherence (CC); a Remote Access Cache 
(RAC), in which a portion of local memory is used to cache
remotely-allocated data at cache-line granularity; a Cache-Only Memory 
Architecture (COMA-F), in which all of local memory is used as a cache under 
hardware control; and OS-assisted page migration/replication (MigRep), in
which the operating system migrates or replicates pages according to observed 
cache miss patterns. We then propose a novel hybrid scheme, MIGRAC, that combines 
the benefits of RAC and MigRep. We evaluate complete implementations of these 
schemes on the same platform using compute-server workloads (including OS effects), 
thereby providing a more consistent and detailed evaluation than has been done before.

We find that a simple RAC can improve performance significantly over CC 
(gains of up to 64%). COMA-F improves locality, but its additional complexity limits 
its gains versus CC (an improvement of only 14%). MigRep performs well (gains of up to 33%) 
but does not handle fine-grain sharing as effectively as RAC or COMA-F. Finally, 
our MIGRAC approach performs well relative to RAC (up to 57% faster) and MigRep 
(up to 24% faster) and is robust.
         END:: STAN//CSL-TR-99-789