BIB-VERSION:: CS-TR-v2.0 ID:: STAN//CSL-TR-99-789 ENTRY:: November 16, 1999 ORGANIZATION:: Stanford University, Computer Systems Laboratory TITLE:: Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors TYPE:: Thesis TYPE:: Technical Report AUTHOR:: Soundararajan, Vijayaraghavan DATE:: November 1999 PAGES:: 143 ABSTRACT:: Shared-memory multiprocessors are being used increasingly as compute servers. These systems enable efficient usage of computing resources through the aggregation and tight coupling of CPUs, memory, and I/O. One popular design for such machines is a bus-based architecture. However, as processors get faster, the shared bus becomes a bandwidth bottleneck. CC-NUMA (Cache-Coherent with Non-Uniform Memory Access time) machines remove this architectural limitation and provide a scalable shared-memory architecture. One significant characteristic of the CC-NUMA architecture is that the latency to access remote data is considerably larger than the latency to access local data. On such machines, good data locality can reduce memory stall time and is therefore critical for high performance. In this thesis we study the various options available to system designers to transparently decrease the fraction of data misses serviced remotely. This work is done in the context of the Stanford FLASH multiprocessor. We utilize the programmability of the FLASH memory controller to explore a number of techniques for improving data locality: base cache-coherence (CC); a Remote Access Cache (RAC), in which a portion of local memory is used to cache remotely-allocated data at cache-line granularity; a Cache-Only Memory Architecture (COMA-F), in which all of local memory is used as a cache under hardware control; and OS-assisted page migration/replication (MigRep), in which the operating system migrates or replicates pages according to observed cache miss patterns. We then propose a novel hybrid scheme, MIGRAC, that combines the benefits of RAC and MigRep. We evaluate complete implementations of these schemes on the same platform using compute-server workloads (including OS effects), thereby providing a more consistent and detailed evaluation than has been done before. We find that a simple RAC can improve performance significantly over CC (up to 64% gains). COMA-F improves locality but its additional complexity limits its gains versus CC (only 14% improvement). MigRep performs well (up to 33% gains) but does not handle fine-grain sharing as effectively as RAC or COMA-F. Finally, our MIGRAC approach performs well relative to RAC (up to 57% faster) and MigRep (up to 24% faster) and is robust. END:: STAN//CSL-TR-99-789