Report Number: CSL-TR-92-550
Institution: Stanford University, Computer Systems Laboratory
Title: Cache Coherence Directories for Scalable Multiprocessors
Author: Simoni, Richard
Date: October 1992
Abstract: Directory-based protocols have been proposed as an efficient means of implementing cache coherence in large-scale shared-memory multiprocessors. This thesis explores the trade-offs in the design of cache coherence directories by examining the organization of the directory information, the options in the design of the coherency protocol, and the implementation of the directory and protocol. The traditional directory organization that maintains a full valid bit vector per directory entry is unsuitable for large-scale machines due to high storage overhead. This thesis proposes several alternate organizations. Limited pointers directories replace the bit vector with several pointers that indicate those caches containing the data. Although this scheme performs well across a wide range of workloads, its performance does not improve as the read/write ratio becomes very large. To address this drawback, a dynamic pointer allocation directory is proposed. This directory allocates pointers from a pool to particular memory blocks as they are needed. Since the pointers may be allocated to any block on the memory module, the probability of running short of pointers is very small. Among the set of possible organizations, dynamic pointer allocation lies at an attractive cost/performance point. Measuring the performance impact of three coherency protocol features makes the virtues of simplicity clear. Adding a clean/exclusive state to reduce the time required to write a clean block results in only modest performance improvement. Using request forwarding to transfer a dirty block directly to another cache that has requested it yields similar results. For small cache block sizes, write hits to clean blocks can be simply treated as write misses without incurring significant extra network traffic. Protocol features designed to improve performance must be examined carefully, for they often complicate the protocol without offering substantial benefit.
Implementing directory-based coherency presents several challenges. Methods are described for preventing deadlock, maintaining a model of parallel execution, handling subtle situations caused by temporary inconsistencies between cache and directory state, and tolerating out-of-order message delivery. Using these techniques, cache coherence can be added to large-scale multiprocessors in an inexpensive yet effective manner.
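
Illustrative note: the dynamic pointer allocation idea summarized in the abstract (pointers drawn from a per-module pool and attached to whichever memory blocks need them) can be sketched roughly as follows. This is a minimal toy model and not the thesis's implementation: the class name DynamicPointerDirectory, its method names, the linked-list layout of the pool, and the policy of simply reporting failure when the pool is empty are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


class DynamicPointerDirectory:
    """Toy directory for one memory module (hypothetical names throughout).

    Sharer pointers are not reserved per block; they are taken from a
    shared pool and chained into a per-block list as they are needed.
    """

    def __init__(self, pool_size: int) -> None:
        # Each pool slot stores a cache id and the index of the next slot.
        self.cache_id: List[Optional[int]] = [None] * pool_size
        # Initially every slot is on the free list: 0 -> 1 -> ... -> None.
        self.next: List[Optional[int]] = [i + 1 for i in range(pool_size)]
        if pool_size:
            self.next[-1] = None
        self.free_head: Optional[int] = 0 if pool_size else None
        # Maps a block address to the first slot of its sharer list.
        self.head: Dict[int, int] = {}

    def sharers(self, block: int) -> List[int]:
        """Walk the block's pointer list and return the caches holding it."""
        out: List[int] = []
        slot = self.head.get(block)
        while slot is not None:
            out.append(self.cache_id[slot])
            slot = self.next[slot]
        return out

    def add_sharer(self, block: int, cache: int) -> bool:
        """Record that `cache` holds `block`; False if the pool is empty."""
        if cache in self.sharers(block):
            return True
        if self.free_head is None:
            # Assumption: a real design would reclaim a pointer here
            # (for example by invalidating some cached copy).
            return False
        slot = self.free_head
        self.free_head = self.next[slot]
        self.cache_id[slot] = cache
        self.next[slot] = self.head.get(block)  # push onto the block's list
        self.head[block] = slot
        return True

    def invalidate_all(self, block: int) -> List[int]:
        """On a write: list caches to invalidate and free their pointers."""
        victims: List[int] = []
        slot = self.head.pop(block, None)
        while slot is not None:
            victims.append(self.cache_id[slot])
            following = self.next[slot]
            self.cache_id[slot] = None
            self.next[slot] = self.free_head  # return slot to the free list
            self.free_head = slot
            slot = following
        return victims
```

In this sketch a widely read block can consume many pointers while rarely shared blocks consume none, which is the property the abstract credits for making pointer shortage unlikely. Freeing a block's pointers on a write lets other blocks on the same module reuse them.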