Report Number: CSL-TR-99-785
Institution: Stanford University, Computer Systems Laboratory
Title: Checkpointing Apparatus and Algorithms for Fault-Tolerant Tightly-Coupled Multiprocessors
Author: Sunada, Dwight
Date: July 1999
Abstract: The apparatus and algorithms for establishing checkpoints on a tightly-coupled multiprocessor (TCMP) fall naturally into three broad classes: tightly synchronized method, loosely synchronized method, and unsynchronized method. The algorithms in the class of the tightly synchronized method force the immediate establishment of a checkpoint whenever a dependency between two processors arises. The algorithms in the class of the loosely synchronized method record this dependency and, hence, do not require the immediate establishment of a checkpoint if a dependency does arise; when a processor chooses to establish a checkpoint, the processor will query the dependency records to determine other processors that must also establish a checkpoint. The algorithms in the class of the unsynchronized method allow a processor to establish a checkpoint without regard to any other processor. Within this framework, we develop four apparatus and algorithms: distributed recoverable shared memory (DRSM), DRSM for communication checkpoints (DRSM-C), DRSM with half of the memory (DRSM-H), and DRSM with logs (DRSM-L). DRSM-C is an algorithm in the class of the tightly synchronized method, and DRSM and DRSM-H are algorithms in the class of the loosely synchronized method. DRSM-L is an algorithm in the class of the unsynchronized method and is the first of its kind for a TCMP. DRSM-L has the best performance in terms of minimizing the impact of establishing checkpoints (or logs) on the running applications and has the least expensive hardware.
http://i.stanford.edu/pub/cstr/reports/csl/tr/99/785/CSL-TR-99-785.pdf