Report Number: CSL-TR-99-785
Institution: Stanford University, Computer Systems Laboratory
Title: Checkpointing Apparatus and Algorithms for Fault-Tolerant
Tightly-Coupled Multiprocessors
Author: Sunada, Dwight
Date: July 1999
Abstract: The apparatus and algorithms for establishing checkpoints on
a tightly-coupled multiprocessor (TCMP) fall naturally into
three broad classes: tightly synchronized method, loosely
synchronized method, and unsynchronized method. The
algorithms in the class of the tightly synchronized method
force the immediate establishment of a checkpoint whenever a
dependency between two processors arises. The algorithms in
the class of the loosely synchronized method record this
dependency and, hence, do not require the immediate
establishment of a checkpoint if a dependency does arise;
when a processor chooses to establish a checkpoint, the
processor will query the dependency records to determine
other processors that must also establish a checkpoint. The
algorithms in the class of the unsynchronized method allow a
processor to establish a checkpoint without regard to any
other processor. Within this framework, we develop four
apparatus and algorithms: distributed recoverable shared
memory (DRSM), DRSM for communication checkpoints (DRSM-C),
DRSM with half of the memory (DRSM-H), and DRSM with logs
(DRSM-L). DRSM-C is an algorithm in the class of the tightly
synchronized method, and DRSM and DRSM-H are algorithms in
the class of the loosely synchronized method. DRSM-L is an
algorithm in the class of the unsynchronized method and is
the first of its kind for a TCMP. DRSM-L has the best
performance in terms of minimizing the impact of establishing
checkpoints (or logs) on the running applications and has the
least expensive hardware.
http://i.stanford.edu/pub/cstr/reports/csl/tr/99/785/CSL-TR-99-785.pdf