BIB-VERSION:: CS-TR-v2.0 ID:: STAN//CSL-TR-99-785 ENTRY:: August 6, 1999 ORGANIZATION:: Stanford University, Computer Systems Laboratory TITLE:: Checkpointing Apparatus and Algorithms for Fault-Tolerant Tightly-Coupled Multiprocessors TYPE:: Thesis TYPE:: Technical Report AUTHOR:: Sunada, Dwight DATE:: July 1999 PAGES:: 146 ABSTRACT:: The apparatus and algorithms for establishing checkpoints on a tightly-coupled multiprocessor (TCMP) fall naturally into three broad classes: tightly synchronized method, loosely synchronized method, and unsynchronized method. The algorithms in the class of the tightly synchronized method force the immediate establishment of a checkpoint whenever a dependency between two processors arises. The algorithms in the class of the loosely synchronized method record this dependency and, hence, do not require the immediate establishment of a checkpoint if a dependency does arise; when a processor chooses to establish a checkpoint, the processor will query the dependency records to determine other processors that must also establish a checkpoint. The algorithms in the class of the unsynchronized method allow a processor to establish a checkpoint without regard to any other processor. Within this framework, we develop four apparatus and algorithms: distributed recoverable shared memory (DRSM), DRSM for communication checkpoints (DRSM-C), DRSM with half of the memory (DRSM-H), and DRSM with logs (DRSM-L). DRSM-C is an algorithm in the class of the tightly synchronized method, and DRSM and DRSM-H are algorithms in the class of the loosely synchronized method. DRSM-L is an algorithm in the class of the unsynchronized method and is the first of its kind for a TCMP. DRSM-L has the best performance in terms of minimizing the impact of establishing checkpoints (or logs) on the running applications and has the least expensive hardware. NOTES:: [Adminitrivia V1/Prg/19990803] END:: STAN//CSL-TR-99-785