BIB-VERSION:: CS-TR-v2.0
ID:: STAN//CSL-TR-99-776
ENTRY:: February 11, 1999
ORGANIZATION:: Stanford University, Computer Systems Laboratory
TITLE:: Novel Checkpointing Algorithm for Fault Tolerance on a Tightly-Coupled Multiprocessor
TYPE:: Technical Report
AUTHOR:: Sunada, Dwight
AUTHOR:: Glasco, David
AUTHOR:: Flynn, Michael
DATE:: January 1999
PAGES:: 52
ABSTRACT:: The tightly-coupled multiprocessor (TCMP), where specialized hardware maintains the image of a single shared memory, offers the highest performance in a computer system. In order to deploy a TCMP in the commercial world, the TCMP must be fault tolerant. Researchers have designed various checkpointing algorithms to implement fault tolerance in a TCMP. To date, these algorithms fall into 2 principal classes, where processors can be checkpoint dependent on each other. We introduce a new apparatus and algorithm that represents a 3rd class of checkpointing scheme. Our algorithm is distributed recoverable shared memory with logs (DRSM-L) and is the first of its kind for TCMPs. DRSM-L has the desirable property that a processor can establish a checkpoint or roll back to the last checkpoint in a manner that is independent of any other processor. In this paper, we describe DRSM-L, show the optimal value of its principal design parameter, and present results indicating its performance under simulation.
NOTES:: [Adminitrivia V1/Prg/19990211]
END:: STAN//CSL-TR-99-776