Report Number: CSL-TR-97-712
Institution: Stanford University, Computer Systems Laboratory
Title: Hive: Operating System Fault Containment for Shared-Memory Multiprocessors
Author: Chapin, John
Date: July 1997
Abstract: Reliability and scalability are major concerns when designing general-purpose operating systems for large-scale shared-memory multiprocessors. This dissertation describes Hive, an operating system with a novel kernel architecture that addresses these issues. Hive is structured as an internal distributed system of independent kernels called cells. This architecture improves reliability because a hardware or software error damages only one cell rather than the whole system. The architecture improves scalability because few kernel resources are shared by processes running on different cells. The Hive prototype is a complete implementation of UNIX SVR4 and is targeted to run on the Stanford FLASH multiprocessor. The research described in the dissertation makes three primary contributions: (1) it demonstrates that distributed system mechanisms can be used to provide fault containment inside a shared-memory multiprocessor; (2) it provides a specification for a set of hardware features, implemented in the Stanford FLASH, that are sufficient to support fault containment; and (3) it demonstrates how to take advantage of shared-memory hardware across cell boundaries at both application and kernel levels while preserving fault containment. The dissertation also analyzes the architectural and performance tradeoffs of multicellular kernels. Fault injection experiments conducted using the SimOS machine simulator demonstrate the reliability of the Hive prototype. Studies using both general-purpose and scientific workloads illustrate the performance tradeoffs of the multicellular kernel architecture.