Report Number: CSL-TR-97-712
Institution: Stanford University, Computer Systems Laboratory
Title: Hive: Operating System Fault Containment for Shared-Memory
Multiprocessors
Author: Chapin, John
Date: July 1997
Abstract: Reliability and scalability are major concerns when designing
general-purpose operating systems for large-scale
shared-memory multiprocessors. This dissertation describes
Hive, an operating system with a novel kernel architecture
that addresses these issues. Hive is structured as an
internal distributed system of independent kernels called
cells. This architecture improves reliability because a
hardware or software error damages only one cell rather than
the whole system. The architecture improves scalability
because few kernel resources are shared by processes running
on different cells. The Hive prototype is a complete
implementation of UNIX SVR4 and is targeted to run on the
Stanford FLASH multiprocessor.
The research described in the dissertation makes three
primary contributions: (1) it demonstrates that distributed
system mechanisms can be used to provide fault containment
inside a shared-memory multiprocessor; (2) it provides a
specification for a set of hardware features, implemented in
the Stanford FLASH, that are sufficient to support fault
containment; and (3) it demonstrates how to take advantage of
shared-memory hardware across cell boundaries at both
application and kernel levels while preserving fault
containment. The dissertation also analyzes the architectural
and performance tradeoffs of multicellular kernels.
Fault injection experiments conducted using the SimOS machine
simulator demonstrate the reliability of the Hive prototype.
Studies using both general-purpose and scientific workloads
illustrate the performance tradeoffs of the multicellular
kernel architecture.
http://i.stanford.edu/pub/cstr/reports/csl/tr/97/712/CSL-TR-97-712.pdf