Distributed Top-K Monitoring
Brian Babcock and Chris Olston
Abstract
The querying and analysis of data streams has been a topic of much
recent interest, motivated by applications from the fields of
networking, web usage analysis, sensor instrumentation,
telecommunications, and others. Many of these applications involve
monitoring answers to continuous queries over data streams produced at
physically distributed locations, and most previous approaches require
streams to be transmitted to a single location for centralized
processing. Unfortunately, the continual transmission of a large
number of rapid data streams to a central location can be impractical
or expensive. We study a useful class of queries that continuously
report the k largest values obtained from distributed data streams
("top-k monitoring queries"), which are of particular interest
because they can be used to reduce the overhead incurred while running
other types of monitoring queries. We show that transmitting entire
data streams is unnecessary to support these queries and present an
alternative approach that reduces communication significantly. In our
approach, arithmetic constraints are maintained at remote stream
sources to ensure that the most recently provided top-k answer
remains valid to within a user-specified error tolerance. Distributed
communication is only necessary on occasion, when constraints are
violated, and we show empirically through extensive simulation on
real-world data that our approach reduces overall communication cost
by an order of magnitude compared with alternatives that
offer the same error guarantees.
Conference Paper (SIGMOD 2003): [PS], [PDF].
Citation: [BibTeX]
Extended Version: [PS], [PDF]
TRAPP Project Web Page: [HTML]