Project Staccato

Introduction

The mass digitization of books, printed documents and forms is changing the types of data that enterprises and academics manage. The current approach is to convert the scanned images to plain text using OCR, and to manage the text with an RDBMS or custom applications. However, OCR conversion is often error-prone. As a result, queries over the resulting text may miss answers, which lowers answer quality (specifically, recall) for applications. State-of-the-art OCR tools, e.g., the OCR powering Google Books, produce a probabilistic model, called a Finite State Transducer (FST), to capture all possible conversions. This uncertainty information is discarded only at the end, to produce plain text. Thus, if we instead manage the uncertainty in the OCR data using these probabilistic models, the answer quality can be improved.

What Staccato Is

Staccato is a system that enables applications to manage probabilistic OCR data inside an RDBMS as if it were regular text. It allows SQL queries with regular expressions over the OCR data and manages the transducers using probabilistic relational database semantics.
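
To make the idea of a regular-expression query with probabilistic semantics concrete, here is a minimal Python sketch. It is not Staccato's interface or implementation. It assumes a toy model in which each position of a scanned word has independent candidate readings, which is a much simpler structure than a real OCR transducer, and it enumerates every possible string by brute force. The data and all function names are hypothetical.

```python
import itertools
import re

# Hypothetical toy OCR output for a scan of the word "Ford". Each position
# lists candidate readings with probabilities (assumed independent).
ocr_line = [
    {"F": 0.45, "P": 0.55},
    {"o": 0.7, "0": 0.3},
    {"r": 1.0},
    {"d": 0.9, "cl": 0.1},
]

def possible_strings(line):
    """Yield every (string, probability) pair the toy model can emit."""
    for combo in itertools.product(*(pos.items() for pos in line)):
        text = "".join(reading for reading, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield text, prob

def most_likely_string(line):
    """The single best reading, i.e. what a plain-text pipeline would keep."""
    return max(possible_strings(line), key=lambda sp: sp[1])[0]

def match_probability(line, pattern):
    """Total probability of all possible strings that match the pattern."""
    regex = re.compile(pattern)
    return sum(p for s, p in possible_strings(line) if regex.search(s))

if __name__ == "__main__":
    best = most_likely_string(ocr_line)
    print("plain text kept:", best)
    print("plain text matches 'Ford':", bool(re.search(r"Ford", best)))
    print("P(match 'Ford'):", round(match_probability(ocr_line, r"Ford"), 4))
```

In this toy example the most likely reading is a misrecognition, so a query for Ford over the plain text returns nothing, while the probabilistic query still reports the document with a non-zero match probability. Brute-force enumeration is exponential in the length of the text, so it only serves to illustrate the semantics.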

What Staccato Does

A key challenge that Staccato addresses is that OCR transducers, while giving high answer quality, can be very large and can slow down query processing by up to 1000x. That cost is too extreme for many applications, whose quality and performance requirements fall somewhere between plain text (fast, but lower recall) and full transducers (higher recall, but slow). Thus, the central component of Staccato is a novel approximation technique that intelligently reduces the amount of data in the transducers. Its parameters act as a knob that trades off between two extremes: high answer quality with low query performance, and high query performance with low answer quality. Furthermore, Staccato provides a way to use inverted indexing over the approximated transducers.
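
To illustrate the kind of trade-off such a knob exposes, here is a minimal Python sketch. It is not Staccato's approximation technique, which is described in the paper. It swaps in a much simpler one: keeping only the k most probable strings of a toy model with independent positions. The toy data, the parameter k and all function names are assumptions made for this illustration.

```python
import heapq
import itertools
import re

# Hypothetical toy OCR output for a scan of the word "Ford". Each position
# lists candidate readings with probabilities (assumed independent).
ocr_line = [
    {"F": 0.45, "P": 0.55},
    {"o": 0.7, "0": 0.3},
    {"r": 1.0},
    {"d": 0.9, "cl": 0.1},
]

def possible_strings(line):
    """Yield every (string, probability) pair the toy model can emit."""
    for combo in itertools.product(*(pos.items() for pos in line)):
        text = "".join(reading for reading, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield text, prob

def keep_top_k(line, k):
    """Approximate the model by storing only its k most probable strings."""
    return heapq.nlargest(k, possible_strings(line), key=lambda sp: sp[1])

def match_probability(strings, pattern):
    """Total probability of the stored strings that match the pattern."""
    regex = re.compile(pattern)
    return sum(p for s, p in strings if regex.search(s))

if __name__ == "__main__":
    # Larger k stores more data but recovers more of the match probability.
    for k in (1, 2, 4, 8):
        kept = keep_top_k(ocr_line, k)
        prob = match_probability(kept, r"F[o0]rd")
        print(f"k={k}: stored {len(kept)} strings, P(match)={prob:.4f}")
```

With k = 1 the sketch degenerates to plain text and the query finds nothing. Raising k stores more strings and recovers more of the match probability, until the full set of strings is kept.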

Staccato currently supports OCR data. However, the underlying model, transducers, can also capture uncertainty in other domains such as speech and sensor data.

Staccato is implemented in C++ on top of PostgreSQL. The source code and datasets are available at the download page. Staccato is released under the Apache License, Version 2.0.

For more technical details about Staccato, please refer to our upcoming VLDB 2012 paper, or the more detailed technical report.

Acknowledgements

Staccato is generously supported by the Microsoft Jim Gray Systems Lab, the National Science Foundation CAREER Award under IIS-1054009, and the Office of Naval Research under award no. N000141210041. Any opinions, findings, conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of any of the above sponsors, including the US government and Microsoft.