Project CRFlex

This project is a collaboration with Fei Chen and Min Wang at HP Labs China.

Introduction

Statistical Information Extraction (IE) is important in text processing applications. Example projects that use statistical IE include HP Labs isWiki, MSR Academic Search, and YAGO at MPI. Arguably, the major problem of deploying statistical IE programs to large scale real-world systems is runtime efficiency. To attack this problem, our observation is to take advantage of the evolving nature of the underlying text documents. For a long standing text corpus like Wikipedia, the change within a fixed time period (e.g. 3 weeks) is orders of magnitude smaller than the whole corpus in size. Thus, if we can devise a system that efficiently recycles the results extracted by statistical IE and perform incremental updates, we can improve the runtime efficiency while still guarantee the same results that naive methods output.

What CRFlex Is

CRFlex is a system that optimizes runtime efficiency of statistical IE programs over real-world corpora by recycling. It focuses on the most popular statistical IE model - conditional random fields (CRFs).

What CRFlex Does

The first key challenge in buliding CRFlex is to enable recycling with a finer granularity than document-level, and we aim at token-level recycling. To achieve token-level recycling, we decompose the CRF inference procedure into three steps: (i) computing feature functions, (ii) construsting factor graphs, and (iii) Viterbi inference. For each step, we define the context, which ensures that the output can be safely recycled if the context of its input remains the same.

Although recycling can reduce computational time, it requires intermediate results to be materialized, which has extra IO overhead. Our experiments have shown that it is not always best to materialize all intermediate results in terms of total runtime efficiency. Therefore, in CRFlex, we build a cost-based model to support recycling plan selection.

Our extensive experiments over Wikipedia indicate that CRFlex achieves up to 10X speed-up compared to previous methods.

For more technical details about CRFlex, please refer to our ICDE 2012 paper, or the more detailed technical report.

Acknowledgements

CRFlex is generously supported by the Air Force Research Laboratory under prime contract No. FA8750-09-C-0181, Office of Naval Research Award under No. N000141210041, the NSF CAREER Award under IIS-1054009, the University of Wisconsin-Madison, and gifts or research awards from Google, LogicBlox, and Johnson Controls, Inc. Any opinions, findings, and conclusion or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of any of the above sponsors.