Classification is a fundamental task in statistical data processing. Because many real-world applications operate in a dynamic environment (i.e., they accrue data continuously over time), it is unwieldy and sometimes infeasible to regenerate the classification model every time the underlying data are updated. To address this issue, Hazy-Classify takes the model-based views approach and incrementally maintains classification models as training data evolve. The novel adaptive eager-lazy tradeoff algorithm inside Hazy-Classify makes it orders of magnitude more efficient than baseline solutions.
The key motivating observation for Hazy-Classify is that, when a small portion of the training data (say, a couple of examples) is updated, the corresponding optimal classification model changes only slightly, both in theory and empirically. For example, the change in the weight vector of an SVM can be effectively bounded as a function of the incremental updates to the training data. This relatively stable evolution of the model in turn means that only a small portion of the predictions (on testing or production data) needs to be adjusted. Hazy-Classify manages both data (including training data and prediction results) and models inside an RDBMS. A prototype was implemented using UDFs on top of PostgreSQL and is available for download.
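To make the observation concrete, here is a minimal Python sketch of why a small model change touches only a few predictions. It is an illustration under stated assumptions, not the Hazy-Classify algorithm or its UDF code: it assumes a linear model that predicts sign(w·x - b), and the function name `candidates_to_recheck` and its helpers are hypothetical. By the Cauchy-Schwarz inequality, a point's score can move by at most ||w_new - w_old|| * ||x|| + |b_new - b_old|, so only points whose old score lies within that distance of zero can change label.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm of a vector given as a list."""
    return math.sqrt(dot(u, u))


def candidates_to_recheck(points, w_old, b_old, w_new, b_new):
    """Return indices of points whose predicted label might change.

    Hypothetical helper: assumes labels are sign(w . x - b).
    """
    delta_w = norm([a - b for a, b in zip(w_new, w_old)])
    delta_b = abs(b_new - b_old)
    candidates = []
    for i, x in enumerate(points):
        old_score = dot(w_old, x) - b_old
        # The score can shift by at most this much under the new model.
        max_shift = delta_w * norm(x) + delta_b
        if abs(old_score) <= max_shift:
            candidates.append(i)
    return candidates
```

For instance, with the points `[5.0, 0.0]`, `[0.1, 2.0]`, `[-4.0, 1.0]`, and `[0.05, -3.0]`, moving the weights from `[1.0, 0.0]` to `[0.9, 0.1]` with the bias fixed at 0 leaves only the second and fourth points as candidates; the other two are far enough from the decision boundary that their labels cannot change. The filter is conservative: a candidate may keep its label, but no point outside the candidate set can flip.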
Hazy-Classify is released under the GPL v3 license.
For more technical details about Hazy-Classify, please refer to our VLDB 2011 paper.
Support
Hazy-Classify is generously supported by the Air Force Research Laboratory (AFRL) under prime contract no. FA8750-09-C-0181, the National Science Foundation under IIS-1054009, the University of Wisconsin-Madison, and gifts or research awards from Microsoft, Google, and Johnson Controls, Inc. Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of any of the above sponsors, including DARPA, AFRL, or the US government.