Elementary: A Storage Manager for Scalable Statistical Inference and Learning
[Version 0.3: pre-alpha release!]
Check out our demos DeepDive and GeoDeepDive, which were built with Elementary!
Inference and learning over probabilistic graphical models have become important components of data analytics. The lack of a general framework for performing such tasks on terabytes of data may limit the impact of these powerful probabilistic approaches on real-world applications. Elementary is a step toward a general framework that can execute inference and learning over data sets that are larger than main memory.
The Elementary twist is to store data on secondary storage, e.g., local file systems, Accumulo, or HBase, and run statistical inference and learning in an in-memory buffer. The key challenge for a general framework is how to make execution I/O efficient. Elementary revisits classic I/O tradeoffs studied by the database community and adapts them to terabyte-scale analytics. In this way, Elementary has been successfully applied to complex models, e.g., Latent Dirichlet Allocation (LDA), over 500 million documents.
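To make the buffer idea concrete, here is a minimal sketch of an in-memory buffer placed in front of a slower key-value store. It is an illustration only, not Elementary's actual code: the class name `BufferedStore`, the LRU eviction policy, and the write-back behavior are all assumptions, and a plain Python dict stands in for files, Accumulo, or HBase.

```python
from collections import OrderedDict


class BufferedStore:
    """Hypothetical LRU buffer over a dict-like backend (stand-in for disk/HBase)."""

    def __init__(self, backend, capacity):
        self.backend = backend        # slow store: any object with dict-style access
        self.capacity = capacity      # maximum number of entries held in memory
        self.cache = OrderedDict()    # key -> value, ordered oldest to newest use
        self.dirty = set()            # keys modified in memory but not yet written back
        self.reads = 0                # number of backend reads (a proxy for I/O cost)

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as most recently used
            return self.cache[key]
        self.reads += 1                   # buffer miss: go to the backend
        value = self.backend[key]
        self._admit(key, value)
        return value

    def put(self, key, value):
        self.dirty.add(key)
        self._admit(key, value)

    def _admit(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            old_key, old_value = self.cache.popitem(last=False)  # evict least recently used
            if old_key in self.dirty:                            # write back only if modified
                self.backend[old_key] = old_value
                self.dirty.discard(old_key)

    def flush(self):
        for key in list(self.dirty):
            self.backend[key] = self.cache[key]
        self.dirty.clear()


# Usage: a 2-entry buffer over a 5-entry "disk".
disk = {i: i * 10 for i in range(5)}
store = BufferedStore(disk, capacity=2)
store.get(0)
store.get(1)
store.get(0)           # served from memory, no backend read
print(store.reads)     # backend reads so far
```

The `reads` counter makes the I/O tradeoff visible: the order in which a sampler visits keys determines how often the buffer misses, which is the kind of cost a storage manager would try to minimize.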
Elementary models statistical inference and learning tasks as factor graphs and runs Gibbs sampling over them. A factor graph represents random variables and their correlations as a bipartite graph of variable nodes and factor nodes. Elementary stores a factor graph as a set of key-value relations, either on disk or in main memory. Like Tuffy and Felix, Elementary accepts Markov Logic Network (MLN) programs, and first translates an MLN program into a factor graph.
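As a rough sketch of the description above, the following stores a tiny factor graph as key-value maps and runs Gibbs sampling over binary variables. The layout (`variables`, `factors`, and the index built by `build_index`) and the factor function are hypothetical choices made for illustration; they are not Elementary's actual schema or API.

```python
import math
import random

# Hypothetical key-value relations (plain dicts here; they could live on disk):
#   variables: var_id    -> current value (0 or 1)
#   factors:   factor_id -> (weight, tuple of var_ids the factor touches)


def build_index(factors):
    """Build the var_id -> [factor_id] relation, i.e. the bipartite edges."""
    index = {}
    for fid, (_, var_ids) in factors.items():
        for v in var_ids:
            index.setdefault(v, []).append(fid)
    return index


def factor_energy(weight, values):
    """Illustrative factor: contributes `weight` when all its variables are 1."""
    return weight if all(values) else 0.0


def gibbs_sweep(variables, factors, index, rng):
    """Resample every variable once, conditioned on its neighbors' current values."""
    for v in variables:
        energies = []
        for candidate in (0, 1):
            total = 0.0
            for fid in index.get(v, []):
                weight, var_ids = factors[fid]
                values = [candidate if u == v else variables[u] for u in var_ids]
                total += factor_energy(weight, values)
            energies.append(total)
        # P(v = 1 | rest) = exp(e1) / (exp(e0) + exp(e1))
        p_one = 1.0 / (1.0 + math.exp(energies[0] - energies[1]))
        variables[v] = 1 if rng.random() < p_one else 0


# Usage: two variables, one unary factor and one pairwise factor.
factors = {"f1": (2.0, ("a",)), "f2": (1.0, ("a", "b"))}
variables = {"a": 0, "b": 0}
index = build_index(factors)
rng = random.Random(0)
for _ in range(100):
    gibbs_sweep(variables, factors, index, rng)
```

Each update only needs the factors adjacent to one variable and the current values of the variables those factors touch, which is why the graph maps naturally onto key lookups.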
Elementary 0.3 introduces support for BUGS, which has been widely used in sociology, statistics, and biology over the last decade. Compared with other implementations of BUGS, e.g., OpenBUGS, Elementary uses secondary storage to scale up inference over BUGS models. In the current pre-alpha release, Elementary supports 80% of the models in the OpenBUGS examples archive and scales to sociology models with hundreds of millions of random variables!
The current version of Elementary supports four storage backends: main memory, Unix files, Accumulo, and HBase. Elementary is capable of the following tasks over factor graphs:
- MAP inference, where we want to find the most likely assignment of random variables;
- Marginal inference, where we want to estimate the marginal probabilities of random variables;
- Weight learning, where we want to learn the weights of MLN rules, given training data;
- Distant supervision, where we want to learn the weights from what we know and predict what we do not know, given a partial set of training data.
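The tasks above can all be driven by samples drawn from the factor graph. The sketch below shows one common way to do this, assuming binary variables and a list of sampled assignments. The function names are hypothetical, approximating MAP by the best-scoring sample seen is a heuristic chosen for illustration, and the weight update is the standard log-linear gradient step (observed minus expected feature count) rather than a description of Elementary's own learner.

```python
from collections import Counter


def estimate_marginals(samples):
    """Estimate P(v = 1) for each variable as its frequency across samples.

    samples: non-empty list of dicts mapping var_id -> 0 or 1.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter()
    for sample in samples:
        for var_id, value in sample.items():
            counts[var_id] += value
    n = len(samples)
    return {var_id: counts[var_id] / n for var_id in samples[0]}


def best_assignment(samples, log_score):
    """Approximate MAP: return the sampled assignment with the highest score."""
    return max(samples, key=log_score)


def update_weight(weight, observed, expected, learning_rate):
    """One gradient-ascent step on the log-likelihood of a log-linear model.

    observed: count of the rule being satisfied in the training data.
    expected: its average count under samples from the current model.
    """
    return weight + learning_rate * (observed - expected)


# Usage with four hand-written samples over variables "a" and "b".
samples = [
    {"a": 1, "b": 0},
    {"a": 1, "b": 1},
    {"a": 0, "b": 1},
    {"a": 1, "b": 1},
]
print(estimate_marginals(samples))
print(best_assignment(samples, lambda s: 2.0 * s["a"] - 1.0 * s["b"]))
```

Under this view, distant supervision would reuse the same pieces: variables with known labels are held fixed as evidence during sampling, and the remaining variables are predicted from their estimated marginals.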