Project Elementary

Elementary: A Storage Manager for Scalable Statistical Inference and Learning

[Version 0.3: pre-alpha release!]

Check out our demos DeepDive and GeoDeepDive, which were built with Elementary!

Inference and learning over probabilistic graphical models have become important components of data analytics. The lack of a general framework for performing such tasks on terabytes of data may limit the impact of these powerful probabilistic approaches on real-world applications. Elementary is a step toward a general framework that can execute inference and learning over data sets that are larger than main memory.

The Elementary twist is to store data on secondary storage, e.g., local file systems, Accumulo, or HBase, and run statistical inference and learning in an in-memory buffer. The key challenge for a general framework is how to make execution I/O efficient. Elementary revisits classic I/O tradeoffs studied by the database community and adapts them to terabyte-scale analytics. In this way, Elementary has been successfully applied to complex models, e.g., Latent Dirichlet Allocation (LDA), over 500 million documents.
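The idea of keeping data on secondary storage and working through an in-memory buffer can be illustrated with a small sketch. This is not Elementary's code: the class name `BufferedStore`, its methods, and the least-recently-used eviction policy are all assumptions chosen for illustration, and a plain Python dict stands in for a file system, Accumulo, or HBase.

```python
from collections import OrderedDict


class BufferedStore:
    """Hypothetical LRU buffer in front of a slower key-value store."""

    def __init__(self, backing, capacity):
        self.backing = backing        # stands in for files, Accumulo, or HBase
        self.capacity = capacity      # maximum number of entries kept in memory
        self.buffer = OrderedDict()   # key -> value, least recently used first
        self.dirty = set()            # keys modified since they were loaded
        self.reads = 0                # fetches from the backing store
        self.writes = 0               # write-backs to the backing store

    def get(self, key):
        if key in self.buffer:        # buffer hit: no I/O
            self.buffer.move_to_end(key)
            return self.buffer[key]
        value = self.backing[key]     # buffer miss: one read from storage
        self.reads += 1
        self._admit(key, value)
        return value

    def put(self, key, value):
        self.dirty.add(key)
        self._admit(key, value)

    def _admit(self, key, value):
        self.buffer[key] = value
        self.buffer.move_to_end(key)
        while len(self.buffer) > self.capacity:
            old_key, old_value = self.buffer.popitem(last=False)
            if old_key in self.dirty:  # write back only modified entries
                self.backing[old_key] = old_value
                self.writes += 1
                self.dirty.discard(old_key)

    def flush(self):
        for key in list(self.dirty):
            self.backing[key] = self.buffer[key]
            self.writes += 1
        self.dirty.clear()


# Usage: ten values on "disk", but only three fit in memory at a time.
disk = {i: 0 for i in range(10)}
store = BufferedStore(disk, capacity=3)
for i in range(10):
    store.put(i, store.get(i) + 1)
store.flush()
```

In this sketch a sequential scan costs one read and one write per key, which is the kind of access pattern an I/O-efficient execution plan would try to preserve.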

Elementary models statistical inference and learning tasks using factor graphs, and runs Gibbs sampling over factor graphs. A factor graph models random variables and their correlations using a bipartite graph. Elementary stores a factor graph as a set of key-value relations, either on disk or in main memory. Like Tuffy and Felix, Elementary accepts Markov Logic Network (MLN) programs, and first translates an MLN program into a factor graph.
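To make the factor graph representation concrete, here is a minimal sketch of Gibbs sampling over a tiny factor graph held in key-value relations. The relation layout, the two factor kinds (`is_true` and `equal`), the weights, and all function names are assumptions made for this example; they do not describe Elementary's actual schema or sampler.

```python
import math
import random

# Hypothetical key-value relations for a three-variable chain a - b - c.
variables = {"a": 0, "b": 0, "c": 0}      # variable id -> current binary value
factors = {                                # factor id -> (weight, kind, variable ids)
    "f1": (2.0, "is_true", ("a",)),
    "f2": (2.0, "equal", ("a", "b")),
    "f3": (2.0, "equal", ("b", "c")),
}
# Inverted relation: variable id -> ids of the factors that touch it.
var_to_factors = {v: [] for v in variables}
for fid, (_, _, members) in factors.items():
    for m in members:
        var_to_factors[m].append(fid)


def factor_score(kind, values):
    """Return 1.0 if the factor is satisfied by the given values, else 0.0."""
    if kind == "is_true":
        return 1.0 if values[0] == 1 else 0.0
    if kind == "equal":
        return 1.0 if values[0] == values[1] else 0.0
    raise ValueError(f"unknown factor kind: {kind}")


def gibbs_sweep(state, factors, var_to_factors, rng):
    """Resample every variable once, conditioned on its neighbors."""
    for v in state:
        energy = []
        for candidate in (0, 1):
            total = 0.0
            for fid in var_to_factors[v]:
                weight, kind, members = factors[fid]
                values = [candidate if m == v else state[m] for m in members]
                total += weight * factor_score(kind, values)
            energy.append(total)
        # P(v = 1 | rest) is a logistic function of the energy difference.
        p_one = 1.0 / (1.0 + math.exp(energy[0] - energy[1]))
        state[v] = 1 if rng.random() < p_one else 0


def estimate_marginals(sweeps, burn_in, seed):
    """Estimate P(v = 1) for each variable by averaging Gibbs samples."""
    rng = random.Random(seed)
    state = dict(variables)
    counts = {v: 0 for v in state}
    for i in range(burn_in + sweeps):
        gibbs_sweep(state, factors, var_to_factors, rng)
        if i >= burn_in:
            for v, value in state.items():
                counts[v] += value
    return {v: counts[v] / sweeps for v in counts}


marginals = estimate_marginals(sweeps=20000, burn_in=500, seed=0)
```

Each resampling step only needs the factors adjacent to one variable, so the sampler touches the graph through simple key lookups, which is what makes a key-value layout a natural fit.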

Elementary 0.3 introduces support for BUGS, which has been widely used in sociology, statistics, and biology over the last decade. Compared with other implementations of BUGS, e.g., OpenBUGS, Elementary uses secondary storage to scale up inference of BUGS models. In the current pre-alpha release, Elementary supports 80% of the models in the OpenBUGS examples archive, and is able to scale up to sociology models with hundreds of millions of random variables!

The current version of Elementary supports four storage backends: main memory, Unix files, Accumulo, and HBase. Elementary is capable of the following tasks over factor graphs:

  • MAP inference, where we want to find out the most likely assignment of random variables;
  • Marginal inference, where we want to estimate the marginal probabilities of random variables;
  • Weight learning, where we want to learn the weights of MLN rules, given training data;
  • Distant supervision, where we want to learn the weights from what we know, and predict what we do not know, given a partial set of training data.
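
The difference between the first two tasks can be shown on a factor graph small enough to enumerate. The sketch below is illustrative only: the three-variable graph, its weights, and the factor kinds are assumptions, and it uses brute-force enumeration rather than the Gibbs sampling that Elementary runs. It assumes the usual log-linear form, where the probability of an assignment is proportional to the exponential of its total weighted factor score.

```python
import itertools
import math

names = ("a", "b", "c")
# Hypothetical factors: (weight, kind, variable names).
factors = [
    (2.0, "is_true", ("a",)),
    (2.0, "equal", ("a", "b")),
    (2.0, "equal", ("b", "c")),
]


def score(world):
    """Total weight of the factors satisfied by one full assignment."""
    total = 0.0
    for weight, kind, members in factors:
        values = [world[m] for m in members]
        if kind == "is_true":
            satisfied = values[0] == 1
        else:  # "equal"
            satisfied = values[0] == values[1]
        if satisfied:
            total += weight
    return total


# Enumerate all 2^3 assignments (feasible only for tiny graphs).
worlds = [dict(zip(names, bits)) for bits in itertools.product((0, 1), repeat=3)]
unnormalized = [math.exp(score(w)) for w in worlds]
z = sum(unnormalized)

# MAP inference: the single most likely joint assignment.
map_world = max(worlds, key=score)

# Marginal inference: P(v = 1) for each variable, summed over all assignments.
marginals = {
    n: sum(p for w, p in zip(worlds, unnormalized) if w[n] == 1) / z
    for n in names
}
```

MAP inference returns one joint assignment, while marginal inference returns one probability per variable; the two answers come from the same distribution but summarize it differently.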
To learn more about Elementary, please check out the following three videos.

1. Elementary 0.3 Pre-alpha Teaser!

2. How is Elementary used in our machine reading project?

3. How does Elementary achieve scalability?

Elementary is released under the Apache License V2.0. You can download the source code from our download page.

We gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) DEFT Program under Air Force Research Laboratory (AFRL) prime contract No. FA8750-13-2-0039, the National Science Foundation EAGER Award under No. EAR-1242902 and CAREER Award under No. IIS-1054009, the Office of Naval Research under awards No. N000141210041 and No. N000141310129, the Sloan Research Fellowship, American Family Insurance, and Google. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, AFRL, ONR, NSF, or the US government.