We focus on understanding how data processing systems change as we embed increasingly sophisticated analytics in them. The Bismarck project developed a unified architecture for machine-learning-style analytics in traditional data processing systems (relational databases and Hadoop) that is now used by Pivotal’s MADlib, Oracle, and a soon-to-be-released add-on for Cloudera’s Impala. To understand this problem, we have built prototypes such as Hogwild!, Jellyfish, and DimmWitted that let us explore fundamental tradeoffs in these systems (e.g., lock contention). More recently, the project has been examining how more sophisticated analytics, e.g., linear programming, can be developed in the same framework; see HottTopixx and Thetis for examples in this direction. Much of this work is inspired by our work on the IceCube project, a neutrino telescope at the South Pole.
- Columbus provides a declarative framework of operations for feature selection over in-RDBMS data.
- Victor-SQL integrates incremental schemes with an RDBMS via a (hopefully) easy-to-use Python interface.
- Bismarck unifies the underlying architecture of in-RDBMS analytics using incremental gradient schemes.
- Hogwild! introduces a new way of parallelizing incremental gradient algorithms. Its approach is simple: get rid of locking entirely! We prove that, as long as the data are sparse, Hogwild! achieves linear speedups.
- HottTopixx explores scalable nonnegative matrix factorization.
- Jellyfish is a large-scale parallel stochastic gradient algorithm for nonconvex relaxations of large-scale matrix completion. Jellyfish reaches the same error (RMSE) two orders of magnitude faster than any other algorithm that we know about!
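
The lock-free update pattern described in the Hogwild! entry above can be sketched in Python as follows. This is a minimal illustration under assumed conditions, not the project's implementation: the sparse least-squares problem, the function names (`make_sparse_data`, `worker`, `hogwild_sgd`), the step size, and the thread count are all hypothetical choices made for the example. Because CPython's global interpreter lock serializes bytecode execution, the sketch shows only the structure of the unsynchronized updates, not any real speedup.

```python
import threading

import numpy as np


def make_sparse_data(n_rows, n_cols, nnz_per_row, seed=0):
    """Build a synthetic sparse least-squares problem (assumed setup)."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=n_cols)
    rows = []
    for _ in range(n_rows):
        # Each example touches only a few coordinates: this is the sparsity
        # that makes conflicting writes between threads rare.
        idx = rng.choice(n_cols, size=nnz_per_row, replace=False)
        vals = rng.normal(size=nnz_per_row)
        y = float(vals @ w_true[idx])
        rows.append((idx, vals, y))
    return rows, w_true


def mean_squared_error(rows, w):
    """Average squared residual of model w over all examples."""
    return float(np.mean([(vals @ w[idx] - y) ** 2 for idx, vals, y in rows]))


def worker(rows, w, lr, epochs, seed):
    """Run SGD over this thread's examples, writing to shared w without locks."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(rows)):
            idx, vals, y = rows[i]
            err = vals @ w[idx] - y
            # Unsynchronized in-place update of only the touched coordinates.
            w[idx] -= lr * err * vals


def hogwild_sgd(rows, n_cols, n_threads=4, lr=0.05, epochs=20):
    """Lock-free parallel SGD sketch: all threads share one weight vector."""
    w = np.zeros(n_cols)
    chunks = [rows[t::n_threads] for t in range(n_threads)]
    threads = [
        threading.Thread(target=worker, args=(chunk, w, lr, epochs, t))
        for t, chunk in enumerate(chunks)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return w


if __name__ == "__main__":
    rows, _ = make_sparse_data(n_rows=2000, n_cols=200, nnz_per_row=5)
    w = hogwild_sgd(rows, n_cols=200)
    print("MSE before:", mean_squared_error(rows, np.zeros(200)))
    print("MSE after: ", mean_squared_error(rows, w))
```

The shared vector `w` is never protected by a lock; each thread reads and writes only the handful of coordinates its current example touches. Thread scheduling makes the exact result vary from run to run, so the sketch reports the error before and after training rather than a fixed value.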