Bismarck unifies the underlying architecture of in-RDBMS analytics using Incremental Gradient Descent (IGD).
What Bismarck Tries to Solve:
The increasing use of statistical data analysis in enterprise applications has created an arms race among database vendors to offer ever more sophisticated in-RDBMS analytics. One challenge in this race is that each new statistical technique must be implemented from scratch in the RDBMS, which leads to a lengthy and complex development process. We argue that the root cause of this overhead is the lack of a unified architecture for in-RDBMS analytics. Bismarck is our first step towards such a unified architecture.
How Bismarck Works:
Bismarck's idea is that IGD can be efficiently implemented in an RDBMS leveraging features already available in most RDBMSs. Bismarck integrates IGD within the framework of a User-Defined Aggregate (UDA). To implement analytics techniques like Logistic Regression, Support Vector Machine, Low-Rank Matrix Factorization, etc., the developer has to write the three standard functions of a UDA - Initialize, Transition and Terminate. The differences in the implementations of the analytics techniques lie mainly in the Transition function. By abstracting out the operations of IGD into cleanly defined functions, Bismarck is able to share most of the system code across analytics techniques.
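As a rough illustration of the UDA structure described above, the following is a minimal C sketch and not Bismarck's actual code. The names igd_state, igd_initialize, svm_transition, igd_terminate and igd_epoch, the fixed MAX_DIM bound, and the choice of an unregularized SVM hinge loss are all assumptions made for this example. The igd_epoch loop merely stands in for the RDBMS scanning a table and invoking the Transition function once per tuple.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIM 16  /* hypothetical upper bound on the model dimension */

/* Aggregation state carried across tuples: the model and the step size. */
typedef struct {
    size_t dim;
    double stepsize;
    double w[MAX_DIM];
} igd_state;

/* Initialize: start from the all-zero model. */
void igd_initialize(igd_state *s, size_t dim, double stepsize) {
    assert(dim <= MAX_DIM);
    s->dim = dim;
    s->stepsize = stepsize;
    for (size_t i = 0; i < dim; i++) s->w[i] = 0.0;
}

/* Transition: one incremental (sub)gradient step on a single tuple (x, y),
 * here for the hinge loss max(0, 1 - y * w.x) with y in {-1, +1}.
 * Regularization is omitted to keep the sketch short. */
void svm_transition(igd_state *s, const double *x, double y) {
    double wx = 0.0;
    for (size_t i = 0; i < s->dim; i++) wx += s->w[i] * x[i];
    if (y * wx < 1.0) {  /* margin violated: move w towards y * x */
        for (size_t i = 0; i < s->dim; i++) s->w[i] += s->stepsize * y * x[i];
    }
}

/* Terminate: hand back the learned model. */
const double *igd_terminate(const igd_state *s) {
    return s->w;
}

/* Stand-in for the RDBMS scan: call Transition once per row.
 * xs holds n rows of s->dim features each, stored row by row. */
void igd_epoch(igd_state *s, const double *xs, const double *ys, size_t n) {
    for (size_t r = 0; r < n; r++) svm_transition(s, xs + r * s->dim, ys[r]);
}
```

In this sketch only svm_transition knows anything about the SVM objective; the state, Initialize, Terminate and the scan loop would stay the same for another technique.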
By leveraging the shared-memory capabilities provided by almost all RDBMSs, Bismarck also provides a more efficient shared-memory implementation of IGD with no changes needed to the RDBMS code.
What Bismarck Can Achieve:
Bismarck lowers the development overhead to add a new analytics technique. For example, given an implementation of Logistic Regression in Bismarck, fewer than 20 lines of code (in C) need to change to add Support Vector Machine. Even a more complex technique like Low-Rank Matrix Factorization can be added with changes to fewer than 60 lines of code. Due to its generic nature, Bismarck can very easily be ported to any RDBMS. Furthermore, Bismarck's unified architecture enables us to apply performance optimizations like parallelism to several analytics techniques in a unified way rather than in an ad hoc, per-technique fashion.
By leveraging the RDBMS effectively, Bismarck achieves high performance and scalability on all the analytics techniques it handles. Bismarck is competitive with, and often much faster than, state-of-the-art commercial in-RDBMS analytics tools on several analytics techniques, while achieving the same quality.
For more technical details about Bismarck, please refer to our upcoming SIGMOD 2012 paper.
The source code and datasets are available at the download page. Bismarck is released under the Apache License, Version 2.0.
Support and Collaboration:
Bismarck is kindly supported by Greenplum and Oracle, and we are pleased to contribute to MADlib. We are also grateful for the generous support of the National Science Foundation CAREER Award under IIS-1054009 and the Office of Naval Research under award no. N000141210041.