Bismarck Usage with Examples
We now explain the syntax for using Bismarck to perform in-database analytics tasks. Currently, four tasks are implemented in Bismarck:
- Logistic Regression (LR)
- Support Vector Machine (SVM)
- Low-rank Matrix Factorization (LMF)
- Conditional Random Field (CRF)
Bismarck currently offers two front-ends for the user to invoke these tasks for training the models - a python script, and a set of SQL User-Defined Functions (UDFs). Applying the model over new data is done with a simple SQL UDF, as explained later.
Python-Based Front-end
Bismarck can be invoked using the python front-end file bismarck_front.py with the following usage:
python bismarck_front.py [spec_file]
Spec File
The [spec_file] is a plaintext file that describes the anlaytics task by providing values for a pre-defined set of attributes. Each line of the file specifies an attribute-value pair, with the following syntax:
[attribute] = [value]
Here is an example spec file of performing sparse LR on DBLife dataset:
model = 'sparse_logit' model_id = 22 data_table = 'dblife' feature_cols = 'k, v' label_col = 'label' ndims = 41270 stepsize = 0.5 decay = 0.9 is_shmem = True
We now explain the list of attributes, their possible values and what they mean. The attributes are split into groups based on their characteristics.
Value Required:
- model = 'sparse_logit' | 'dense_logit' | 'sparse_svm' | 'dense_svm' | 'factor' | 'crf' : The analytics task name
- model_id = [Integer] : The unique identifier for the learned model instance from a particular task
- data_table = [Quoted String] : Name of the data table
- label_col = [Quoted String] : Name of the label column in the data table
- feature_cols = [Quoted String] : Names(s) of the feature column(s) in the data table. This is task specific:
Dense LR / SVM : Single name of the feature values column Sparse LR / SVM : Comma separated pair of names of the indices column and values column LMF : Comma separated pair of names of the row-index column and column-index column CRF : Comma separated pair of names of the unigrams column and bigrams column
Task-Specific - CRF Only:
- nulines = [Integer] : Number of unigram lines used in the template file, in CRF++ format
- nblines = [Integer] : Number of bigram lines used in the template file, in CRF++ format
- nlabels = [Integer] : Number of possible labels
Task-Specific - LR / SVM / CRF:
- ndims = [Integer] : Number of dimensions the model weight vector contains
- mu = [Double Precision] with Default 1e-2 : Regularization factor
Task-Specific - LMF Only:
- nrows = [Integer] : Number of rows of the matrix to be factored
- ncols = [Integer] : Number of cols of the matrix to be factored
- maxrank = [Integer] : Number of features used
- B = [Double Precision] with Default 2.0 : Regularization factor, max-norm ball radius
- initrange = [Double Precision] with Default 0.01 : Standard deviation of the normal distribution used to draw initial values
Default Value Available:
- verbose = False | True : Whether to print all messages
- is_shmem = False | True : Whether to run the shared-memory version
- is_shuffle = True | False : Whether to shuffle the dataset once before running IGD; writes the data to a new table
- num_iters = [Integer] with Default 20 : Number of iterations for model training
- stepsize = [Double Precision] with Default 0.1 : Initial stepsize for gradient methods
- decay = [Double Precision] with Default 1.0 : Stepsize decay factor, applied once per iteration
Optional (and no default):
- tolerance = [Double Precision] : If set, the training process may terminate before finishing specified number of iteration when improvement is smaller than tolerance
- output_file = [Quoted String] : If set, the trained weight vector of the model will be written out as a tsv file with file name specified
SQL-Based Front-end
Bismarck can also be invoked from within the RDBMS using a set of pre-defined SQL User-Defined Functions (UDFs). The attributes that were provided in a text file for the python front-end are instead provided as arguments to the UDFs. The UDFs allow the user to carry out the analytics task entirely within the RDBMS. Each of the four tasks has a pair of UDFs associated with it. The general syntax is the following:
SELECT [udfname] ( [argument list] );
IMPORTANT: When using the SQL-based front-end, the column names are assumed to be the same as given in the examples in the bismarck_data folder. This is done in order to avoid making the function signatures too long.
SQL Function Signatures
We give the SQL function signatures here. The semantics of the arguments are the same as in the python-based front-end. The following functions can also be invoked without values specified for parameters with default values (those with underlines). To find examples, see the Examples page.
Logistic Regression and Support Vector Machine:
CREATE FUNCTION dense_logit( data_table text, model_id integer, ndims integer, iteration integer, mu double precision, stepsize double precision, decay double precision, is_shmem boolean, is_shuffle boolean) RETURNS VOID ...;
The same argument list above are shared with 'sparse_logit', 'dense_svm' and 'sparse_svm'. After finishing training,
the model will be available in table linear_model
with specified model id (column name mid
).
Low-rank Matrix Factorization:
CREATE FUNCTION factor( data_table text, model_id integer, nrows integer, ncols integer, maxrank integer, iteration integer, b double precision, initrange double precision, stepsize double precision, decay double precision, is_shmem boolean, is_shuffle boolean) RETURNS VOID ...;
After finishing training, the model will be available in table factor_model
with specified model id (column name mid
).
Conditional Random Field:
CREATE FUNCTION crf( data_table text, model_id integer, ndims integer, nlabels integer, nulines integer, nblines integer, iteration integer, mu double precision, stepsize double precision, decay double precision, is_shmem boolean, is_shuffle boolean) RETURNS VOID ...;
After finishing training, the model will be available in table crf_model
with specified model id (column name mid
).
Model Application
Bismarck offers SQL functions to perform prediction using the model inside model tables
(linear_model
, factor_model
, and crf_model
).
Here is an example of calling the prediction function of sparse logistic regression (with binary classes).
SELECT sparse_logit_init(22); CREATE TABLE dblife_pred AS SELECT did, sparse_logit_pred(22, k, v) FROM dblife; SELECT sparse_logit_clear(22);
The first step loads the specified model into memory. It prints an error if the given model id does not exist.
The second step invokes the prediction function that takes in the model id and the features columns of the unlabeled data.
The function applies the model on each data point and returns the LR probability (the probability of belonging to the positive class).
The third step removes the model from memory.
Similarly, the prediction function for SVM (e.g. dense_svm_pred
) returns the predicted class label for that data point
and the one for LMF (factor_pred
) returns the predicted label value. As for CRF, crf_pred
will perform Viterbi inference
and return the most likely label sequence.
For more detailed examples, please refer to the Examples Page.