Victor: Logistic Regression
Problem Definition
Logistic regression is part of a category of statistical models called generalized linear models. Logistic regression allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these (Definition taken from here).
Data set
DBLife is a prototype system that manages information for the database research community (see dblife.cs.wisc.edu). It is developed by the Database group at the University of Wisconsin-Madison and Yahoo! Research, and it contains information about database papers, conferences, people, and more.
Victor Model and Code
The following code shows the model specification and model instantiation for the Logistic Regression problem applied to the DBLife Web site data set. This model is used to classify papers in the DBLife Web site into categories.
-- This deletes the model specification
DELETE MODEL SPECIFICATION logit_l1_two;

-- This creates the model specification
CREATE MODEL SPECIFICATION logit_l1_two (
    model_type=(python) as (w),
    data_item_type=(int[], float8[], int) as (k,v, label),
    objective=examples.LINEAR_MODELS.loss_functions.logit_loss_ss,
    objective_agg=SUM,
    grad_step=examples.LOGISTIC_REGRESSION.logit_l1_sparse_split.logit_l1_grad
);

-- This instantiates the model
CREATE MODEL INSTANCE paperareami
    EXAMPLES dblife_tfidf_split(k, v, label)
    MODEL SPEC logit_l1_two
    INIT_FUNCTION examples.LOGISTIC_REGRESSION.logit_l1_sparse_split.init
    STOP WHEN examples.LOGISTIC_REGRESSION.logit_l1_sparse_split.stop_condition
;
Above we have defined a python-type model, which means that it is stored as a byte array in the database. Each data item is composed of three values: index k, vector v, and label, which are stored as an integer vector, a float vector, and an integer, respectively. (Note: the labels must be 1 and -1, not 1 and 0.) We specify the loss function and state that the scores are aggregated by the SUM aggregator. Finally, we define the gradient step for the model.
In the code section below, you can see the loss and gradient functions that the user provides. Note that this code is defined in a few lines of python using the utilities that Victor provides.
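The requirement that labels be 1 and -1 follows from the form of the logistic loss used in this example. As a sketch in standard notation (the symbols w, x_i, y_i, and σ are introduced here for illustration and are not names from the Victor code), the per-example loss and its gradient can be written as:

```latex
\ell(w; x_i, y_i) = \log\bigl(1 + \exp(-y_i \, w^\top x_i)\bigr),
\qquad y_i \in \{-1, +1\}

\nabla_w \, \ell(w; x_i, y_i) = -\, y_i \, \sigma(-y_i \, w^\top x_i) \, x_i,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

With a label of 0, the loss would equal log 2 for every w and its gradient would vanish, so such an example would contribute nothing to training. Under the SUM aggregator, the overall objective is the sum of these per-example losses.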
# logit l1 sparse gradient function
def logit_l1_grad( model, (indexes, vectors, y) ):
    lm = model[0]
    wx = victor_utils.dot_dss(lm.w,indexes,vectors)
    sig = victor_utils.sigma(-wx*y)
    victor_utils.scale_and_add_dss(lm.w,indexes, vectors, lm.stepsize*y*sig)
    victor_utils.l1_shrink_mask(lm.w, lm.mu*lm.stepsize,indexes)
    model[0].take_step()
    return model

# Calculates logit loss for sparse vectors and returns the value
def logit_loss_ss(model, (index, vecs, y) ):
    lm = model[0]
    wx = victor_utils.dot_dss(lm.w, index, vecs)
    err = log( 1+ exp( - y*wx ) )
    return err
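To make the arithmetic of the gradient step concrete, here is a minimal standalone sketch in plain Python that does not use Victor. The helpers dot_sparse, sigma, and soft_threshold are hypothetical stand-ins for what victor_utils.dot_dss, victor_utils.sigma, and victor_utils.l1_shrink_mask are assumed to do, based only on how they are called above. Their actual behavior in Victor may differ, and the fixed stepsize here ignores whatever take_step() does.

```python
import math


def dot_sparse(w, indexes, values):
    """Dot product of a dense list w with a sparse vector (indexes, values)."""
    return sum(w[i] * v for i, v in zip(indexes, values))


def sigma(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(x, t):
    """Shrink x toward zero by t, clamping at zero (assumed L1 shrink)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def logit_l1_step(w, indexes, values, y, stepsize, mu):
    """One stochastic gradient step on the logistic loss, then L1 shrinkage.

    Assumes indexes contains no duplicates and y is +1 or -1.
    """
    wx = dot_sparse(w, indexes, values)
    sig = sigma(-wx * y)
    # Move against the gradient -y * sig * x, touching only nonzero coordinates.
    for i, v in zip(indexes, values):
        w[i] += stepsize * y * sig * v
    # Shrink only the coordinates that were just updated.
    for i in indexes:
        w[i] = soft_threshold(w[i], mu * stepsize)
    return w


def logit_loss(w, indexes, values, y):
    """Logistic loss log(1 + exp(-y * w.x)) for one sparse example."""
    return math.log(1.0 + math.exp(-y * dot_sparse(w, indexes, values)))


# Tiny illustrative example (made-up data, not from DBLife).
w = [0.0, 0.0, 0.0, 0.0]
loss_before = logit_loss(w, [0, 2], [1.0, 2.0], 1)
logit_l1_step(w, [0, 2], [1.0, 2.0], 1, stepsize=0.5, mu=0.01)
loss_after = logit_loss(w, [0, 2], [1.0, 2.0], 1)
```

In this toy run only coordinates 0 and 2 change, and the loss on the example decreases after the step. Restricting the shrinkage to the touched indexes keeps the cost of each step proportional to the number of nonzero entries rather than to the model dimension.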
For instantiating the model, we specify how to initialize the model by giving it a function name. Also, we specify when we should stop refining the model. Again, these functions are written in a few lines of python code as seen below:
def init():
    return (simple_linear.LinearModel(41270,mu=1e-2),)

def stop_condition(s, loss):
    if not (s.has_key('state')):
        s['state'] = 0
    s['state'] += 1
    return s['state'] > 10
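The stop condition above counts its own invocations and ignores the loss argument. The sketch below shows how such a function might be driven. It assumes, without confirmation from the Victor source, that the function is called once per pass with a persistent dictionary and the current loss. The driver run_until_stopped and the dummy loss sequence are hypothetical, and the check is spelled with `in` because dict.has_key was removed in Python 3.

```python
def stop_condition(s, loss):
    # Same logic as above, written for Python 3.
    if 'state' not in s:
        s['state'] = 0
    s['state'] += 1
    return s['state'] > 10


def run_until_stopped(epoch_fn, stop_fn):
    """Hypothetical driver: run one pass, then ask stop_fn whether to stop."""
    state = {}  # persists across calls so the counter accumulates
    epochs = 0
    while True:
        loss = epoch_fn(epochs)
        epochs += 1
        if stop_fn(state, loss):
            return epochs, loss


# Dummy "training pass" that just returns a shrinking loss value.
epochs, final_loss = run_until_stopped(lambda e: 1.0 / (e + 1), stop_condition)
```

Under these assumptions the loop ends on the eleventh call, since the counter must exceed 10. A loss-based criterion could use the second argument instead, for example by storing the previous loss in the same dictionary and stopping when the change is small.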
Running the Example
Use the following command to run the logistic regression example.
$ VICTOR_SQL/bin/victor_front.py VICTOR_SQL/examples/LOGISTIC_REGRESSION/logit_l1.spec
The expected output for this example is shown in the installation guide.