Victor: Robust Average
Problem Definition
Robust statistics seeks to provide methods that emulate popular statistical methods but are not unduly affected by outliers or other small departures from model assumptions. Classical statistical methods rely heavily on assumptions that are often not met in practice; in particular, it is often assumed that the data errors are normally distributed, at least approximately, or that the central limit theorem can be relied on to produce normally distributed estimates (definition adapted from the Wikipedia article on robust statistics).
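As a concrete illustration of the idea (not part of the Victor example itself; the data values below are made up), a plain mean can be pulled arbitrarily far by a single bad value, while a robust estimator such as the median is barely affected:

# Illustration only: the data values below are made up.
data = [2.0, 2.1, 1.9, 2.2, 2.0, 100.0]   # one gross outlier

mean = sum(data) / len(data)              # pulled toward the outlier: ~18.4
median = sorted(data)[len(data) // 2]     # (upper) middle value: 2.1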
Data set
DBLife is a prototype system that manages information for the database research community (see dblife.cs.wisc.edu). DBLife is developed by the database group at the University of Wisconsin-Madison and Yahoo! Research. It contains information about database papers, conferences, people, and more.
Victor Model and Code
The following code shows the model specification and model instantiation for the Robust Averaging problem applied to the DBLife Web site data set.
-- This deletes the model specification
DELETE MODEL SPECIFICATION robust_logit;

-- This creates the model specification
CREATE MODEL SPECIFICATION robust_logit (
    model_type=(python,python) as (w,what),
    data_item_type=(int[], float8[], int) as (k, v, label),
    objective=examples.LINEAR_MODELS.loss_functions.logit_loss_ss,
    objective_agg=SUM,
    grad_step=examples.ROBUST_AVERAGE.robust_average.ell2_ball_constraint_avg
);

-- This instantiates the model
CREATE MODEL INSTANCE dblife_l2_robust
    EXAMPLES dblife_tfidf_split(k, v, label)
    MODEL SPEC robust_logit
    INIT_FUNCTION examples.ROBUST_AVERAGE.robust_average.init
    STOP WHEN examples.ROBUST_AVERAGE.robust_average.stopping_condition
;
This specification creates a "robust_logit" model whose type is defined as a (python,python) pair of model objects (w, what). Each data item is composed of three values: the index array k, the value array v, and the label, stored as an integer array, a float array, and an integer respectively (k and v together encode a sparse feature vector). We then specify the loss function, state that the per-example losses are aggregated by the SUM aggregator, and finally define the gradient step for the model.
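For intuition, a single data item in this layout is just a sparse feature vector stored as two parallel arrays plus a label. The values below are hypothetical and purely illustrative:

# Hypothetical data item in the (k, v, label) layout described above.
k = [3, 17, 41268]          # int[]   : indices of the non-zero features
v = [0.12, 0.87, 0.05]      # float8[]: feature values at those indices
label = 1                   # int     : class label (e.g. +1 / -1)

example = (k, v, label)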
The code section below shows the loss function and the gradient-step function that the user provides. Note that each is defined in a few lines of Python using the utilities that Victor provides.
# Calculates logit loss for sparse vectors and returns the value
def logit_loss_ss(model, (index, vecs, y)):
    lm = model[0]
    wx = victor_utils.dot_dss(lm.w, index, vecs)
    err = log(1 + exp(-y * wx))
    return err

def ell2_ball_constraint_avg((m, mhat), (indexes, values, y)):
    wx = victor_utils.dot_dss(m.w, indexes, values)
    err = (wx - y)
    etd = -m.stepsize * err
    victor_utils.scale_and_add_dss(m.w, indexes, values, etd)
    victor_utils.l2_project(m.w, m.B)
    incremental_average_sparse(mhat.w, m.w, m.stepsize, indexes)
    m.take_step()
    return (m, mhat)
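The helper routines used above come from Victor's victor_utils module, whose implementations are not shown in this example. As a rough, purely illustrative sketch (inferred from how the helpers are used; the real victor_utils code may differ), they compute roughly the following, assuming a dense model vector w given as a Python list and a sparse example given as parallel index/value arrays:

from math import sqrt

def dot_dss(w, indexes, values):
    # dense-sparse dot product: sum over i of w[indexes[i]] * values[i]
    return sum(w[i] * x for i, x in zip(indexes, values))

def scale_and_add_dss(w, indexes, values, c):
    # sparse update: w[indexes[i]] += c * values[i]
    for i, x in zip(indexes, values):
        w[i] += c * x

def l2_project(w, B):
    # project w back onto the L2 ball of radius B
    norm = sqrt(sum(x * x for x in w))
    if norm > B:
        scale = B / norm
        for i in range(len(w)):
            w[i] *= scale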
To instantiate the model, we specify how to initialize it by naming an initialization function, and we specify when to stop refining it by naming a stopping condition. Again, these functions are written in a few lines of Python, as seen below:
def init():
    m  = simple_linear.LinearModel(41270, B=1.5)
    m2 = simple_linear.LinearModel(41270, B=1.5)
    return (m, m2)

def stopping_condition(s, loss):
    if not s.has_key('state'):
        s['state'] = 0
    s['state'] += 1
    return s['state'] > 2
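The stopping condition keeps its own counter in the state dictionary s that is passed back on every call, so the instance above stops after three passes over the data. The loop below is only an illustrative driver to show how such a condition behaves; it is not Victor's actual implementation:

# Illustrative driver loop; Victor's actual driver is not shown here.
state = {}
epochs = 0
while True:
    epochs += 1
    loss = 0.0   # stand-in for the aggregated objective of this pass
    if stopping_condition(state, loss):
        break
# epochs == 3: the condition returns True on the third call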
Running the Example
Run the following commands to execute the example. Note that the model specification and the model instantiation are kept in separate files here; this does not affect the results, since the SQL commands run independently.
$ cd VICTOR_SQL/examples/ROBUST_AVERAGE/
$ ../../bin/victor_front.py robust_avg.create.spec
$ ../../bin/victor_front.py robust_avg.mi.spec