Victor: Robust Average

Problem Definition

Robust statistics seeks to provide methods that emulate popular statistical methods, but which are not unduly affected by outliers or other small departures from model assumptions. In statistics, classical methods rely heavily on assumptions which are often not met in practice. In particular, it is often assumed that the data errors are normally distributed, at least approximately, or that the central limit theorem can be relied on to produce normally distributed estimates (Definition taken from here).
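
As a small illustration of why outliers matter, the sketch below compares the mean and the median of a made-up sample before and after adding one gross outlier. The numbers are hypothetical and are not taken from DBLife, and the median is used only as a familiar robust estimator; it is not the method that the Victor example implements.

```python
from statistics import mean, median

# Hypothetical measurements; the value appended below is a gross outlier.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean + [1000.0]

def summarize(values):
    """Return (mean, median) of a list of numbers."""
    return mean(values), median(values)

clean_mean, clean_median = summarize(clean)
bad_mean, bad_median = summarize(contaminated)

# A single bad point drags the mean far from the bulk of the data,
# while the median stays close to 10.
print("clean:        mean=%.2f median=%.2f" % (clean_mean, clean_median))
print("contaminated: mean=%.2f median=%.2f" % (bad_mean, bad_median))
```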

Data set

DBLife is a prototype system that manages information for the database research community. It is developed by the Database group at University of Wisconsin-Madison and Yahoo! Research, and it contains information about database papers, conferences, people, and more.

Victor Model and Code

The following code shows the model specification and model instantiation for the Robust Averaging problem applied to the DBLife Web site data set.

-- This deletes the model specification

-- This creates the model specification
   model_type=(python,python) as (w,what),
   data_item_type=(int[], float8[], int) as (k,v, label),   

-- This instantiates the model
CREATE MODEL INSTANCE dblife_l2_robust
   EXAMPLES dblife_tfidf_split(k, v, label)
   MODEL SPEC robust_logit
   INIT_FUNCTION examples.ROBUST_AVERAGE.robust_average.init
   STOP WHEN examples.ROBUST_AVERAGE.robust_average.stopping_condition

1. Model Specification

This specification creates a "robust_logit" model whose type is a (python,python) pair. Each data item consists of three values: the index vector k, the value vector v, and the label, stored as an integer array, a float array, and an integer respectively. The specification also names the loss function, states that the scores are aggregated with the SUM aggregator, and defines the gradient step for the model.

The code below shows the loss and gradient functions that the user provides. Both are written in a few lines of Python using the utilities that Victor provides.

from math import exp, log

# Calculates logit loss for sparse vectors and returns the value
def logit_loss_ss(model, example):
   # example is an (index, vecs, y) tuple; model[0] is the current model
   index, vecs, y = example
   lm  = model[0]
   wx  = victor_utils.dot_dss(lm.w, index, vecs)
   err = log( 1+ exp( - y*wx ) )
   return err

# Takes one gradient step on m, applies the l2 projection with bound m.B,
# and updates the averaged model mhat
def ell2_ball_constraint_avg(model, example):
   m, mhat = model
   indexes, values, y = example
   wx  = victor_utils.dot_dss(m.w, indexes, values)
   err = (wx - y)
   etd = - m.stepsize * err
   victor_utils.scale_and_add_dss(m.w, indexes, values, etd)
   victor_utils.l2_project(m.w, m.B)
   incremental_average_sparse(mhat.w, m.w, m.stepsize, indexes)
   return (m,mhat)
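
To make the arithmetic of the two functions above concrete, here is a minimal, self-contained sketch that uses plain Python lists instead of Victor's model objects. The helpers dot_sparse, scale_and_add_sparse, and l2_project are hypothetical stand-ins whose behavior is guessed from the names of the victor_utils calls; they are not Victor's implementations. The uniform running average at the end is likewise an assumption, since the exact rule used by incremental_average_sparse is not shown here. Note that the step uses the residual wx - y, which is the derivative of the squared error 0.5*(wx - y)^2 with respect to wx, while the loss that is reported is the logistic loss.

```python
import math

def dot_sparse(w, indexes, values):
    """Dot product of dense weights w with a sparse vector (indexes, values)."""
    return sum(w[i] * v for i, v in zip(indexes, values))

def scale_and_add_sparse(w, indexes, values, scale):
    """In place: w[i] += scale * v for each nonzero entry of the sparse vector."""
    for i, v in zip(indexes, values):
        w[i] += scale * v

def l2_project(w, radius):
    """In place: rescale w so that its Euclidean norm is at most radius."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm > radius:
        factor = radius / norm
        for i in range(len(w)):
            w[i] *= factor

def logit_loss(w, indexes, values, y):
    """Logistic loss log(1 + exp(-y * <w, x>)) for a label y in {-1, +1}."""
    return math.log1p(math.exp(-y * dot_sparse(w, indexes, values)))

def projected_step(w, w_avg, steps_seen, example, stepsize, radius):
    """One update: residual gradient step, projection, then running average."""
    indexes, values, y = example
    err = dot_sparse(w, indexes, values) - y
    scale_and_add_sparse(w, indexes, values, -stepsize * err)
    l2_project(w, radius)
    steps_seen += 1
    # Uniform running average of the iterates (an assumed averaging rule).
    for i in range(len(w)):
        w_avg[i] += (w[i] - w_avg[i]) / steps_seen
    return steps_seen

# Tiny made-up example: three features, one sparse data item with label +1.
w = [0.0, 0.0, 0.0]
w_avg = [0.0, 0.0, 0.0]
example = ([0, 2], [3.0, 4.0], 1)
initial_loss = logit_loss(w, *example)
steps = projected_step(w, w_avg, 0, example, stepsize=0.1, radius=1.5)
```

Keeping a second, averaged weight vector next to the raw iterate mirrors the (m, mhat) pair returned by the gradient function.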

2. Model Instantiation

To instantiate the model, we name the function that initializes it and the function that decides when to stop refining it. Again, these functions are written in a few lines of Python, as shown below:

def init():
   m  = simple_linear.LinearModel(41270,B=1.5)
   m2 = simple_linear.LinearModel(41270,B=1.5)
   return (m,m2)

def stopping_condition(s, loss):
   # s is a dictionary that persists between calls; count the calls in it
   if 'state' not in s:
      s['state'] = 0
   s['state'] += 1
   # True from the third call onward
   return s['state'] > 2
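
The sketch below shows how such a stopping condition behaves when it is called repeatedly. The driver run_until_stopped is hypothetical: it assumes that the function is called once per pass over the data with the same dictionary each time, which is an assumption about Victor rather than something stated here. The loss values are made up.

```python
def stopping_condition(s, loss):
    # Same logic as the stopping condition above: count calls in s['state'].
    if 'state' not in s:
        s['state'] = 0
    s['state'] += 1
    return s['state'] > 2

def run_until_stopped(stop, losses):
    """Hypothetical driver: call stop once per pass with a persistent dict."""
    state = {}
    passes = 0
    for loss in losses:
        passes += 1
        if stop(state, loss):
            break
    return passes

# Made-up per-pass losses; the condition ignores them and only counts calls.
passes_run = run_until_stopped(stopping_condition, [0.9, 0.7, 0.6, 0.5, 0.4])
print(passes_run)
```

Under these assumptions the loop ends on the third call, regardless of the loss values passed in.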

3. Model Application

Coming soon.

Running the Example

Run the following commands to execute the example. The model specification and the model instantiation are kept in separate files; this does not affect the results, since the SQL commands run independently.

$ ../../bin/ robust_avg.create.spec
$ ../../bin/ robust_avg.mi.spec