Victor Syntax by Example
Victor syntax includes "Model Specification", "Model Instantiation", and "Model Application" features. Syntax for each feature is explained below:
The user first creates a model "specification" that has an objective function and a gradient step. A model "instance" then creates an initial model, which is refined using the given data "examples". Finally, the model instance is used in a model "application".
In the examples section we have explained simple models that we implemented for the following tasks:
- Simple Linear Models : Logistic Regression
- Labeling : Conditional Random Fields (CRFs)
- Graph Partitioning : Minimum Cut
- Robust Statistics : Robust Average
Below we explain the Victor syntax with an example.
In the model specification phase, we set up the data structures and declare the model type. We also register the gradient and objective functions.
The user can write a model specification as:
    -- This deletes the model specification
    DELETE MODEL SPECIFICATION numpy_leastsquares_model;

    -- This creates the model specification
    CREATE MODEL SPECIFICATION numpy_leastsquares_model (
        model_type=(python,numpy) as (w,v), --(*)
        data_item_type=(python,python) as (vec, label), --(*)
        objective=examples.NUMPY_L2.numpy_l2.mse_loss,
        objective_agg=RMSE,
        grad_step=examples.NUMPY_L2.numpy_l2.grad, --(*)
        bulk_grad=examples.NUMPY_L2.numpy_l2.bulkgrad
    );
Below we explain each of the elements that the user has defined in the above specification:
The model_type field defines the type of the model. This spec creates a specification named "numpy_leastsquares_model" whose model type is defined as a (python,numpy) pair.
An element of Python type is stored as a byte array in the database. The Python type also allows users to work with numpy vectors. We can also use PostgreSQL types directly (such as float8[][] for natively stored matrices).
The data_item_type field defines the type of each data item. In this example, data items are stored as serialized pairs of a Python-typed vector (vec) and a Python-typed label (label).
objective is the "item" objective. It takes a fully qualified Python path (relative to the PYTHONPATH environment variable) to the user-defined function that computes the objective value. In this spec, the mse_loss function is in the numpy_l2 module, which lives in the examples.NUMPY_L2 package.
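To make the "item" objective concrete, here is a minimal sketch of what a squared-error objective for a single data item could look like. The signature (model first, then the vec and label fields of one data item) and the function body are assumptions for illustration, not the actual code in examples.NUMPY_L2.numpy_l2. Plain Python lists stand in for numpy vectors to keep the sketch dependency-free.

```python
def mse_loss(w, vec, label):
    """Squared error of a linear prediction for one data item.

    Hypothetical signature: w is the weight vector of the model,
    (vec, label) is a single data item.
    """
    prediction = sum(wi * xi for wi, xi in zip(w, vec))
    residual = prediction - label
    return residual ** 2


# One data item: prediction is 1*3 + 2*1 = 5, label is 4, residual is 1.
loss = mse_loss([1.0, 2.0], [3.0, 1.0], 4.0)
```

The value returned here is the per-item score that objective_agg later combines.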
objective_agg specifies how we combine the individual item scores. For example, we can sum all the squared errors, or, as in this spec, take their root mean square (RMSE). This function should be an aggregate function registered with PostgreSQL. In the near future, these aggregates can be written in Python as reducers.
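The computation behind an RMSE-style aggregate can be sketched as follows. In Victor the aggregate is a function registered with PostgreSQL; this Python version only illustrates the arithmetic, and the name rmse and its list-based input are assumptions made for the example.

```python
import math


def rmse(squared_errors):
    """Combine per-item squared errors into a root-mean-square error."""
    errors = list(squared_errors)
    if not errors:
        raise ValueError("cannot aggregate an empty set of scores")
    return math.sqrt(sum(errors) / len(errors))


# Three per-item squared errors with mean 3.0.
score = rmse([1.0, 4.0, 4.0])
```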
grad_step is the user-defined gradient function. This is again a fully qualified Python name.
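As an illustration, a gradient step for the least-squares objective on one data item might look like the sketch below. The signature, the fixed step_size parameter, and the choice to return the updated weights are all assumptions; the source does not show the actual grad function.

```python
def grad(w, vec, label, step_size=0.1):
    """One gradient step on the squared error of a single data item.

    Hypothetical signature. The gradient of (w.vec - label)^2 with
    respect to w is 2 * (w.vec - label) * vec.
    """
    prediction = sum(wi * xi for wi, xi in zip(w, vec))
    residual = prediction - label
    return [wi - step_size * 2.0 * residual * xi for wi, xi in zip(w, vec)]


# Starting from zero weights, one step moves w toward the label.
w1 = grad([0.0, 0.0], [1.0, 2.0], 1.0)
```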
bulk_grad is an option that tries to mitigate the high overhead of Python function calls. With this option, instead of getting a model and "data item", Victor gets a model and a "data set" and a permutation of that data set to work with. This reduces the number of Python function calls.
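The idea behind bulk_grad can be sketched as a single Python call that walks the whole data set in the order given by the permutation, instead of one call per data item. The signature and the in-memory list of (vec, label) pairs are assumptions for illustration only.

```python
def bulkgrad(w, data_set, permutation, step_size=0.1):
    """Apply one gradient step per data item, in permutation order.

    Hypothetical signature: data_set is a list of (vec, label) pairs and
    permutation is a list of indices into data_set.
    """
    for index in permutation:
        vec, label = data_set[index]
        prediction = sum(wi * xi for wi, xi in zip(w, vec))
        residual = prediction - label
        w = [wi - step_size * 2.0 * residual * xi for wi, xi in zip(w, vec)]
    return w


data_set = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
# A single call processes both items, visiting item 1 before item 0.
w = bulkgrad([0.0, 0.0], data_set, permutation=[1, 0])
```

One call per epoch replaces one call per data item, which is where the overhead saving comes from.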
Other notes:
The (*) marker in the comment after a code line shows that the element is required by Victor. Other elements are optional and are used when the user deems them necessary for their specific statistical problem.
Lines starting with "--" are treated as comments in Victor's syntax.
Note that we first delete the model specification and then create it. This is because a model specification cannot be created again under a name that already exists in the database.
The code below shows Victor's model instantiation syntax:
    CREATE MODEL INSTANCE forest_least_squares
        EXAMPLES forest_data(vec, label) --(*)
        MODEL SPEC numpy_leastsquares_model --(*)
        INIT_FUNCTION examples.NUMPY_L2.numpy_l2.init --(*)
        STOP WHEN examples.NUMPY_L2.numpy_l2.stopping_condition --(*)
    ;
In this example, we declare a model instance named "forest_least_squares" that uses the numpy_leastsquares_model MODEL SPEC (created in the Model Specification phase). This model instance uses the EXAMPLES from the forest_data table, whose fields are vec and label. The user also specifies how to initialize the model using INIT_FUNCTION (in this case, examples.NUMPY_L2.numpy_l2.init).
STOP WHEN is a python function that is called after each epoch to check if we should stop refining our objective value.
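A minimal sketch of what an initialization function and a stopping condition could look like is given below. The source does not show their signatures, so the arguments (a dimension for init, and a history of per-epoch objective values for stopping_condition), the tolerance, and the epoch limit are all hypothetical.

```python
def init(dimension):
    """Return an initial model: a zero weight vector (hypothetical signature)."""
    return [0.0] * dimension


def stopping_condition(objective_history, tolerance=1e-3, max_epochs=100):
    """Decide after an epoch whether to stop refining the model.

    Stops when the epoch limit is reached or when the objective changed
    by less than a relative tolerance since the previous epoch.
    """
    if len(objective_history) >= max_epochs:
        return True
    if len(objective_history) < 2:
        return False
    previous, current = objective_history[-2], objective_history[-1]
    return abs(previous - current) <= tolerance * max(abs(previous), 1e-12)


initial_model = init(3)
still_improving = stopping_condition([10.0, 5.0])   # large drop: keep going
converged = stopping_condition([5.0, 4.999])        # tiny drop: stop
```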
We can use the model instance that we created on an application as follows:
    CREATE MODEL APPLICATION FUNCTION test_numpy_leastsquares
        function_name=examples.NUMPY_L2.numpy_l2.dot(modelasd forest_least_squares, vec python)
        model_instance=forest_least_squares
        RETURNS FLOAT8
    ;
After creating the model specification and the model instance, we are now ready to use our model instance in different applications. For example, we can use the forest_least_squares model instance to classify the forest data set.
For the model application, we first specify the application name: test_numpy_leastsquares. We should also specify the function_name: the function (named "dot" in this example) that takes the model and a test data item and evaluates it (finds the label of the entity). Finally, we specify which model instance we want to use and the value type that the application function RETURNS.
We can use this application function by a simple SQL query on our test data as follows:
SELECT test_numpy_leastsquares(entity_vec) FROM forest_test_data;
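Conceptually, the query above applies the application function to every row of the test table. The sketch below mimics that in plain Python. The dot signature, the trained weights, and the in-memory test rows are hypothetical stand-ins for the model instance and the forest_test_data table.

```python
def dot(w, vec):
    """Score one test vector with the model weights (hypothetical signature)."""
    return sum(wi * xi for wi, xi in zip(w, vec))


trained_w = [0.5, -1.0]                 # stand-in for the refined model instance
test_rows = [[2.0, 1.0], [4.0, 1.0]]    # stand-in for entity_vec values

# Equivalent in spirit to calling the application function once per row.
predictions = [dot(trained_w, row) for row in test_rows]
```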
In the examples section you can find different examples of how we have used Victor to solve a variety of statistical problems.