Victor: Scalable Numeric Optimization with an RDBMS

Conditional Random Field

Problem and Data

We demonstrate how to train a Conditional Random Field (CRF) on the CoNLL dataset. The data, formatted for use with Bismarck, can be downloaded from Bismarck Download. The schema of the conll table is as follows:

 Column |   Type    |                      Modifiers                      
--------+-----------+-----------------------------------------------------
 did    | integer   | not null default nextval('conll_did_seq'::regclass)
 uobs   | integer[] | 
 bobs   | integer[] | 
 labels | integer[] |

Python-Based Front-End

An example spec file for this task is given below (also available in the bin folder as crf-spec.py):

verbose = False
model = 'crf'
model_id = 4444
data_table = 'conll'
feature_cols = 'uobs, bobs'
label_col = 'labels'
ndims = 7448606
nulines = 19
nblines = 1
nlabels = 22
stepsize = 0.05
decay = 0.95

The stepsize and decay values were chosen for this dataset by a grid search that minimized the loss value. To start training, run the following command:

python bin/bismarck_front.py bin/crf-spec.py
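
To make the roles of stepsize and decay in the spec file more concrete, here is a minimal sketch. It assumes the step size is multiplied by decay once per epoch (a common geometric schedule); this page does not spell out the exact update rule, so treat that as an assumption. The function name step_schedule is hypothetical and not part of Bismarck.

```python
def step_schedule(stepsize, decay, epochs):
    """Return the step size per epoch, ASSUMING geometric decay.

    Assumption: epoch k uses stepsize * decay**k. The actual rule
    used by Bismarck is not stated on this page.
    """
    return [stepsize * decay ** epoch for epoch in range(epochs)]


# Values taken from the spec file above; 20 epochs is only for illustration.
schedule = step_schedule(0.05, 0.95, 20)
for epoch, step in enumerate(schedule):
    print(f"epoch {epoch:2d}: stepsize = {step:.5f}")
```

Under this assumption, a decay close to 1 shrinks the step slowly, so the grid search is effectively trading off early progress against late-stage stability.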

SQL-Based Front-End

A SQL query for training the CRF model is as follows:

SELECT crf('conll', 4444, 7448606, 22, 19, 1, 20, 0.3, 0.05, 0.95, 't', 't');

The same values are passed here, along with iteration = 20 and mu = 0.3. The column names are implicitly assumed to match those in the schema given above. An alternate SQL query that relies on default values for many of the parameters (see Using Bismarck) is as follows:

SELECT crf('conll', 4444, 7448606, 22, 19, 1);
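
Because the crf call is purely positional, it is easy to swap two numbers by mistake. The sketch below builds the short-form query from named values. The argument order (table, model id, ndims, nlabels, nulines, nblines) is inferred by matching the query above against the spec file, not taken from an API reference, and the helper crf_train_query is hypothetical.

```python
def crf_train_query(data_table, model_id, ndims, nlabels, nulines, nblines):
    """Build the short-form crf() call; other parameters keep their defaults.

    The positional order is INFERRED from the example query on this page.
    """
    # Keep the table name to a plain identifier so it is safe to inline.
    if not data_table.isidentifier():
        raise ValueError(f"unexpected table name: {data_table!r}")
    numbers = [int(model_id), int(ndims), int(nlabels), int(nulines), int(nblines)]
    return "SELECT crf('{}', {});".format(
        data_table, ", ".join(str(n) for n in numbers)
    )


# Same values as in crf-spec.py.
spec = {
    "data_table": "conll",
    "model_id": 4444,
    "ndims": 7448606,
    "nlabels": 22,
    "nulines": 19,
    "nblines": 1,
}
query = crf_train_query(**spec)
print(query)
```

The resulting string can be passed to psql -c or to any PostgreSQL client.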

Model Application

The trained model can be applied for prediction using the crf_pred function:

SELECT crf_init(4444);
CREATE TABLE conll_pred AS SELECT did, crf_pred(4444, uobs, bobs) FROM conll;
SELECT crf_clear(4444);

Data Preparation

Bismarck expects a feature-encoded format for CRF training. A script to convert from the popular CRF++ format is provided in the bin folder as crf_data_prepare.py (note: its memory consumption can be large for complex models):

Usage: python crf_data_prepare.py [crf++ format file] [template file] \
[label output file] [observation output file] [data output for Bismarck]

An example of running this script on the CoNLL dataset:

cd bin
python crf_data_prepare.py /path/to/bismarck_data/conll-crf++-format.txt \
/path/to/bismarck_data/conll-template.txt \
label2id.txt obs2id.txt conll-bismarck.txt

The number of distinct labels and the number of features in the weight vector (the nlabels and ndims values used above) are printed to the screen:

Number of labels: 22
Number of features: 7448606

The Bismarck-format output file can then be imported into the database:

psql -c "DROP TABLE IF EXISTS conll CASCADE;"
psql -c "CREATE TABLE conll (did serial, uobs integer[], bobs integer[], labels integer[]);"
cat conll-bismarck.txt | psql -c "COPY conll (uobs, bobs, labels) FROM STDIN;"

After copying, the conll table should contain 8936 tuples.
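
The row count can also be checked before loading. Since the COPY command above uses PostgreSQL's default text format, each line of the file becomes one row, with columns separated by tabs and arrays written as {...} literals. The sketch below counts rows and does a loose sanity check on that layout. It assumes the file contains exactly the three columns uobs, bobs and labels with no NULL values; the helper count_copy_rows and the sample values are made up for illustration.

```python
import os
import tempfile


def count_copy_rows(path, ncols=3):
    """Count rows in a COPY text-format file and loosely check each row.

    Assumes `ncols` tab-separated fields per line, each a brace-delimited
    array literal. The contents of the arrays are not inspected.
    """
    rows = 0
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != ncols:
                raise ValueError(f"line {lineno}: expected {ncols} fields, got {len(fields)}")
            for field in fields:
                if not (field.startswith("{") and field.endswith("}")):
                    raise ValueError(f"line {lineno}: not an array literal: {field!r}")
            rows += 1
    return rows


# Demo on a tiny made-up file with two rows (values are illustrative only).
sample = "{1,2}\t{3}\t{0,1}\n{4}\t{5}\t{1}\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
demo_rows = count_copy_rows(tmp.name)
os.remove(tmp.name)
print(demo_rows)
```

Calling count_copy_rows("conll-bismarck.txt") on the converted CoNLL file should, if these assumptions hold, return the same 8936 as the table.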

The [label output file] and [observation output file] contain information about the encoding.
