Conditional Random Field
Problem and Data
We demonstrate how to train a Conditional Random Field (CRF) on the CoNLL dataset. The data, formatted for use by Bismarck, can be downloaded from Bismarck Download.
The schema of the conll table is as follows:
 Column |   Type    |                      Modifiers
--------+-----------+-----------------------------------------------------
 did    | integer   | not null default nextval('conll_did_seq'::regclass)
 uobs   | integer[] |
 bobs   | integer[] |
 labels | integer[] |
Python-Based Front-End
An example spec file for this task is given below (also available in the bin folder as crf-spec.py):
verbose = False
model = 'crf'
model_id = 4444
data_table = 'conll'
feature_cols = 'uobs, bobs'
label_col = 'labels'
ndims = 7448606
nulines = 19
nblines = 1
nlabels = 22
stepsize = 0.05
decay = 0.95
The stepsize and decay values were chosen for this dataset by a grid search that minimized the loss value. To invoke training, run the following command:
python bin/bismarck_front.py bin/crf-spec.py
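The grid search over stepsize and decay mentioned above could be organized roughly as in the following sketch. It is not part of Bismarck: grid_search and toy_loss are hypothetical names, the candidate values are made up for illustration, and toy_loss merely stands in for "train with these settings and read off the final loss".

```python
import itertools


def grid_search(loss_fn, stepsizes, decays):
    """Try every (stepsize, decay) pair and return (loss, stepsize, decay) with the lowest loss."""
    best = None
    for stepsize, decay in itertools.product(stepsizes, decays):
        loss = loss_fn(stepsize, decay)
        if best is None or loss < best[0]:
            best = (loss, stepsize, decay)
    return best


def toy_loss(stepsize, decay):
    # Stand-in only: a real search would run a training job with these
    # settings and return the loss it reports.
    return (stepsize - 0.05) ** 2 + (decay - 0.95) ** 2


# Hypothetical candidate grids; the values actually searched are not documented here.
best_loss, best_stepsize, best_decay = grid_search(
    toy_loss,
    stepsizes=[0.01, 0.05, 0.1, 0.5],
    decays=[0.8, 0.9, 0.95, 1.0],
)
print(best_stepsize, best_decay)
```

In practice each evaluation is a full training run, so the grid is usually kept small.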
SQL-Based Front-End
A SQL query for training the CRF model is as follows:
SELECT crf('conll', 4444, 7448606, 22, 19, 1, 20, 0.3, 0.05, 0.95, 't', 't');
The same values as in the spec file are passed here, together with iteration = 20 and mu = 0.3. The column names are implicitly assumed to be the same as in the schema given above. An alternative SQL query that relies on implicit default values for many of the parameters (see Using Bismarck) is as follows:
SELECT crf('conll', 4444, 7448606, 22, 19, 1);
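To make the correspondence between the spec file and the short SQL form explicit, here is a minimal sketch that assembles the six-argument query from spec-style values. The helper build_crf_query is hypothetical, and the argument order (table, model id, ndims, nlabels, nulines, nblines) is inferred only by matching the values in the query above against the spec file.

```python
# Values copied from the example spec file.
spec = {
    'data_table': 'conll',
    'model_id': 4444,
    'ndims': 7448606,
    'nlabels': 22,
    'nulines': 19,
    'nblines': 1,
}


def build_crf_query(spec):
    # Assumed positional order, inferred from the example query:
    # table, model id, ndims, nlabels, nulines, nblines.
    return ("SELECT crf('{data_table}', {model_id}, {ndims}, "
            "{nlabels}, {nulines}, {nblines});").format(**spec)


print(build_crf_query(spec))
```

The printed string is identical to the short query shown above.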
Model Application
The trained model can be applied for prediction using the crf_pred function:
SELECT crf_init(4444);
CREATE TABLE conll_pred AS SELECT did, crf_pred(4444, uobs, bobs) FROM conll;
SELECT crf_clear(4444);
Data Preparation
Bismarck expects a feature-encoded format for CRF training. A script to convert from the popular CRF++ format is provided in the bin folder as crf_data_prepare.py (note: its memory consumption can be large for complex models):
Usage: python crf_data_prepare.py [crf++ format file] [template file] \
    [label output file] [observation output file] [data output for Bismarck]
An example of running this script on the CoNLL dataset:
cd bin
python crf_data_prepare.py /path/to/bismarck_data/conll-crf++-format.txt \
    /path/to/bismarck_data/conll-template.txt \
    label2id.txt obs2id.txt conll-bismarck.txt
The number of distinct labels and the number of features in the weight vector are printed to the screen:
Number of labels: 22
Number of features: 7448606
The Bismarck-format output file can then be used to import the data into the database:
psql -c "DROP TABLE IF EXISTS conll CASCADE;"
psql -c "CREATE TABLE conll (did serial, uobs integer[], bobs integer[], labels integer[]);"
cat conll-bismarck.txt | psql -c "COPY conll (uobs, bobs, labels) FROM STDIN;"
After copying, the number of tuples in table conll should be 8936.
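Before loading, the data file can be sanity-checked with a short script such as the sketch below. It assumes the file is in PostgreSQL's default COPY text format, that is, one row per line with three tab-separated integer array literals (uobs, bobs, labels); the actual layout written by crf_data_prepare.py has not been verified here, and check_copy_lines is a hypothetical helper.

```python
import re

# An integer array literal as PostgreSQL prints it, e.g. {1,2,3} or {}.
INT_ARRAY = re.compile(r'^\{(-?\d+(,-?\d+)*)?\}$')


def check_copy_lines(lines):
    """Count rows, raising ValueError on the first line that is not three integer arrays."""
    count = 0
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 3 or not all(INT_ARRAY.match(f) for f in fields):
            raise ValueError('line %d is not three integer arrays' % lineno)
        count += 1
    return count


# Made-up rows for illustration: uobs, bobs, labels separated by tabs.
sample = ['{1,2,3}\t{4}\t{0,1,0}\n', '{5}\t{}\t{2}\n']
print(check_copy_lines(sample))
```

Passing an open file handle for conll-bismarck.txt instead of sample would give the row count to compare against the expected 8936, provided the format assumption holds.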
The [label output file] and [observation output file] contain information about the encoding.
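As a rough illustration of what such an encoding involves, the sketch below assigns an integer id to each distinct label in CRF++-format text, where every non-blank line is one token whose last column is the label and blank lines separate sentences. The helper build_label_ids and the sample text are made up for this example; the ids and file layout actually produced by crf_data_prepare.py may differ.

```python
def build_label_ids(crfpp_text):
    """Map each label (last column) to an integer id in order of first appearance."""
    label2id = {}
    for line in crfpp_text.splitlines():
        columns = line.split()
        if not columns:
            # A blank line marks a sentence boundary in CRF++ format.
            continue
        label = columns[-1]
        if label not in label2id:
            label2id[label] = len(label2id)
    return label2id


# Illustrative sample only: columns are token, POS tag, label.
sample = """Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP

He PRP B-NP
"""
label2id = build_label_ids(sample)
print(label2id)
print(len(label2id))
```

The number of entries in the mapping plays the role of the "Number of labels" value reported by the script.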