Columbus Usage with Examples
We now explain the syntax for using Columbus operations to perform exploratory feature selection. The operations are invoked from the R console.
Get the Dataset Names and Handle
An analyst can list the available dataset names and get the handle for a dataset as follows:
GetDatasetnames()
[1] "Telecom"
id <- GetDatasetId("Telecom")
GetDatasetnames retrieves the available datasets by querying the database. The analyst can then print the features in the dataset by issuing the following command:
print(GetFeatureNames(GetFeatureIndices(id), dataset.id = id))
Feature set operations:
In the Columbus system, three feature set operations are supported:
- AssignFeatureSet: The analyst can create a set of features from the dataset using AssignFeatureSet. Each feature set is specific to a dataset.
- AddFeatureSet: The analyst can combine two feature sets using AddFeatureSet.
- DelFeatureSet: The analyst can remove a set of features using DelFeatureSet.
feat.set.1 <- AssignFeatureSet(c("DATAVOLUME", "NUMMMSOUT", "NUMVASOUT", "NUMSMSVASINC"), dataset.id = id)
feat.set.2 <- AssignFeatureSet(c("DURATIONFIXEDINC", "NUMSMSCMPINC", "NUMSMSINTEROUT"), dataset.id = id)
feat.set.3 <- AddFeatureSet(feat.set.1, feat.set.2, dataset.id = id)
feat.set.4 <- AssignFeatureSet(c("NUMSMSINTEROUT"), dataset.id = id)
feat.set.5 <- DelFeatureSet(feat.set.3, feat.set.4, dataset.id = id)
The data types of the parameters are given below:
- FeatureSetVector [Quoted string] or [Integer] Denotes the features in the dataset, given either as names or as indices. If integer values are given, the indices are internally mapped to feature names.
- dataset.id [Integer] Dataset identifier.
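To make the semantics of the three operations concrete, here is a minimal base R sketch that treats a feature set as a plain character vector. It does not call Columbus: `all.features` and `as.feature.names` are hypothetical stand-ins, and the assumption that AddFeatureSet behaves like a set union and DelFeatureSet like a set difference is an interpretation of the descriptions above, not a statement about the implementation.

```r
# Hypothetical stand-in for the feature names of a dataset (not read from Columbus).
all.features <- c("DATAVOLUME", "NUMMMSOUT", "NUMVASOUT", "NUMSMSVASINC",
                  "DURATIONFIXEDINC", "NUMSMSCMPINC", "NUMSMSINTEROUT")

# Map integer indices to names; names pass through unchanged.
# This mirrors the documented index-to-name mapping only in spirit.
as.feature.names <- function(x, features = all.features) {
  if (is.numeric(x)) features[x] else as.character(x)
}

set.1 <- as.feature.names(c(1, 2, 3, 4))
set.2 <- as.feature.names(c("DURATIONFIXEDINC", "NUMSMSCMPINC", "NUMSMSINTEROUT"))

set.3 <- union(set.1, set.2)               # assumed analogue of AddFeatureSet
set.5 <- setdiff(set.3, "NUMSMSINTEROUT")  # assumed analogue of DelFeatureSet
print(set.5)
```

Under this reading, adding a feature that is already present leaves the set unchanged, and deleting a feature that is absent is a no-op.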
Descriptive Statistic Operation
In the Columbus system, we support the following descriptive statistic operations:
- CorrelationX: Given a feature set and a dataset id, the function computes the pairwise correlation among the features in the dataset.
- CorrelationY: The function computes the correlation of each feature with the target. Note that the target is implicit from the dataset.id given.
- CoeffLearner: The function learns the coefficients for the given feature set and dataset. The function is generic, and we currently support two learning methods: Incremental Gradient Descent (IGD) and Conjugate Gradient (CG). The configuration parameters for the learning methods are exposed to the user, who can specify appropriate values.
corrx.val <- CorrelationX(feat.set.1, dataset.id = id)
corry.val <- CorrelationY(feat.set.2, dataset.id = id)
igd.coef.learn <- CoeffLearner(feat.set.1, type = "igd", num.iters = 5, step.size = 0.01, decay = 1, init.wt = 0)
cg.coef.learn <- CoeffLearner(feat.set.2, type = "cg", num.iters = 5, init.wt = 0)
The data types of the parameters are given below:
- type : [Quoted string] Identifies the type of coefficient learner. Allowed string constants: "igd", "cg"
- num.iters : [Integer] Denotes the number of iterations the coefficient learner should run.
- step.size : [Float] Denotes the learning rate in IGD.
- decay : [Float] Denotes the decay value in IGD.
- init.wt : [Float] Initial weight to be assigned.
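The following base R sketch shows what these operations compute, on synthetic data rather than through Columbus. The correlations use the standard `cor` function. The learner `igd.learn` is a hypothetical illustration of incremental gradient descent for least squares; in particular, treating `decay` as a factor that multiplies the step size after each pass, and `init.wt` as the starting value for every coefficient, are assumptions, since the text above does not define them precisely.

```r
set.seed(42)
n <- 200
# Synthetic data: two features and a target with known coefficients (2 and -1).
X <- cbind(f1 = rnorm(n), f2 = rnorm(n))
y <- 2 * X[, "f1"] - 1 * X[, "f2"] + rnorm(n, sd = 0.1)

corr.x <- cor(X)          # pairwise correlation among features (cf. CorrelationX)
corr.y <- cor(X, y)[, 1]  # correlation of each feature with the target (cf. CorrelationY)

# Incremental gradient descent for least squares: one weight update per row.
igd.learn <- function(X, y, num.iters = 5, step.size = 0.01, decay = 1, init.wt = 0) {
  w <- rep(init.wt, ncol(X))
  names(w) <- colnames(X)
  for (iter in seq_len(num.iters)) {
    for (i in seq_len(nrow(X))) {
      err <- sum(X[i, ] * w) - y[i]      # prediction error on row i
      w <- w - step.size * err * X[i, ]  # gradient step for squared error
    }
    step.size <- step.size * decay       # assumption: decay = 1 keeps the step constant
  }
  w
}

w <- igd.learn(X, y, num.iters = 20, step.size = 0.01)
print(round(w, 2))
```

With enough passes the learned weights should approach the coefficients used to generate the data; too large a step.size makes the updates diverge, which is why the parameter is exposed.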
Evaluate Operation
In the Columbus system, evaluating a feature set involves two phases: train and test. The train phase learns coefficients for the features, and the test phase scores them using either cross-validation or the Akaike Information Criterion (AIC). An example usage of the evaluate operation is given below:
cv.eval <- Evaluate(feat.set.1, train.type = "igd", eval.type = "cv", num.iters = 5, step.size = 0.01, decay = 1, init.wt = 0, num.folds = 5)
aic.eval <- Evaluate(feat.set.1, train.type = "cg", eval.type = "aic", num.iters = 5, init.wt = 0)
Additional parameters for the evaluate operation include:
- eval.type : [Quoted string] Denotes the evaluation type to be used. Valid string constants are "cv" and "aic"
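As a rough illustration of the two evaluation types, the sketch below scores feature sets on synthetic data in base R. It is not the Columbus implementation: it trains with ordinary least squares via `lm` instead of IGD or CG, and the helper names `cv.mse` and `aic.score` are hypothetical.

```r
set.seed(1)
n <- 100
dat <- data.frame(f1 = rnorm(n), f2 = rnorm(n))
dat$y <- 1.5 * dat$f1 + rnorm(n, sd = 0.5)  # f2 is pure noise by construction

# k-fold cross-validation: train on k-1 folds, measure squared error on the held-out fold.
cv.mse <- function(features, data, num.folds = 5) {
  fold <- sample(rep(seq_len(num.folds), length.out = nrow(data)))
  form <- reformulate(features, response = "y")
  errs <- sapply(seq_len(num.folds), function(k) {
    fit <- lm(form, data = data[fold != k, ])
    pred <- predict(fit, newdata = data[fold == k, ])
    mean((data$y[fold == k] - pred)^2)
  })
  mean(errs)
}

# AIC: fit once on all rows; a lower score indicates a better trade-off
# between goodness of fit and number of parameters.
aic.score <- function(features, data) {
  AIC(lm(reformulate(features, response = "y"), data = data))
}

cv.f1  <- cv.mse("f1", dat, num.folds = 5)
cv.f2  <- cv.mse("f2", dat, num.folds = 5)
aic.f1 <- aic.score("f1", dat)
aic.f2 <- aic.score("f2", dat)
```

Both scores are "lower is better" here, so the informative feature f1 should score below the noise feature f2 under either evaluation type.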
Explore Operation
In the Columbus system, given a feature set, the explore operation can be used to add one feature to it from the available features or to delete one feature from it. It involves evaluating a group of candidate feature sets and choosing the best one. An example usage of the explore operation is given below:
add.feat.set <- StepAdd(inp.set = feat.set.1, mask.set = feat.set.2, train.type = "igd", eval.type = "cv", num.iters = 5, step.size = 0.01, decay = 1, init.wt = 0, num.folds = 5)
del.feat.set <- StepDel(inp.set = feat.set.1, train.type = "cg", eval.type = "aic", num.iters = 5, init.wt = 0)
Additional parameters for the explore operation include:
- mask.set : [Feature set] Denotes the list of features that should be omitted while adding a new feature.
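The explore step can be sketched in base R as a single forward or backward move over candidate feature sets. The helpers `score`, `step.add`, and `step.del` below are hypothetical and score candidates by the AIC of an `lm` fit on synthetic data; they illustrate the idea of evaluating a group of feature sets and keeping the best, and are not the Columbus StepAdd or StepDel code.

```r
set.seed(7)
n <- 150
dat <- data.frame(f1 = rnorm(n), f2 = rnorm(n), f3 = rnorm(n), f4 = rnorm(n))
dat$y <- dat$f1 + 2 * dat$f2 + rnorm(n, sd = 0.3)  # f3 and f4 are noise

# Score a feature set; lower is better.
score <- function(features, data) {
  AIC(lm(reformulate(features, response = "y"), data = data))
}

# Forward step: try every feature outside inp.set and mask.set,
# and keep the single addition with the best score.
step.add <- function(inp.set, mask.set = character(0), data, all.features) {
  candidates <- setdiff(all.features, union(inp.set, mask.set))
  if (length(candidates) == 0) return(inp.set)
  scores <- sapply(candidates, function(f) score(c(inp.set, f), data))
  c(inp.set, candidates[which.min(scores)])
}

# Backward step: try dropping each feature and keep the best-scoring remainder.
step.del <- function(inp.set, data) {
  if (length(inp.set) <= 1) return(inp.set)
  scores <- sapply(inp.set, function(f) score(setdiff(inp.set, f), data))
  setdiff(inp.set, inp.set[which.min(scores)])
}

feats   <- c("f1", "f2", "f3", "f4")
added   <- step.add("f1", data = dat, all.features = feats)
masked  <- step.add("f1", mask.set = "f2", data = dat, all.features = feats)
dropped <- step.del(c("f1", "f2", "f3"), data = dat)
```

In this sketch a step always adds or removes exactly one feature, even if the score does not improve; whether Columbus behaves the same way is not stated in this section.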
fs1 <- AssignFeatureSet(c("DATAVOLUME", "NUMMMSOUT", "NUMVASOUT", "NUMCALLSFIXEDOUT", "DURATIONFIXEDINC", "NUMSMSINTEROUT", "NUMSMSCMPINC", "NUMSMSVASINC"), dataset.id = id)
fm1 <- CorrelationX(fs1, dataset.id = id)
fs2 <- AssignFeatureSet(c("NUMCALLSFIXEDOUT"), dataset.id = id)
fs3 <- DelFeatureSet(fs1, fs2, dataset.id = id)
fm2 <- CoeffLearner(fs3, "igd", num.iters = 2, dataset.id = id)
fs4 <- BestK(fm2, 6, dataset.id = id)
fm6 <- Evaluate(fs4, "cg", "cv", num.iters = 2, num.folds = 3, dataset.id = id) # change to 5 folds
fs5 <- StepDel(fs4, "cg", "aic", dataset.id = id)
The analyst can choose to save the program as shown below:
SaveSession("ColumbusProgram1", dataset.id = id)
She can then run the same program in batch mode, on the same or a different dataset, as shown below:
ExecuteProgram("ColumbusProgram1", dataset.id = id)
For more detailed examples, please refer to the Examples Page.