ImplementationNotes
The main abstractions in the code are:
- Engine
- Algorithm
- ModelApplication
- CorpusRepresentation
- FeatureSpecification and related
- Exporter
- CorpusExporter
For training, the following are needed:
- The feature specification (see the example after this list) and the data directory
- Instance annotations
- If the algorithm is a sequence tagging algorithm, the sequence annotation
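For illustration, a minimal feature specification in the XML format the plugin documentation describes; treat the exact element names as indicative rather than authoritative:

```xml
<ML-CONFIG>
  <!-- one feature: the "string" feature of "Token" annotations, nominal-valued -->
  <ATTRIBUTE>
    <TYPE>Token</TYPE>
    <FEATURE>string</FEATURE>
    <DATATYPE>nominal</DATATYPE>
  </ATTRIBUTE>
</ML-CONFIG>
```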
Main steps for training (a code sketch follows this list):
- if we have a known sequence tagging algorithm (currently Mallet seq only), check that the SequenceSpan type is specified; otherwise, check that it is not specified
- Read the feature specification
- create the engine for the selected algorithm using `Engine.createEngine(algorithm, parms, featureInfo, targetType, dataDirectory)`
- get the corpus representation from the engine
- for each document
  - (add the internal class feature)
  - send all instance (and sequence) annotations to the corpus representation
    - TODO: this will change so that the annotations get sent to the engine instead!
- finish processing of the data (call the corpus representation's finish method), e.g. for any re-scaling
  - TODO: put this on the engine
- gather the information to be saved in the info file (which is part of the engine)
- call `engine.saveEngine(datadir)` (TODO: the engine already knows its directory, so this could be made parameter-less)
- call the `engine.trainModel` method
  - for some engines, this means that the in-memory representation will first get exported in order to run an external command for training
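A compact sketch of this training flow; only `Engine.createEngine`, `saveEngine` and `trainModel` appear in the notes above, so every other member name here paraphrases the prose and is an assumption, not the plugin's verified API:

```java
import java.io.File;
import gate.Document;

// Hypothetical driver; Engine and CorpusRepresentation are the plugin classes,
// but getCorpusRepresentation/addDocument/finish/trainModel(parms) are assumed names.
void trainOverCorpus(String algorithm, String parms, Object featureInfo,
    String targetType, File dataDirectory, Iterable<Document> documents) {
  Engine engine = Engine.createEngine(algorithm, parms, featureInfo,
      targetType, dataDirectory);
  CorpusRepresentation cr = engine.getCorpusRepresentation(); // assumed accessor
  for (Document doc : documents) {
    // add the internal class feature, then send the instance (and sequence)
    // annotations of this document to the corpus representation
    cr.addDocument(doc); // hypothetical method
  }
  cr.finish();                      // corpus-wide post-processing, e.g. re-scaling
  engine.saveEngine(dataDirectory); // also writes the info file
  engine.trainModel(parms);         // may export the data and run an external command
}
```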
For sequence tagging training and regression, the main steps are mostly the same.
Problems/TODO:
- instances are always stored in memory, which is not feasible for very large corpora
  - it should be possible to instead write instances out to a file immediately
  - the finish method would then have to re-read and re-write that file somehow
  - the train method would then have to re-read that file one or more times
- How do we decide when to use out-of-core and when to use in-memory processing? For all external algorithms we always need to export anyway, so we could always use out-of-core from the start. For non-external algorithms, out-of-core is usually not useful.
- To export for e.g. Weka, we would need to know the header of the ARFF file first, which is not possible in a single pass. For this, we need to export the data to a temporary file, then write the header, then append the data to the header (unless Weka supports some other format where the header/metadata can be separate from the data); see the sketch after this list.
- If we always separate exporting from training, even with internal algorithms, we may get a cleaner implementation for experimenting.
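A minimal, self-contained sketch of such a two-pass ARFF export (not the plugin's actual exporter; file names and toy instances are made up). Sparse ARFF rows can be written while instances stream in, because feature indices are assigned on first sight and never change; the header is written once all attributes are known, then the buffered rows are appended:

```java
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoPassArffExport {
  public static void main(String[] args) throws Exception {
    // Toy stand-in for instances streamed out of memory: name -> value maps.
    List<Map<String, Double>> instances = List.of(
        Map.of("len", 3.0, "upper", 1.0),
        Map.of("len", 7.0, "digits", 2.0));

    // Pass 1: stream sparse data rows to a temporary file. Indices are
    // assigned on first sight and never change, so rows written early stay
    // valid while the attribute set keeps growing.
    Map<String, Integer> alphabet = new LinkedHashMap<>();
    Path dataTmp = Files.createTempFile("lf", ".data");
    try (BufferedWriter w = Files.newBufferedWriter(dataTmp)) {
      for (Map<String, Double> inst : instances) {
        TreeMap<Integer, Double> row = new TreeMap<>(); // sparse ARFF wants ascending indices
        for (Map.Entry<String, Double> e : inst.entrySet()) {
          row.put(alphabet.computeIfAbsent(e.getKey(), k -> alphabet.size()), e.getValue());
        }
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<Integer, Double> e : row.entrySet()) {
          if (sb.length() > 1) sb.append(", ");
          sb.append(e.getKey()).append(' ').append(e.getValue());
        }
        w.write(sb.append('}').toString());
        w.newLine();
      }
    }

    // Pass 2: only now can the header be written; then append the buffered rows.
    try (BufferedWriter out = Files.newBufferedWriter(Paths.get("corpus.arff"))) {
      out.write("@relation corpus\n");
      for (String attr : alphabet.keySet()) {
        out.write("@attribute " + attr + " numeric\n");
      }
      out.write("@data\n");
      for (String line : Files.readAllLines(dataTmp)) {
        out.write(line);
        out.newLine();
      }
    }
    Files.delete(dataTmp);
  }
}
```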
Main steps for application (a code sketch follows this list):
- (re)create the engine from the saved model files. As part of this, also recreate the corpus representation (in most cases a Mallet corpus representation which includes our own subclass of Pipe; this allows us to preserve everything we need to convert annotations to features/attributes)
- for each document, call `engine.classify` and pass on the instances, sequences etc. This creates a sequence of classification objects which are used to actually modify the document (either create new annotations or put the class on the existing instance annotations)
  - for some engines, `engine.classify` really sends a representation of the instances to a process or server and gets the classifications back from there
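A corresponding sketch of the application flow; the loader name and the `classify` signature are assumptions, not the verified API:

```java
import java.io.File;
import java.util.List;
import gate.AnnotationSet;
import gate.Document;

// Hypothetical driver; loadEngine and the classify(...) signature are assumed names.
void applyOverCorpus(File dataDirectory, Iterable<Document> documents) {
  // recreating the engine also recreates the corpus representation,
  // including the Pipe subclass that maps annotations to features
  Engine engine = Engine.loadEngine(dataDirectory); // assumed loader name
  for (Document doc : documents) {
    AnnotationSet instances = doc.getAnnotations().get("Token"); // example instance type
    List<ModelApplication> results = engine.classify(instances, doc); // assumed signature
    // use the classification objects to create new annotations, or to set
    // the class feature on the existing instance annotations
  }
}
```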
For classification and regression, the independent features are implemented as a Mallet FeatureVector object. Attribute names as generated by the FeatureExtraction class are mapped to indices in the feature vector using the data alphabet of the pipe.
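For example, looking up the feature-vector index of a generated attribute name (the name format shown is invented for illustration):

```java
// Second argument 'false': do not add the name to the alphabet if it is
// unknown; lookupIndex then returns -1 for unseen attribute names.
int index = pipe.getDataAlphabet().lookupIndex("Token.string=house", false);
```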
FeatureVector instances always use a sparse, non-binary representation. This means that zero values are not actually stored in the instance; instead, the instance keeps track of how many locations are actually used and maps location numbers to indices.
To get all the non-zero values of a feature vector and their indices (sparse representation):

```java
FeatureVector fv = (FeatureVector) instance.getData();
for (int loc = 0; loc < fv.numLocations(); loc++) {
  int index = fv.indexAtLocation(loc);    // dimension index stored at this location
  double value = fv.valueAtLocation(loc); // its (non-zero) value
}
```
To get all values of the vector:

```java
int nrFeatures = pipe.getDataAlphabet().size();
FeatureVector fv = (FeatureVector) instance.getData();
for (int index = 0; index < nrFeatures; index++) {
  double value = fv.value(index); // 0.0 for non-stored (zero) dimensions
}
```
Notes (demonstrated by the snippet after this list):
- Sparse `FeatureVector` objects do not know the "true" size of the sparse vector.
- `FeatureVector.location(index)` returns the location at which the index-th dimension is stored, or -1 for zero (non-stored) dimensions.
- `FeatureVector.value(index)` returns the value at that index, or 0.0 for any non-stored location (irrespective of the true size).
- `FeatureVector.valueAtLocation(location)` returns the value at that location, or throws an exception if the location does not exist.
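A self-contained snippet demonstrating these behaviours (feature names and values are made up):

```java
import cc.mallet.types.Alphabet;
import cc.mallet.types.FeatureVector;

Alphabet dict = new Alphabet();
int i0 = dict.lookupIndex("f0"); // index 0
int i1 = dict.lookupIndex("f1"); // index 1
int i2 = dict.lookupIndex("f2"); // index 2
// store only f0=1.5 and f2=3.0; f1 is implicitly 0.0
FeatureVector fv = new FeatureVector(dict, new int[]{i0, i2}, new double[]{1.5, 3.0});
fv.numLocations();     // 2: only the stored entries count
fv.location(i2);       // 1: f2 is stored at location 1
fv.location(i1);       // -1: f1 is not stored
fv.value(i1);          // 0.0: non-stored dimensions read as zero
fv.valueAtLocation(1); // 3.0: value stored at location 1
fv.valueAtLocation(5); // throws: location 5 does not exist
```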
We distinguish two tasks, classification and regression: for classification, the target alphabet will be an instance of LabelAlphabet; for regression, it will be null.
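So the task can be determined from the pipe, e.g.:

```java
// true for classification; false for regression, where getTargetAlphabet()
// returns null (instanceof on null is false)
boolean isClassification = pipe.getTargetAlphabet() instanceof LabelAlphabet;
```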
The target of each instance is:
- a String for ordinary classification
- an instance of NominalTargetWithCosts for classification where we have a cost vector for each instance
- a Double for regression
For classification, to get the actual String label of an instance:

```java
LabelAlphabet la = (LabelAlphabet) pipe.getTargetAlphabet();
Object target = instance.getTarget();
Label l = la.lookupLabel(target);
// For ordinary classification, the entry of the label is a String:
String targetString = (String) l.getEntry();
// For classification with per-instance cost vectors, the entry of the label
// is a NominalTargetWithCosts instance:
NominalTargetWithCosts ntwc = (NominalTargetWithCosts) l.getEntry();
String classLabel = ntwc.getClassLabel();
double[] costs = ntwc.getCosts();
```
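For regression there is no label alphabet; a minimal sketch:

```java
// For regression, the target is simply a Double:
double targetValue = (Double) instance.getTarget();
```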