Classification Service
The classification service can classify any configured attribute, as long as the attribute has domain enum values. Classification is applied at the system level, meaning either all files are considered for classification or none are. The only differentiator is the language of a document. Full text is required for the classification service to work.
The classification service receives a message from the AI router whenever the text or language of a document is updated. If a model has been trained for the given language and set to active, the service classifies the document and sends the document type and topics back to be saved in the datastore. If the relevant attributes were already set during file upload, their values are not overwritten. If no model has been trained for the given language, or no model is set to active, the classification service cannot classify the document.
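The exact message format and service API are internal to ProcessMaker IDP; the following Python sketch, with hypothetical field and function names, only illustrates the decision flow described above.

```python
# Hypothetical sketch of the flow above; the message fields, "classify" and the
# attribute names are illustrative, not the actual ProcessMaker IDP API.
def handle_ai_router_message(document, active_models):
    """Classify a document when an active model exists for its language."""
    model = active_models.get(document["language"])
    if model is None:
        # No trained or active model for this language: nothing to classify.
        return None

    # Full text is required for classification.
    prediction = model.classify(document["full_text"])  # e.g. {"documentType": ..., "topics": ...}

    # Values already provided during file upload take precedence and are kept.
    updates = {
        attribute: value
        for attribute, value in prediction.items()
        if not document["attributes"].get(attribute)
    }
    return updates  # sent back to be saved in the datastore
```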
Classification Score
The classification score is a number between 0 and 1 that represents the confidence the classifier has in its prediction. Depending on the model used, the classification score may not be available; in that case null/none is returned instead of a score. For all models trained within ProcessMaker IDP the score is available. A low score represents a lack of confidence in the classification. Tests on a set of 1200 documents showed that 6 out of 9 misclassified items had the lowest scores, at around 55% confidence. The gap to the lowest-scoring correctly classified item was relatively small, so adding a threshold based on this score is quite sensitive. Reviewing classifications based on this score should, however, help in catching most misclassified items.
Although most classifiers can provide some measure of confidence, it is generally not on a linear scale. The confidence is therefore calibrated. Calibration can be considered an extra step during training that makes classification scores more accurate. As with classification itself, the score is likely to become more stable and accurate when more training data is available.
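ProcessMaker IDP does not document which model or calibration method it uses internally, so the following scikit-learn sketch is only an analogy: a classifier without native probabilities (LinearSVC) is wrapped in a sigmoid (Platt) calibration step so that it can return confidence scores between 0 and 1.

```python
# Analogy only: calibrating a text classifier with scikit-learn so it returns
# confidence scores; this is not the ProcessMaker IDP training pipeline.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny illustrative corpus: two document types, four samples each.
texts = [
    "invoice total amount due payment",
    "invoice number and billing address",
    "please pay this invoice by the due date",
    "invoice subtotal tax and total",
    "employment contract between employer and employee",
    "this contract is governed by the following terms",
    "contract start date and notice period",
    "signatures of both parties to the contract",
]
labels = ["Invoice"] * 4 + ["Contract"] * 4

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# LinearSVC outputs uncalibrated decision values; the calibration wrapper adds
# an extra step during training so predict_proba yields usable confidences.
model = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=2)
model.fit(X, labels)

scores = model.predict_proba(vectorizer.transform(["invoice amount due"]))
print(dict(zip(model.classes_, scores[0])))
```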
Classification example
Consider a very small training set of 2 documents per language (nl, en) and document type (A, B, C, D). This returns the following classification scores:
File 1 en/TypeA Score 0.8306169
File 2 en/TypeA Score 0.81266636
File 3 en/TypeB Score 0.83028245
File 4 en/TypeB Score 0.78634757
File 5 nl/TypeC Score 0.7820717
File 6 nl/TypeC Score 0.74933964
File 7 nl/TypeD Score 0.71249354
File 8 nl/TypeD Score 0.8007787
Now, extending the training set to 20 documents each returns the following classification scores for the same file uploads as in the previous example:
File 1 en/TypeA Score 0.9484285
File 2 en/TypeA Score 0.97549295
File 3 en/TypeB Score 0.92118645
File 4 en/TypeB Score 0.9167199
File 5 nl/TypeC Score 0.9011106
File 6 nl/TypeC Score 0.7992359
File 7 nl/TypeD Score 0.96265835
File 8 nl/TypeD Score 0.92745423
As can be seen, the score increased by roughly 0.1 or more for every uploaded document except File 6, so the classification score is heavily dependent on its training set. To prevent the classifier from updating the value in the attribute when the classification score is lower than a particular value, a classification threshold is used. The classification threshold is set at the model level and its default value is 0.5, meaning that if the classification score is lower than 0.5, the predicted value is disregarded by the system.
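A minimal sketch of how such a threshold gates a prediction is shown below; only the 0.5 default comes from the product, the function and its usage are illustrative.

```python
DEFAULT_CLASSIFICATION_THRESHOLD = 0.5  # product default, configurable per model

def apply_threshold(label, score, threshold=DEFAULT_CLASSIFICATION_THRESHOLD):
    """Return the predicted label only when the score clears the threshold."""
    return label if score >= threshold else None

# With the small-training-set scores above, every file clears the default of
# 0.5, but a stricter threshold of 0.8 would discard Files 4 through 7.
apply_threshold("TypeA", 0.8306169)                  # -> "TypeA"
apply_threshold("TypeB", 0.78634757, threshold=0.8)  # -> None (disregarded)
```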
Training
Before a classifier can be trained within ProcessMaker IDP, a Classification configuration has to be created under IDP Admin - Classification - Configurations. After the Classification configuration is saved, ProcessMaker IDP generates the attributes that will contain the classification scores.
All attributes used in Classification configurations are shown in the file upload modal, so an authorized user (trainer role) can provide values to be used in training. During training, only documents that have the attribute USE_FOR_TRAINING set to true are considered. Documents that do not have the USE_FOR_TRAINING attribute set to true are not visible on the training dashboard and are therefore ignored during training.
Models are trained per language, so it is possible to first train an English model and later a Dutch model. The trainer can create multiple variants of the same model, but only one model per language can be set to 'active'. Models can be removed via the IDP Admin - Classification - Models page. The system immediately stops classifying documents of that language when the model is removed or deactivated; this applies to all trained models.

The set of labels is part of the model. If a model is trained for [documentType1, documentType2] and later a model is trained for [documentType2, documentType3], the resulting model will only be able to output documentType2 or documentType3. If the classifier needs to identify documentType1, documentType2, or documentType3, all three document types must be present in the same training request. The minimum number of unique labels to train on is two, and the training set must have at least five documents per label to successfully start a training.
To initiate a training, first select the Classification configuration and language. If a language is missing from the list, the training set does not contain documents in that language. Then validate the training set (a minimal sketch of these checks follows the checklist below):
Are all labels present?
Does every label have at least 5 documents?
Is the training set balanced? (roughly same number of documents per label)
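The following Python sketch illustrates these checks; the attribute and field names, and the 3:1 balance heuristic, are illustrative assumptions rather than the actual ProcessMaker IDP data model.

```python
from collections import Counter

MIN_LABELS = 2           # at least two unique labels are required
MIN_DOCS_PER_LABEL = 5   # at least five documents per label are required

def validate_training_set(documents, expected_labels, language):
    """Run the checklist above for one language before starting a training."""
    # Only documents flagged for training in the requested language count.
    eligible = [
        d for d in documents
        if d.get("USE_FOR_TRAINING") and d.get("language") == language
    ]
    counts = Counter(d["label"] for d in eligible)

    problems = []
    missing = set(expected_labels) - set(counts)
    if missing:
        problems.append(f"labels without documents: {sorted(missing)}")
    if len(counts) < MIN_LABELS:
        problems.append("fewer than two unique labels in the training set")
    too_small = sorted(l for l, n in counts.items() if n < MIN_DOCS_PER_LABEL)
    if too_small:
        problems.append(f"fewer than {MIN_DOCS_PER_LABEL} documents for: {too_small}")
    # Rough balance heuristic (assumption): flag a set where the largest label
    # has more than three times as many documents as the smallest one.
    if counts and max(counts.values()) > 3 * min(counts.values()):
        problems.append("training set looks unbalanced")
    return problems  # an empty list means the set looks ready for training
```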
Start the training by clicking the Train button. The duration varies; processing large datasets (thousands of documents with multiple pages of text) can take more than one hour.
Classification Models
The Models page under IDP Admin - Classification provides the user with information about the status of a training and the metrics that accompany it. From a maintenance perspective, the Models page offers options to activate and deactivate models, as well as to remove them.
In particular, from the Models dashboard the user can see the following information:
Entity Reference: represents the entity used for the training. For now it can only be FILE.
Attribute Reference: this is the attribute used as a label.
Language: as the model is trained per language, here the user sees which language was used.
Status: indicates the step the training process is at. There are four available options: Requested, Started, Finished, Failed.
Active: indicates whether the model is used in production. The default value is false. The user can change this value in order to use the model for classification. When a model is set to active=true, the previously active model is automatically deactivated.
Classification Threshold: this is a float number between 0 and 1 with default value 0.5. Classification threshold is used during prediction and restricts the service to returning only predictions whose classification score is higher than its value.
Average F1 score: the harmonic mean of precision and recall, averaged over all classes. Indicates how well the model predicts all the classes in total. Ranges between 0 and 1, with 1 being perfect and 0 the worst performance (a sketch of how these metrics can be computed follows this list).
Average Kappa score: Cohen's Kappa is a stricter metric that measures the level of agreement between predicted labels and true labels. Ranges between 0 and 1, with 1 being perfect and 0 the worst performance.
Classification report: precision, recall, and F1-score per label can be found in this attribute's value.
Confusion matrix: a table showing how the predicted labels are distributed for each true label across all labels in the dataset.
Dataset distribution: the distribution of labels used in the training and testing of the model.
Created at: datetime when the training task was created.
Created by: user who created the training task.
Modified at: datetime when the training task was last modified.
Modified by: user who modified last the training task.
Id: unique Id of the training task.
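As an illustration of the metrics listed above, the following scikit-learn snippet computes them for a toy set of true and predicted labels; using macro averaging for the F1 score is an assumption, not a documented detail of ProcessMaker IDP.

```python
from sklearn.metrics import (
    classification_report,
    cohen_kappa_score,
    confusion_matrix,
    f1_score,
)

# Toy evaluation data: true labels from the test split vs. model predictions.
y_true = ["TypeA", "TypeA", "TypeB", "TypeB", "TypeC", "TypeC"]
y_pred = ["TypeA", "TypeB", "TypeB", "TypeB", "TypeC", "TypeC"]

print(f1_score(y_true, y_pred, average="macro"))  # average F1 score
print(cohen_kappa_score(y_true, y_pred))          # average Kappa score
print(classification_report(y_true, y_pred))      # precision/recall/F1 per label
print(confusion_matrix(y_true, y_pred,            # predicted vs. true label counts
                       labels=["TypeA", "TypeB", "TypeC"]))
```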
A user with sufficient access rights can delete entries from the Models page. This action removes the entry from ProcessMaker IDP and removes the trained model. However, if a model is deleted before its training has completed, the model's binaries will remain in the classification service pod. It is therefore recommended to delete a model only after its training has completed.