Named Entity Recognition

The Named Entity Recognition (NER) module is used to recognize entities in PDF and text files. Recognized named entities can be handled in the following ways:

Anonymization: Removes the named entities from the text.
Pseudonymization: Replaces the named entities with a fixed text.
Redaction: Masks the named entities, making them unreadable.
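
The three handling modes can be illustrated on plain text with a minimal sketch. The `Entity` class, the `handle_entities` function, and the `X` mask character are hypothetical names chosen for illustration and are not part of the product. In the product, redaction applies to PDF and image renditions, so the masking shown here is only an analogy.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    start: int  # inclusive character offset in the text
    end: int    # exclusive character offset in the text
    label: str  # entity type, e.g. "Person"


def handle_entities(text, entities, mode, replacement="[REMOVED]"):
    """Apply one handling mode to every entity span (hypothetical helper)."""
    parts = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        parts.append(text[cursor:ent.start])  # keep the text before the entity
        if mode == "anonymize":
            pass  # drop the entity text entirely
        elif mode == "pseudonymize":
            parts.append(replacement)  # fixed replacement text
        elif mode == "redact":
            parts.append("X" * (ent.end - ent.start))  # mask, keeping the length
        else:
            raise ValueError(f"unknown mode: {mode}")
        cursor = ent.end
    parts.append(text[cursor:])  # keep the text after the last entity
    return "".join(parts)


text = "John lives in Paris."
found = [Entity(0, 4, "Person"), Entity(14, 19, "Location")]
print(handle_entities(text, found, "pseudonymize", "[ENTITY]"))
```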

Text Replacement: This feature only works for plain text files and the Full Text attribute. PDF support for text replacement will be available in a future version.
Redaction: Currently, redaction only works for PDF and image files (JPG, PNG, and TIFF).
Annotations: Annotations can be created to visually indicate the recognized named entities in PDF files. This feature helps users quickly identify and manage sensitive information within documents.

Recognized Entity Types

Entity Types Recognized by Pretrained Models (Release 2.3)

Language: Entity Types

Afrikaans: Person, Location, Organization, Miscellaneous
Arabic: Person, Location, Organization, Miscellaneous
Bulgarian: Event, Location, Organization, Person, Product
Chinese: Event, Facility, GPE, Language, Law, Location, Miscellaneous, Money, NORP, Ordinal, Organization, Percent, Person, Product, Quantity, Time, Work of Art
Dutch: Person, Location, Organization, Miscellaneous
English (simple): Person, Location, Organization, Miscellaneous

English (extended):
Cardinal (Numerals that do not fall under another type)
Date (Absolute or relative dates or periods)
Event (Named hurricanes, battles, wars, sports events, etc.)
Facility (Buildings, airports, highways, bridges, etc.)
GPE (Countries, cities, states)
Language (Any named language)
Law (Named documents made into laws)
Location (Non-GPE locations, mountain ranges, bodies of water)
Money (Monetary values, including unit)
NORP (Nationalities or religious or political groups)
Ordinal ("first", "second")
Organization (Companies, agencies, institutions, etc.)
Percentages (Percentage including "%")
Person (People, including fictional)
Product (Vehicles, weapons, foods, etc. (Not services))
Quantity (Measurements, as of weight or distance)
Time (Times smaller than a day)
Work of Art (Titles of books, songs, etc.)

Finnish: Person, Location, Organization, Miscellaneous
French: Person, Location, Organization, Miscellaneous
German: Person, Location, Organization, Miscellaneous
Italian: Location, Organization, Person
Hungarian: Person, Location, Organization, Miscellaneous
Myanmar: Location, Miscellaneous, Number, Organization, Person, Race, Time
Russian: Person, Location, Organization, Miscellaneous
Spanish: Person, Location, Organization, Miscellaneous
Ukrainian: Person, Location, Organization, Miscellaneous
Vietnamese: Person, Location, Organization, Miscellaneous

Configuration

Admins can update NER configurations through the ProcessMaker IDP Admin interface.

In the NER configuration the admin can populate the following fields:

Name: Name of the configuration
JSON found entities attribute (optional): The attribute that is populated with a JSON array of the found entities.
Condition: Enables or disables this NER configuration for files with the configured language model. The condition acts as a filter: the administrator provides values that are matched against any of a file's metadata fields. When multiple conditions match a single file, a warning is written to the log and the file is not processed.
Redaction output format: Possible values are 'Same format as original file' and 'PDF'.
- 'Same format as original file' creates a rendition with the same media type, so a JPG file results in a JPG rendition.
- 'PDF' converts the rendition to PDF.
Overwrite Full Text (optional): For anonymization or pseudonymization, the Full Text attribute can be updated by removing the found entities or by replacing them with the configured replacement text.
Language Model: Specifies which language model should be used to recognize entities.
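
The Condition behavior described above (filter on metadata, warn and skip when more than one configuration matches) can be sketched as follows. This is only an illustration under assumptions: the condition is modeled as a dictionary of metadata field names to expected values, and `matches`, `select_configuration`, and the field names are hypothetical. The actual condition syntax used by ProcessMaker IDP is not shown here.

```python
import logging

logger = logging.getLogger("ner-example")  # hypothetical logger name


def matches(condition, metadata):
    """Return True when every field in the condition equals the file's metadata."""
    return all(metadata.get(field) == value for field, value in condition.items())


def select_configuration(configurations, metadata):
    """Pick the single matching NER configuration, or None (hypothetical helper)."""
    matching = [c for c in configurations if matches(c["condition"], metadata)]
    if len(matching) > 1:
        # Ambiguous match: log a warning and skip the file.
        names = ", ".join(c["name"] for c in matching)
        logger.warning("Multiple NER configurations match (%s); file skipped", names)
        return None
    return matching[0] if matching else None


configurations = [
    {"name": "Dutch letters", "condition": {"language": "nl", "type": "letter"}},
    {"name": "English invoices", "condition": {"language": "en", "type": "invoice"}},
]
selected = select_configuration(configurations, {"language": "nl", "type": "letter"})
print(selected["name"])
```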

Named Entities

For every NER configuration the admin must specify which entity types should be recognized. Use the exact entity types specified above for recognition via the pretrained model.

Name: The name of the NER configuration. Use a descriptive name, since it's used in the JSON result (when configured).
Redact: When enabled, all found entities of this named entity type are redacted (masked). A redacted rendition is only created when at least one of the named entities of a configuration has this field enabled and NER detects one or more entities in the uploaded document.
Replacement Value: When a replacement text is configured, this exact text is used to update text files and, when configured, the Full Text attribute.
Annotation Label: Annotation labels can be used to create annotations for processed documents. The admin can select labels from all annotation schemas. Authorized users have the option to view annotations via the Document action button in the document viewer.

Regular Expressions

In addition to the named entity recognition via the NER Model, administrators can define these types of regular expressions:

Correction

Via the correction regex, found entities are trimmed or extended with given text. Add the correction regex to the named entity you want to correct.

Example Correction: Remove Text

Entity type: name
Regex: (\D+)
Text: 'John Smith1 and Michael Smith'
Result: found entities 'John Smith', 'Michael Smith'

Explanation

The example shows how the correction regex can be used to trim named entities. In this case, the regex (\D+) matches sequences of non-digit characters, so the trailing digit is dropped from 'John Smith1' and the resulting entities are 'John Smith' and 'Michael Smith'.

Example Correction: Extending

Entity type: name
Regex: (sir/madam {entity})
Text: John Smith is a civil servant
Result: found named entity: 'sir/madam John Smith'

Explanation

The example demonstrates how the correction regex can be used to extend named entities. In this case, the regex (sir/madam {entity}) is used to prepend "sir/madam" to the named entity "John Smith". The resulting named entity after applying the regex is 'sir/madam John Smith'.
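
The "Remove Text" correction can be sketched in Python as follows. The `correct_entity` helper is hypothetical, and it assumes that the correction regex is applied to each found entity and that the first capture group becomes the corrected entity. Extension with `{entity}` is not covered, since how the module resolves it is not specified here.

```python
import re


def correct_entity(entity, correction_regex):
    """Trim a found entity to the part matched by the correction regex.

    Hypothetical helper: keeps the first capture group when the pattern
    defines one, otherwise the whole match; returns the entity unchanged
    when the pattern does not match.
    """
    match = re.search(correction_regex, entity, flags=re.IGNORECASE)
    if match is None:
        return entity
    return match.group(1) if match.re.groups else match.group(0)


found = ["John Smith1", "Michael Smith"]
print([correct_entity(entity, r"(\D+)") for entity in found])
```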

Extraction

Where NER (Named Entity Recognition) is used to extract named entities from natural text, regular expressions are used to extract named entities from structured text. Fixed patterns are used for entity types such as email addresses, zip codes, or dates. For names, more context-based patterns are used that include a capture group. Regular expressions in this module are case-insensitive unless configured explicitly.

Fixed Pattern Example

Entity type: zip-code
Regex: [1-9]{4} *[A-Z]{2}
Text: Address: Pr. Catharina-Amaliastraat 5, 2496 XD The Hague
Result: 2496 XD

Capture Group Example

Entity type: name
Regex: Dear sir\/madam ([A-Za-z]*),
Text: Dear sir/madam Smith,
Result: Smith
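
Both extraction examples can be reproduced with Python's `re` module. This is a minimal sketch: the `extract` function is a hypothetical name, and the capture-group rule (use group 1 when present, otherwise the whole match) is an assumption about how the module picks the result. Case-insensitive matching mirrors the default described above.

```python
import re


def extract(text, pattern):
    """Return all entities matched by an extraction regex (hypothetical helper)."""
    results = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Use the capture group when the pattern defines one, else the full match.
        results.append(match.group(1) if match.re.groups else match.group(0))
    return results


# Fixed pattern: the whole match is the entity.
address = "Address: Pr. Catharina-Amaliastraat 5, 2496 XD The Hague"
print(extract(address, r"[1-9]{4} *[A-Z]{2}"))

# Capture group: only the captured name is the entity.
print(extract("Dear sir/madam Smith,", r"Dear sir\/madam ([A-Za-z]*),"))
```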

Filter

Filter out a named entity based on context. Add the context-based filter regex to the named entity you want to filter.

Example Context-Based Filter

Entity type: name
Regex: {entity} was absent
Text: John was absent
Result: named entity 'John' is removed from the array of found named entities

Explanation

This example demonstrates how to use a context-based regex to filter out named entities. In this case, the regex {entity} was absent matches the context where the entity "John" is mentioned as being absent, resulting in the removal of 'John' from the array of found named entities.
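
A context-based filter of this kind can be sketched as follows. The `apply_context_filter` helper is hypothetical. It assumes that `{entity}` is replaced by the literal entity text before the regex is matched against the document text, and it works on entity strings rather than on individual occurrences.

```python
import re


def apply_context_filter(text, entities, filter_regex):
    """Drop entities whose context matches the filter regex (hypothetical helper)."""
    kept = []
    for entity in entities:
        # Substitute the placeholder with the escaped entity text.
        pattern = filter_regex.replace("{entity}", re.escape(entity))
        if re.search(pattern, text, flags=re.IGNORECASE):
            continue  # context matched: remove this entity from the results
        kept.append(entity)
    return kept


text = "John was absent. Mary chaired the meeting."
print(apply_context_filter(text, ["John", "Mary"], "{entity} was absent"))
```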

Administration and Reprocessing

Reprocessing NER

Administrators can reprocess a dossier, a specific folder, or individual files through Named-Entity Recognition (NER). When the top level is selected for reprocessing, all folders and files within those folders will be processed. This feature is useful when the configuration is changed or the feature needs to be applied to existing content.

JSON Conditions

Conditions in NER configurations help target specific documents for processing, allowing for efficient management and application of NER rules.

Language Support

ProcessMaker IDP supports NER in multiple languages, with pre-trained models available for each language. The system can recognize a wide range of entities in various languages, enhancing its usability in a global context.

By leveraging the NER module, ProcessMaker IDP enhances document processing capabilities, ensuring that entities are accurately recognized and managed.