Named Entity Recognition
The Named Entity Recognition (NER) module recognizes named entities in PDF and text files. Recognized named entities can be handled in the following ways:
Anonymization: Removes the named entities from the text.
Pseudonymization: Replaces the named entities with a fixed text.
Redaction: Masks the named entities, making them unreadable.
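To make the difference between the three modes concrete, the following is a minimal Python sketch that applies each of them to a plain string. The `Entity` tuple, the `apply_handling` function, the mode names, and the `*` mask character are illustrative assumptions and not part of ProcessMaker IDP; in the product, redaction is applied to a rendition of the file rather than to a string.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """A recognized entity as a character span (hypothetical structure)."""
    start: int
    end: int
    label: str


def apply_handling(text: str, entities: list, mode: str,
                   replacement: str = "[REMOVED]") -> str:
    """Return the text with every entity span handled according to mode."""
    result = text
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        if mode == "anonymize":
            new = ""                           # remove the entity
        elif mode == "pseudonymize":
            new = replacement                  # substitute a fixed text
        elif mode == "redact":
            new = "*" * (ent.end - ent.start)  # mask, keeping the length
        else:
            raise ValueError(f"unknown mode: {mode}")
        result = result[:ent.start] + new + result[ent.end:]
    return result


text = "Alice works at Acme."
entities = [Entity(0, 5, "Person"), Entity(15, 19, "Organization")]
print(apply_handling(text, entities, "pseudonymize", "[NAME]"))
```

Anonymization shortens the text, pseudonymization keeps it readable with a placeholder, and redaction keeps the layout while hiding the content.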
The pretrained models support the following languages and entity types:
Afrikaans: Person, Location, Organization, Miscellaneous
Arabic: Person, Location, Organization, Miscellaneous
Bulgarian: Event, Location, Organization, Person, Product
Chinese: Event, Facility, GPE, Language, Law, Location, Miscellaneous, Money, NORP, Ordinal, Organization, Percent, Person, Product, Quantity, Time, Work of Art
Dutch: Person, Location, Organization, Miscellaneous
English (simple): Person, Location, Organization, Miscellaneous
English (extended):
Cardinal (Numerals that do not fall under another type)
Date (Absolute or relative dates or periods)
Event (Named hurricanes, battles, wars, sports events, etc.)
Facility (Buildings, airports, highways, bridges, etc.)
GPE (Countries, cities, states)
Language (Any named language)
Law (Named documents made into laws)
Location (Non-GPE locations, mountain ranges, bodies of water)
Money (Monetary values, including unit)
NORP (Nationalities or religious or political groups)
Ordinal ("first", "second")
Organization (Companies, agencies, institutions, etc.)
Percentages (Percentage including "%")
Person (People, including fictional)
Product (Vehicles, weapons, foods, etc.; not services)
Quantity (Measurements, as of weight or distance)
Time (Times smaller than a day)
Work of Art (Titles of books, songs, etc.)
Finnish: Person, Location, Organization, Miscellaneous
French: Person, Location, Organization, Miscellaneous
German: Person, Location, Organization, Miscellaneous
Italian: Location, Organization, Person
Hungarian: Person, Location, Organization, Miscellaneous
Myanmar: Location, Miscellaneous, Number, Organization, Person, Race, Time
Russian: Person, Location, Organization, Miscellaneous
Spanish: Person, Location, Organization, Miscellaneous
Ukrainian: Person, Location, Organization, Miscellaneous
Vietnamese: Person, Location, Organization, Miscellaneous
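Because the available entity types differ per language, it can help to check a planned configuration against the list before entering it. The sketch below encodes a small excerpt of the list as a Python dictionary; the `SUPPORTED_TYPES` name, the `unsupported_types` helper, and the case-sensitive comparison are assumptions made for illustration, not product behavior.

```python
# Excerpt of the language / entity type list; extend it as needed.
SUPPORTED_TYPES = {
    "Dutch": {"Person", "Location", "Organization", "Miscellaneous"},
    "Italian": {"Location", "Organization", "Person"},
    "Bulgarian": {"Event", "Location", "Organization", "Person", "Product"},
}


def unsupported_types(language: str, configured: list) -> list:
    """Return the configured types that the language does not list."""
    supported = SUPPORTED_TYPES.get(language)
    if supported is None:
        raise KeyError(f"no entity types recorded for {language!r}")
    # Assumption: type names must match exactly, including case.
    return [t for t in configured if t not in supported]


print(unsupported_types("Italian", ["Person", "Miscellaneous"]))
```

An empty result means every configured type appears in the list for that language.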
Admins can update NER configurations through the ProcessMaker IDP Admin interface.
In the NER configuration the admin can populate the following fields:
Name: Name of the configuration
Json found entities attribute (optional): The attribute that will be populated with a JSON array of the found entities.
Condition: Enables or disables this NER configuration for files with the configured language model. The condition acts as a filter: the administrator provides values that are matched against any of the metadata fields of a file. When multiple conditions match for one file, a warning is written to the log and the file is not processed.
Redaction output format: Possible values are 'Same format as original file' and 'PDF'. 'Same format as original file' creates a rendition with the same media type as the original, so a JPG file results in a JPG rendition. When set to 'PDF', the rendition is converted to PDF.
Overwrite Full Text (optional): In case of anonymization or pseudonymization, the full text attribute can be updated either by removing the found entities or by replacing them with the configured replacement text.
Language Model: Specifies which language model should be used to recognize entities.
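The Condition field described above decides which configuration applies to a file, and a file that matches more than one is skipped with a warning. A minimal sketch of that selection logic follows. The `NerConfig` class, the exact-equality matching on metadata keys, and the logger name are assumptions for illustration; the actual condition syntax in ProcessMaker IDP may differ.

```python
import logging
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger("ner")


@dataclass
class NerConfig:
    """Hypothetical, simplified view of one NER configuration."""
    name: str
    language_model: str
    condition: dict = field(default_factory=dict)  # metadata filter


def matches(config: NerConfig, metadata: dict) -> bool:
    """True when every key/value pair of the condition equals the metadata."""
    return all(metadata.get(k) == v for k, v in config.condition.items())


def select_config(configs: list, metadata: dict) -> Optional[NerConfig]:
    """Return the single matching configuration, or None for zero or several."""
    hits = [c for c in configs if matches(c, metadata)]
    if len(hits) > 1:
        # Mirrors the documented behavior: warn and do not process the file.
        logger.warning("Multiple NER configurations match (%s); file skipped",
                       ", ".join(c.name for c in hits))
        return None
    return hits[0] if hits else None


configs = [
    NerConfig("invoices-en", "English (simple)", {"doctype": "invoice"}),
    NerConfig("hr-en", "English (simple)", {"department": "hr"}),
]
chosen = select_config(configs, {"doctype": "invoice"})
print(chosen.name if chosen else "not processed")
```

Note that in this sketch an empty condition matches every file, so keeping conditions mutually exclusive avoids the multi-match case.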
For every NER configuration the admin must specify which entity types should be recognized. Use the exact entity types specified above for recognition via the pretrained model.
Name: The name of the NER configuration. Use a descriptive name, since it's used in the JSON result (when configured).
Redact: When enabled, all found entities of this named entity type are redacted (masked). A redacted rendition is only created when at least one of the named entities of a configuration has this field enabled and NER detects one or more entities in the uploaded document.
Replacement Value: When a replacement text is configured, this exact text is used to update text files and, when configured, the full text attribute.
Annotation Label: Annotation labels can be used to create annotations for processed documents. The admin can select labels from all annotation schemas. Authorized users can view annotations via the Document action button in the document viewer.
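The per-entity fields above interact in two ways: the Redact flag determines whether a redacted rendition is created, and the Name is reused in the JSON result. The sketch below models both. The `EntityTypeSetting` class, the function names, and the shape of the JSON rows (`name` and `value` keys) are hypothetical; the actual JSON layout produced by ProcessMaker IDP is not specified here.

```python
import json
from dataclasses import dataclass


@dataclass
class EntityTypeSetting:
    """Hypothetical view of one named entity row in a NER configuration."""
    name: str                    # descriptive name, reused in the JSON result
    entity_type: str             # for example "Person"
    redact: bool = False
    replacement_value: str = ""


def needs_redacted_rendition(settings: list, found: list) -> bool:
    """found holds (entity_type, text) pairs detected by NER."""
    # Rule from the Redact field: at least one setting has Redact enabled
    # and NER detected one or more entities in the document.
    redact_enabled = any(s.redact for s in settings)
    return redact_enabled and len(found) > 0


def found_entities_json(settings: list, found: list) -> str:
    """Build a JSON array of found entities labeled with configured names."""
    names = {s.entity_type: s.name for s in settings}
    rows = [{"name": names[t], "value": text}
            for t, text in found if t in names]
    return json.dumps(rows)


settings = [
    EntityTypeSetting("Customer name", "Person", redact=True),
    EntityTypeSetting("Company", "Organization"),
]
found = [("Person", "Alice")]
print(needs_redacted_rendition(settings, found))
print(found_entities_json(settings, found))
```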
In addition to the named entity recognition via the NER Model, administrators can define these types of regular expressions:
JSON Conditions
Conditions in NER configurations help target specific documents for processing, allowing for efficient management and application of NER rules.
ProcessMaker IDP supports NER in multiple languages, with pre-trained models available for each language. The system can recognize a wide range of entities in various languages, enhancing its usability in a global context.
By leveraging the NER module, ProcessMaker IDP enhances document processing capabilities, ensuring that entities are accurately recognized and managed.
Reprocessing NER
Administrators can reprocess a dossier, a specific folder, or individual files through . When the top level is selected for reprocessing, all folders and files within those folders will be processed. This feature is useful when the configuration is changed or the feature needs to be applied to existing content.
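The scope of a reprocessing run can be pictured as a depth-first walk from the selected item. The following sketch assumes a simple tree model; the `Node` class and `files_to_reprocess` function are hypothetical and only illustrate which files fall within a selection.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a dossier, a folder, or a file."""
    name: str
    is_file: bool = False
    children: list = field(default_factory=list)


def files_to_reprocess(selected: Node) -> list:
    """Collect every file at or below the selected node, depth first."""
    if selected.is_file:
        return [selected.name]
    collected = []
    for child in selected.children:
        collected.extend(files_to_reprocess(child))
    return collected


dossier = Node("dossier", children=[
    Node("a.pdf", is_file=True),
    Node("sub", children=[Node("b.txt", is_file=True)]),
])
print(files_to_reprocess(dossier))
```

Selecting the dossier covers both files, while selecting the `sub` folder covers only the file inside it.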