IDOL Eduction Grammars
The following section describes the Eduction grammars available in the IDOL Eduction Grammars Package.
You can use these grammars with IDOL Eduction, by using Eduction Server, the edktool command-line utility, or the Eduction SDK. For more information, refer to the IDOL Eduction User and Programming Guide.
IMPORTANT: To use the grammars in the IDOL Eduction Grammars Package, you must have a license that enables them. To obtain a license, contact OpenText Support.
The IDOL Eduction Grammars Package includes a default configuration file for each set of grammars. Each file contains the basic settings that you need to use the grammars.
NOTE: If you create your own configuration file, you must include some of the settings in the default configuration file. For all grammar sets, you must include post-processing (see Configure Post Processing). For the PII, PHI, and PCI grammar sets, you must also configure Eduction components (see PII Components, PHI Components, and PCI Components).
Configure Post Processing
When you use the grammars in the IDOL Eduction Grammars Package, you must configure a Lua post-processing task to run the post-processing script for the grammar set: pii_postprocessing.lua, phi_postprocessing.lua, pci_postprocessing.lua, or gov_postprocessing.lua.
This script contains post-processing to improve results for various entities, such as stop list filtering and checksum validation (see Validated ID Numbers). For PII grammars, the script also provides entity name mapping for combined grammars (see Combined Entities) and ambiguous landmark detection (see Ambiguous Entities).
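As a rough illustration of what stop list filtering and checksum validation do, the following Python sketch drops candidate matches that are on a stop list or that fail a checksum. It uses the Luhn checksum, which is commonly used for payment card numbers. This is an assumption for illustration only: the packaged scripts are written in Lua, and this section does not state which checks they apply to which entities. The function names are hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def filter_matches(matches, stop_list):
    """Keep only matches that are not on the stop list and pass the checksum."""
    return [m for m in matches if m not in stop_list and luhn_valid(m)]
```

A post-processing step of this kind removes values that only look like an entity, which is why results can degrade if the script does not run.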
IMPORTANT: If you do not run this script, you might encounter unexpected behavior.
The default configuration file provided in the IDOL Eduction Grammars Package includes a suitable post-processing task. If you use a different configuration, you must add the post-processing task to your Eduction configuration. For example:
[Eduction]
PostProcessingTask0=MyPostProcessingSection

[MyPostProcessingSection]
Type=Lua
Script=scripts/pii_postprocessing.lua
Entities=pii/*,gdpr/*
IMPORTANT: For PII, PHI, and PCI grammars, the post-processing script requires Eduction components. For these grammar sets, the default configuration file enables components. If you use a custom configuration file, you must set the EnableComponents parameter to True to return components.
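If you maintain custom configuration files, a small check such as the following Python sketch can catch a missing post-processing task or a missing EnableComponents setting before you run Eduction. It assumes that the configuration parses as INI-style text and that EnableComponents is set in the [Eduction] section. Both are assumptions for illustration, and the function name is hypothetical. Refer to the Eduction User and Programming Guide for the authoritative parameter reference.

```python
import configparser


def check_eduction_config(text: str) -> list:
    """Return a list of problems found in an Eduction-style configuration."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(text)
    if not parser.has_section("Eduction"):
        return ["missing [Eduction] section"]

    problems = []
    eduction = parser["Eduction"]
    # configparser lower-cases option names, so compare in lower case.
    task_sections = [
        value for key, value in eduction.items()
        if key.startswith("postprocessingtask")
    ]
    lua_tasks = [
        name for name in task_sections
        if parser.has_section(name)
        and parser[name].get("type", "").lower() == "lua"
    ]
    if not lua_tasks:
        problems.append("no Lua post-processing task configured")
    # Assumption: EnableComponents belongs in the [Eduction] section.
    if eduction.get("enablecomponents", "").lower() != "true":
        problems.append("EnableComponents is not set to True")
    return problems
```

An empty list means that the two settings described in this section are present. It does not confirm that the rest of the configuration is valid.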
An additional post-processing script is available for tabular data, which you might want to configure to improve matching in tables. See Entity Context.
For more information about configuring post-processing tasks, refer to the Eduction User and Programming Guide.
Configure Prefiltering
Prefiltering allows Eduction to run a quick initial check to find potential matches in your input text. It then selects match windows around these potential matches, which reduces the amount of text that it must match against your grammars. This process can improve performance when the density of matches is low, because it filters out large amounts of text that cannot match.
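The general idea can be sketched in Python as follows: a cheap pattern locates candidate positions, overlapping windows around them are merged, and the more expensive pattern runs only inside those windows. This is a conceptual illustration, not how Eduction implements prefiltering. The function names, the regular expressions, and the margin value are all hypothetical.

```python
import re


def match_windows(text, prefilter, margin=40):
    """Return merged (start, end) windows around each prefilter hit."""
    windows = []
    for hit in prefilter.finditer(text):
        start = max(0, hit.start() - margin)
        end = min(len(text), hit.end() + margin)
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it instead of adding one.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows


def match_with_prefilter(text, prefilter, full_pattern, margin=40):
    """Run the expensive pattern only inside the prefilter windows."""
    results = []
    for start, end in match_windows(text, prefilter, margin):
        for m in full_pattern.finditer(text, start, end):
            results.append((m.start(), m.group()))
    return results
```

When most of the input contains no candidate positions, the windows cover only a small fraction of the text, which is where the performance gain comes from.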
For the PII and PHI grammar sets, the IDOL Eduction Grammars Package includes recommended prefilter configuration and dictionary files that cover entities from the name, address, and medical grammars. A general numeric prefilter configuration is also included for use with entities that contain a numeric portion, such as telephone or passport numbers.
NOTE: Prefilter tasks run for all configured entities, so you must configure each prefilter task only for the appropriate entities to ensure that it does not affect the results for other entities.
For more information about prefiltering, refer to the Eduction User and Programming Guide.
IMPORTANT: To use the DPF files from the 24.4 package, you must use Eduction tools with a version of 12.9 or later.
Entity Context
Some of the entities are available in two versions, with and without context. The context-based entities match the entity when it occurs in an easily identifiable location in text. For example, it might match a telephone number that occurs next to the prefix Phone:.
The entities that do not have context attempt to match the entity wherever it occurs. This version might over-match significantly. That is, it is likely to return values that are similar to the entity patterns, such as a number that is not a telephone number. However, it also reduces the number of false negatives, so it misses fewer genuine matches.
You can configure Eduction to use both versions of an entity; matches located with context are given a higher score in the results.
Tabular Data
When you have data in tables, the context for an entity might not occur next to the entity value. For example, you might have a table with columns titled name and date of birth, but the values themselves do not occur next to these headers.
In this case, you can use Eduction table extraction to extract entities according to the landmarks detected in the table headers. For example, you can configure Eduction so that if it finds a table heading that matches the landmark date of birth, it extracts dates from that column.
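The following Python sketch illustrates the idea with a simple CSV table: it looks for header cells that match a known landmark and then extracts values only from those columns. It is a conceptual example and does not reflect how Eduction table extraction is implemented or configured. The landmark map, the date pattern, and the function name are hypothetical.

```python
import csv
import io
import re

# Hypothetical landmark-to-pattern map; real grammars define their own landmarks.
LANDMARKS = {
    "date of birth": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_from_table(csv_text):
    """Extract values from columns whose header matches a known landmark."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    results = []
    for col, title in enumerate(header):
        pattern = LANDMARKS.get(title.strip().lower())
        if pattern is None:
            continue  # No landmark in this header, so skip the column.
        for row in body:
            if col < len(row):
                for m in pattern.finditer(row[col]):
                    results.append((title.strip(), m.group()))
    return results
```

Because the header supplies the context, a date in a column without a matching landmark is not returned, even though it matches the same pattern.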
Table extraction works best in combination with the table output from IDOL KeyView filtering, which extracts tables into a format that is easy for Eduction to process. It can also work with simple CSV and TSV format tables.
For more information about how to configure table extraction, refer to the Eduction User and Programming Guide.
TIP: For table extraction, a post-processing script table_boost.lua is included in the script directory. When you configure Eduction to perform extraction from table data, this script increases the score of matches found in a table body that have an associated header landmark.
For the PII, PHI, and PCI grammar sets, the config directory in the grammar package also contains a sample configuration file (pii_table.cfg, phi_table.cfg, or pci_table.cfg) that illustrates how to set up table-mode matching and configure post-processing with this script.
For the Government grammar set, you can use table extraction only for entities that have landmarks, such as gov/number/export/eccn/.
Grammar Metadata
The grammar files in the IDOL Eduction Grammars Package come with metadata JSON files, which include information about the grammars, such as supported countries and languages. You can use these metadata files to make it easier to select entities when you configure Eduction. For example, you might use them to create a user interface where your end users can select entities according to the languages that the entities cover.
Each grammar has its own metadata file, which covers all entities for that grammar. These metadata files are included in the IDOL Eduction Grammars Package alongside the relevant grammar file. A metadata_schema.json file is also available, which contains the schema for the metadata JSON files.
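As a sketch of how you might use the metadata to select entities by language, the following Python function filters a metadata document for entities that list a given language. The field names entities, name, and languages are assumptions made for this illustration and might not match the real files. Check metadata_schema.json for the actual structure before you adapt it.

```python
import json


def entities_for_language(metadata_text, language):
    """List entity names whose metadata lists the given language."""
    metadata = json.loads(metadata_text)
    selected = []
    # "entities", "name", and "languages" are assumed field names,
    # not taken from the real metadata schema.
    for entity in metadata.get("entities", []):
        if language in entity.get("languages", []):
            selected.append(entity["name"])
    return selected
```

A user interface could call a function like this once for each metadata file and present the combined list to the end user.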
PII Grammar ECR and EJR Files
IMPORTANT: To use the EJR grammar files from the 24.4 package, you must use Eduction tools with a version of 12.9 or later.
Some grammars in the PII grammar set are available in two formats, ECR and EJR. In these cases, both formats contain the same entities for extraction, and the format that you use depends on your input data.
EJR files are performance-optimized for cases where the expected match density in your input text is low. OpenText recommends that you use EJR files when you expect less than 10% of the input text to be valid matches. In all other cases, use the ECR files.
When you use EJR grammars, you must run them in a separate matching engine from any ECR grammars, although you can run multiple EJR grammars in the same engine.
For example, the following configuration is allowed:
ResourceFiles=passport.ejr,date.ejr
You cannot set ResourceFiles=passport.ejr,date.ecr, because this mixes EJR and ECR grammars in the same engine.
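If you generate ResourceFiles values programmatically, a simple check such as this Python sketch can reject a value that mixes the two formats. The function name is hypothetical, and the check looks only at file extensions.

```python
def check_resource_files(value):
    """Return True if a ResourceFiles value does not mix EJR and ECR files."""
    extensions = {
        name.strip().rsplit(".", 1)[-1].lower()
        for name in value.split(",")
        if name.strip()
    }
    # The value is invalid only when both formats appear together.
    return not ({"ejr", "ecr"} <= extensions)
```

For the two examples above, the check accepts passport.ejr,date.ejr and rejects passport.ejr,date.ecr.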