Eduction is the process of finding entities in documents, and typically extracting them to form additional fields in the document. Entities are words, phrases, or blocks of information. Eduction finds entities by using predefined or custom grammars, which use linguistic, regular expression, or term-based rules to define the pattern to match. Common entities include names, addresses, phone numbers, and credit card numbers.
Eduction lets you automatically extract entities from your documents and add them as metadata before you index the documents into IDOL Server. For example, you can extract common search phrases from documents before indexing and tag the documents with this data, so that searches for these common values are quick and simple. This makes Eduction a useful tool for preprocessing data.
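The following minimal sketch illustrates the general idea of pattern-based entity extraction followed by tagging. It is not the Eduction API and does not use Eduction grammars: the PATTERNS table, the field names, and the extract_entities and enrich_document functions are hypothetical, and the regular expressions are deliberately simplified stand-ins for what a real grammar would define.

```python
import re

# Hypothetical patterns standing in for grammar entities. Real grammars are
# far more thorough than these simplified regular expressions.
PATTERNS = {
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "CARD_NUMBER": re.compile(r"\b(?:\d{4}[ -]){3}\d{4}\b"),
}


def extract_entities(text):
    """Return a dict mapping each field name to the matches found in text."""
    fields = {}
    for field_name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            fields[field_name] = matches
    return fields


def enrich_document(document):
    """Return a copy of the document with extracted entities added as fields."""
    enriched = dict(document)
    enriched.update(extract_entities(document["content"]))
    return enriched


doc = {"content": "Call 555-867-5309 about card 4111 1111 1111 1111."}
enriched = enrich_document(doc)
print(enriched)
```

In this sketch the extracted values become additional fields on a copy of the document, which mirrors the tagging step described above: a later search can look up the field directly instead of scanning the full text again.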
Eduction is most commonly used in IDOL Server as a plug-in module for pre-index processing. It is also available as a stand-alone command-line tool, an API (for the C++ and Java languages), or an ACI server. In addition to extraction, you can use the command-line tool (edktool) to compile and test grammars.
Eduction has a large and growing collection of standard grammar files, which allow you to quickly extract common entities. These are available for many major languages and geographical locations. You can set up Eduction to extract the information specified by the standard grammars very quickly.
In some cases you must write your own custom grammars, for example, if the standard grammar does not return the expected number of matches. See When to Extend a Grammar.
You can build custom grammars from the ground up. However, the standard grammars provide many basic entities that you can reference in your grammars, which allows you to create new custom grammars quickly.
If you reference other entities in an entity that you create, you can use one of the following reference extensions:
(?A^Entity): During compilation, create a link to the referenced entity from your entity.
(?A:>Entity): During compilation, add the compiled version of the referenced entity to your entity.
With the first option, compilation is quicker and the resulting grammar file is much smaller. The second option can provide a small performance gain during extraction. Micro Focus recommends that you use the first option in most cases, unless extraction performance is critical.
Eduction has many configuration options that allow you to change its matching behavior or choose which matches to return. When you configure Eduction, you must carefully consider the kind of data you have, the grammar that you want to use, and the results that you want.
The standard grammar files are provided in a proprietary compiled format, .ECR. The compiled format ensures that entity extraction is very efficient and consistent.
You can also use the edktool command-line tool to compile your custom grammars to .ECR format, which improves the performance of extraction for these grammars. If you do not want to compile grammars, Eduction can use grammars in the original XML format. This method saves you the compilation time, but generally results in slower entity extraction.