The set of rules that describe a type of entity is known as a grammar. Some grammars contain a dictionary of terms. Others contain an expression that defines the pattern that the type of entity follows.
IDOL provides a large number of these grammars as standard, so that you can extract common entities. You can also add custom items to the available grammars, or add a custom grammar of your own.
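The difference between the two kinds of grammar described above can be illustrated with a minimal Python sketch. This is not the Eduction API or grammar format; the names `COMPANY_TERMS`, `PHONE_PATTERN`, `extract_dictionary`, and `extract_pattern` are hypothetical, and the phone pattern is deliberately simplified.

```python
import re

# Dictionary-style grammar: a fixed set of known terms (hypothetical sample data).
COMPANY_TERMS = {"Acme Corp", "Globex", "Initech"}

# Pattern-style grammar: an expression that describes the shape of the entity.
# Simplified assumption: numbers written as three groups, such as 555-123-4567.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_dictionary(text, terms):
    """Return (match, offset) pairs for every dictionary term found in text."""
    matches = []
    for term in terms:
        # Escape the term so that it is matched literally, on word boundaries.
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            matches.append((m.group(), m.start()))
    return sorted(matches, key=lambda pair: pair[1])


def extract_pattern(text, pattern):
    """Return (match, offset) pairs for every occurrence of the pattern."""
    return [(m.group(), m.start()) for m in pattern.finditer(text)]


text = "Call Initech on 555-123-4567 or Globex on 555-987-6543."
companies = extract_dictionary(text, COMPANY_TERMS)
phones = extract_pattern(text, PHONE_PATTERN)
```

A dictionary suits entities that can be listed, such as company names. A pattern suits entities that follow a format, such as phone numbers or dates.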
Eduction has a large and expanding set of standard grammar files that you can use to extract entities such as:
place names
person names
company names
legal terms
credit card numbers
social security numbers
phone numbers
addresses
dates
times
Internet addresses
weights and measurements
The standard grammars cover all major languages and geographical locations. You can easily set up Eduction to extract the information that the standard grammars support.
You can create your own grammar files to define the entities that you want to extract. An Eduction grammar file uses a standard XML format with a simple document type definition (DTD). You can define entities by using UNIX-like regular expressions, together with Eduction-specific operations and extensions.
You can build up complex entities by referencing existing entities. The standard grammar files therefore offer a large collection of resources that you can build on to meet your own needs. You can extend and replace the entities in standard grammar files with your own entities.
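The idea of building complex entities from existing ones can be sketched in Python as follows. This is an illustration only and does not use the Eduction grammar syntax. The `{name}` placeholder convention, the `resolve` helper, and the entity names are all assumptions made for this example.

```python
import re

# Hypothetical base entities, standing in for entities from a standard grammar.
BASE_ENTITIES = {
    "day": r"(?:0?[1-9]|[12]\d|3[01])",
    "month": (r"(?:January|February|March|April|May|June|July|"
              r"August|September|October|November|December)"),
    "year": r"(?:19|20)\d{2}",
}

# Custom entities that reference other entities by name with {name} placeholders.
CUSTOM_ENTITIES = {
    "date": r"{day} {month} {year}",
    "date_range": r"{date} to {date}",
}

ALL_ENTITIES = {**BASE_ENTITIES, **CUSTOM_ENTITIES}


def resolve(name, entities, _seen=()):
    """Expand an entity into a plain regular expression, following references."""
    if name in _seen:
        raise ValueError("circular reference: " + " -> ".join(_seen + (name,)))

    def substitute(match):
        # Wrap each referenced entity in a non-capturing group.
        return "(?:" + resolve(match.group(1), entities, _seen + (name,)) + ")"

    # Only {lowercase_name} is a reference; quantifiers such as {2} are left alone.
    return re.sub(r"\{([a-z_]+)\}", substitute, entities[name])


date_range = re.compile(resolve("date_range", ALL_ENTITIES))
found = date_range.search("Open 3 May 2021 to 17 June 2021 only")
```

In this sketch, replacing an entry such as `"month"` in the dictionary changes every entity that references it, which mirrors the way that extending or replacing a base entity affects the entities built on it.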