IDOL Server stores document text as a series of tokens. Generally, a token is a word, but it can also include other strings of characters (such as a phone number or e-mail address).
During the index process, IDOL Server converts the text into tokens for matching, and stores them in Index fields. The following sections describe how IDOL Server converts text into tokens, and how you can manipulate the results.
IDOL Server processes characters according to their common use, and uses them to define tokens in text.
IDOL Server contains a default set of definitions, which is appropriate in most cases. You might want to change the default definitions if your data contains special strings that you want to search for, such as e-mail addresses. See Example: E-mail Addresses.
In IDOL Server, the following three types of character define tokens and the breaks between tokens:
Character type | Description |
---|---|
Text character | Letters and numbers, including logograph characters from Asian writing systems. |
Separator character | Characters that separate two words, such as spaces, tabs, and line breaks. |
Non-separator character | Other characters, such as punctuation. |
When you index data into IDOL Server:
Character type | Result |
---|---|
Text characters | Do not change. |
Separator characters | Become spaces, which mark the break between one token and the next. |
Non-separator characters | Are deleted. If text is separated only by non-separator characters, it becomes a single token. |
Consider the following e-mail address:
joe.smith@example.com
In the default IDOL Server configuration:
The at symbol (@) is a separator character.
The period (.) is a non-separator character.
When IDOL Server processes the e-mail address, it produces two tokens, JOESMITH and EXAMPLECOM, which you can search for to return this document.
However, if you search for JOE or EXAMPLE, this document is not returned. To be able to search for these terms, you must change the default separator characters. In this example, you would define the period (.) as a separator character.
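The effect of the separator setting on this e-mail address can be sketched in a few lines of Python. This is only an illustration of the three character types described above, not IDOL Server code: the function name tokenize, the default separator set, and the use of isalnum() to recognize text characters are all assumptions made for the example.

```python
def tokenize(text, separators=" \t\n@"):
    """Split text into uppercase tokens using three character classes.

    Hypothetical model: alphanumerics are text characters, anything in
    `separators` is a separator, and everything else is a non-separator.
    """
    out = []
    for ch in text:
        if ch.isalnum():
            out.append(ch.upper())  # text character: kept
        elif ch in separators:
            out.append(" ")         # separator: becomes a token break
        # any other character is a non-separator and is dropped
    return "".join(out).split()


address = "joe.smith@example.com"
default_tokens = tokenize(address)                       # period is a non-separator
period_tokens = tokenize(address, separators=" \t\n@.")  # period is a separator

print(default_tokens)  # ['JOESMITH', 'EXAMPLECOM']
print(period_tokens)   # ['JOE', 'SMITH', 'EXAMPLE', 'COM']
```

With the period treated as a separator, JOE and EXAMPLE become tokens in their own right, so searches for them can match the document.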
To add characters to the list of text characters, use the TangibleCharacters configuration parameter. For example, when you index social media documents, you might want tokens to include the hashtag (#) or at symbol (@). For more information, refer to the IDOL Server Reference.
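As a rough model of what adding text characters does, the following Python sketch treats an extra set of characters as part of the token. The tangible argument is a hypothetical stand-in for the TangibleCharacters setting, and the assumption that it takes priority over the separator list is made only for this example.

```python
def tokenize(text, separators=" \t\n@", tangible=""):
    """Tokenize text, keeping any character listed in `tangible`.

    `tangible` is a hypothetical stand-in for TangibleCharacters.
    """
    out = []
    for ch in text:
        if ch.isalnum() or ch in tangible:
            out.append(ch.upper())  # text character: kept
        elif ch in separators:
            out.append(" ")         # separator: token break
        # remaining characters are non-separators and are dropped
    return "".join(out).split()


post = "thanks @joe #idol"
plain_tokens = tokenize(post)
social_tokens = tokenize(post, tangible="#@")

print(plain_tokens)   # ['THANKS', 'JOE', 'IDOL']
print(social_tokens)  # ['THANKS', '@JOE', '#IDOL']
```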
You can use the additional setting HyphenChars to index both the subparts and the whole of a hyphenated word.
For example, if you set HyphenChars=. then the text:
email joe.smith
is tokenized into four terms: EMAIL, JOE, SMITH, and JOESMITH. Searches for Joe, Joe Smith, Joe.Smith, and joesmith all match the document.
However, the subparts of the hyphenated terms are indexed without a position. This means that you cannot match them with a phrase or proximity query. For example, for the text above, a query for the phrase email joe does not match.
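The following Python sketch models why the phrase query fails. It is an illustration under stated assumptions, not the IDOL Server implementation: the functions index_with_hyphens and phrase_matches are hypothetical, words are split on whitespace only, and a missing position is represented as None.

```python
import re


def index_with_hyphens(text, hyphen_chars="."):
    """Return (term, position) pairs; hyphenated subparts get position None."""
    postings = []
    splitter = "[" + re.escape(hyphen_chars) + "]"
    for position, word in enumerate(text.upper().split()):
        parts = [p for p in re.split(splitter, word) if p]
        if len(parts) > 1:
            # Subparts are searchable but carry no position.
            postings.extend((part, None) for part in parts)
        # The whole word (hyphen characters removed) keeps its position.
        postings.append(("".join(parts), position))
    return postings


def phrase_matches(postings, phrase):
    """True if the phrase terms occur at consecutive positions."""
    wanted = phrase.upper().split()
    positions = {}
    for term, pos in postings:
        if pos is not None:
            positions.setdefault(term, set()).add(pos)
    return any(
        all(start + i in positions.get(term, set()) for i, term in enumerate(wanted))
        for start in positions.get(wanted[0], set())
    )


postings = index_with_hyphens("email joe.smith")
print([term for term, _ in postings])              # ['EMAIL', 'JOE', 'SMITH', 'JOESMITH']
print(phrase_matches(postings, "email joe"))       # False
print(phrase_matches(postings, "email joesmith"))  # True
```

In this model JOE is present as a term, so a plain search for it matches, but it has no position to compare against EMAIL, so the phrase cannot match.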
For this reason, Micro Focus recommends that you turn hyphenation off in almost all cases, by setting HyphenChars to NONE, and instead use AugmentSeparators. For example:
AugmentSeparators=.
This configuration matches all the same queries, except for joesmith, which is a rarely used query. The benefits of AugmentSeparators generally greatly outweigh the value of this type of query.
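To see which of the earlier queries still match, the sketch below tokenizes both the document and each query with the period added to the separators. It is a simplified model rather than IDOL Server behavior: the matches function is hypothetical, treats a query as matching when all its terms occur in the document, and ignores ranking and positions.

```python
BASE_SEPARATORS = " \t\n@"  # assumed default separators for this example


def tokenize(text, separators):
    """Uppercase tokens: alphanumerics kept, separators split, others dropped."""
    out = []
    for ch in text:
        if ch.isalnum():
            out.append(ch.upper())
        elif ch in separators:
            out.append(" ")
    return "".join(out).split()


def matches(query, document, separators):
    """True if every query term appears among the document terms."""
    doc_terms = set(tokenize(document, separators))
    return all(term in doc_terms for term in tokenize(query, separators))


augmented = BASE_SEPARATORS + "."  # stands in for AugmentSeparators=.
queries = ["Joe", "Joe Smith", "Joe.Smith", "joesmith"]
results = {q: matches(q, "email joe.smith", augmented) for q in queries}

print(results)
# {'Joe': True, 'Joe Smith': True, 'Joe.Smith': True, 'joesmith': False}
```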
The NumberPunctuation option specifies characters that must be treated differently in numeric and alphabetic terms. The setting defines characters to treat as TangibleCharacters when they appear with a numeric character on both sides. For example, if the period (.) is a separator and is also a NumberPunctuation character:
123.com is tokenized as 123 and COM.
3.14 is tokenized as 3.14.
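The two results above can be reproduced with a small Python sketch. It assumes, for illustration only, that the period is both a separator and a number-punctuation character, and that a number-punctuation character is kept only when the characters immediately before and after it are digits. The function and parameter names are hypothetical.

```python
def tokenize(text, separators=" \t\n.", number_punctuation="."):
    """Tokenize text, keeping number punctuation that sits between two digits."""
    out = []
    for i, ch in enumerate(text):
        prev_is_digit = i > 0 and text[i - 1].isdigit()
        next_is_digit = i + 1 < len(text) and text[i + 1].isdigit()
        if ch.isalnum():
            out.append(ch.upper())  # text character: kept
        elif ch in number_punctuation and prev_is_digit and next_is_digit:
            out.append(ch)          # treated as a text character inside a number
        elif ch in separators:
            out.append(" ")         # separator: token break
        # remaining characters are non-separators and are dropped
    return "".join(out).split()


domain_tokens = tokenize("123.com")
number_tokens = tokenize("3.14")

print(domain_tokens)  # ['123', 'COM']
print(number_tokens)  # ['3.14']
```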
You can set the NumberPunctuation character individually for different languages.