Thursday, April 8, 2010

Spell-checking in IDOL

Autonomy provides spell-correction functionality that identifies misspelled words in a query and alerts the user to them. This feature can be configured to trigger only when specific conditions are met, such as:

1) Return spell corrections only when the query has fewer than 5 terms (the terms are counted after stop words are eliminated).
IDOL cfg param: SpellCheckMaxCheckTerms=5
2) Return spell corrections only when the suspect term occurs in fewer than a certain number of documents. This check lets a term become legitimate once it crosses a threshold number of document occurrences.
IDOL cfg param: SpellCheckIncorrectMaxDocOccs=1000

and for the correction:

1) Return a correction only if it occurs in at least a minimum number of documents. This prevents another misspelled term from being returned as a suggestion for the original term.
IDOL cfg param: SpellCheckCorrectMinDocOccs=100
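
To make the three thresholds concrete, here is a small Python sketch of the decision logic as described above. It is only an illustration, not IDOL's actual implementation: the function names are hypothetical, and the exact boundary behaviour (strictly less than versus less than or equal) is an assumption based on the wording in this post.

```python
# Illustrative only: mirrors the thresholds described in this post.
# Function names are hypothetical; boundary handling is an assumption.

def should_check_query(terms, stop_words, max_check_terms=5):
    """Check only queries that are short once stop words are removed
    (compare SpellCheckMaxCheckTerms)."""
    content_terms = [t for t in terms if t.lower() not in stop_words]
    return len(content_terms) < max_check_terms


def is_suspect(term_doc_occs, incorrect_max_doc_occs=1000):
    """A term seen in enough documents is treated as legitimate
    (compare SpellCheckIncorrectMaxDocOccs)."""
    return term_doc_occs < incorrect_max_doc_occs


def is_acceptable_correction(candidate_doc_occs, correct_min_doc_occs=100):
    """A suggestion must itself be common enough to be trusted
    (compare SpellCheckCorrectMinDocOccs)."""
    return candidate_doc_occs >= correct_min_doc_occs
```

For example, a rare term occurring in 12 documents would be flagged as suspect, but a candidate correction occurring in only 3 documents would be rejected as a suggestion.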

Whenever IDOL returns a correction for a misspelled term, it stores the information in memory. When the IDOL server is brought down, it writes all the corrections to a file named "prx.db" under the content's "main" directory. This file is in XML format and looks like:

<PROXIMALS>
<PROXIMAL ORIG="AGOSTINI">
<PROXIMAL ORIG="AGPM">
<PROXIMAL ORIG="AIBD">
<PROXIMAL ORIG="AIDA" CORRECT="aid">
<PROXIMAL ORIG="AIDAN" CORRECT="aida">
</PROXIMALS>

ORIG - the term identified by IDOL as misspelled
CORRECT - the suggested correction for it

Entries in the file that do not have the "CORRECT" part are terms for which no correction will be returned.

The prx.db file can be edited to add or remove specific entries. Make sure the changes you make to the file are valid XML (escape any XML special characters). Also, the term specified in "ORIG" MUST always be in upper case. For example:

incorrect (will not load): <PROXIMAL ORIG="aidan" CORRECT="aida">
correct: <PROXIMAL ORIG="AIDAN" CORRECT="aida">

The number of entries added to the file should not be more than the "SpellCheckCacheMaxSize" parameter specified in the IDOL server cfg.
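
If you generate entries with a script rather than by hand, the editing rules above (upper-case ORIG, escaped XML special characters, entry count within the cache limit) can be sketched in Python as follows. The helper names are hypothetical, the entry layout simply copies the sample shown earlier in this post, and the limit has to be whatever your own cfg sets for SpellCheckCacheMaxSize.

```python
from xml.sax.saxutils import quoteattr


def make_proximal_entry(orig, correct=None):
    """Build one prx.db entry in the layout shown in this post.

    ORIG is forced to upper case; quoteattr() escapes XML special
    characters and adds the surrounding quotes.
    """
    attrs = "ORIG=" + quoteattr(orig.upper())
    if correct is not None:
        # Entries without CORRECT never return a suggestion.
        attrs += " CORRECT=" + quoteattr(correct)
    return "<PROXIMAL " + attrs + ">"


def fits_cache(entries, cache_max_size):
    """True if the entry count stays within the configured limit.

    cache_max_size is your own SpellCheckCacheMaxSize value.
    """
    return len(entries) <= cache_max_size


print(make_proximal_entry("aidan", "aida"))
print(make_proximal_entry("at&t"))
```

Stop the IDOL server before editing the file, since the post above notes that IDOL writes the in-memory corrections to prx.db on shutdown.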

Once the back-end setup described above is done, getting spell corrections only requires adding the parameter "spellcheck=true" to the action=query request.
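
As a rough sketch, a query URL with spell-checking turned on could be built in Python like this. The host, port, URL layout and the "text" parameter name are assumptions for illustration; only "action=query" and "spellcheck=true" come from this post, so adjust the rest to your own IDOL setup.

```python
from urllib.parse import quote, urlencode


def build_query_url(text, host="localhost", port=9000):
    """Build a query URL that asks for spell corrections.

    host, port and the "text" parameter are assumed values for
    illustration, not taken from this post.
    """
    params = urlencode(
        {"action": "query", "text": text, "spellcheck": "true"},
        quote_via=quote,  # encode spaces as %20 rather than "+"
    )
    return f"http://{host}:{port}/?{params}"


print(build_query_url("autonmy serch"))
```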
