Wednesday, June 16, 2010

FAST ESP vs Autonomy IDOL : Index and Search process overview

As an Enterprise Search consultant I have worked with the two major search platforms in the industry today and have often found more similarities than differences between them. Clients usually choose one over the other after running a proof of concept (POC) in their own environments. Little information comparing and contrasting the two systems is available on the internet. This FAST ESP vs Autonomy IDOL series is my attempt to share what I know about these platforms and to start discussions that lead to a better understanding.

I have chosen the most basic use-case in Enterprise Search: provide a secure search experience to users.
FAST ESP System has the following components:
  • Connector
  • Document Processing Pipeline
  • Security Access Module (SAM)
  • Query and Results processing Server

A specific connector is used to index content from each content source. For example, a Lotus Notes connector is used for indexing Notes content.
Each connector is assigned to a single document processing pipeline. The pipeline consists of multiple stages which process the content fed from the connector. Example stages: stemming, entity extraction, short summary generation, etc.
Security Access Module has the following components:
  • User monitor  - stores the user group information
  • ACL monitor - provides the ACL information for the content
  • Search filter generator - creates the query filter using user and groups info to filter documents
  • Last minute access rights - performs a last-minute check on the returned results to drop unauthorized ones
Some connectors, like the Lotus Notes connector or the Documentum connector, feed the User monitor with user and group information directly.
For content stores like a filesystem, ACL information is pulled for each document using the ACL monitor as the content fed from the connector passes through the document processing pipeline, and is added to the document as additional metadata. After passing through all the stages, the document is added to the binary search index.
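
The feeding path can be pictured as a chain of stages, each taking a document and returning an enriched one. The following is a minimal Python sketch, not ESP code: the stage names, the dict-based document and the ACL lookup table are all hypothetical, chosen only to show how ACL metadata ends up on the document before it is indexed.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Stage = Callable[[Document], Document]

# Hypothetical stand-in for the ACL monitor: path -> groups allowed to read.
ACL_TABLE = {"/share/hr/policy.doc": ["hr", "managers"]}


def short_summary_stage(doc: Document) -> Document:
    # Illustrative stage: keep the first 40 characters as a teaser.
    doc["teaser"] = str(doc["body"])[:40]
    return doc


def acl_stage(doc: Document) -> Document:
    # Look up the ACL for the document and attach it as extra metadata.
    doc["acl"] = ACL_TABLE.get(str(doc["path"]), [])
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    # Pass the document through each stage in order.
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline(
    {"path": "/share/hr/policy.doc", "body": "Leave policy for 2010 ..."},
    [short_summary_stage, acl_stage],
)
```

After the run, the document carries both the teaser and the ACL list as metadata, which is the shape the search-time security filter relies on.
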
Search:

When a search is run from the UI, the front-end application adds the userid information to the query and passes it to the QR Server. The QR Server consists of two modules: query processing and results processing. A query passed to the QR Server goes through the query processing stages, which can be customized and include spellcheck, synonym expansion and, importantly, security filter generation. In the security filter generation stage, the userid and domain info passed from the search UI is sent to the Search filter generator in SAM. The Search filter generator in turn communicates with the User monitor and returns a security filter, which is then added to the original query.
ESP also provides a last-minute security check, called from a result processing stage. Though supported by all security types, it is most useful for applications handling highly secured content where a real-time security check is required.
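
To make the search-side flow concrete, here is a rough Python sketch of the two security steps: filter generation during query processing and the last-minute check during result processing. It is an illustration under assumptions, not the SAM API: the user table, the function names and the "acl:(...)" filter syntax are all invented for the example.

```python
from typing import Dict, List

# Hypothetical stand-in for the User monitor: user id -> groups.
USER_GROUPS: Dict[str, List[str]] = {"corp\\jdoe": ["hr", "staff"]}


def generate_security_filter(user_id: str) -> str:
    # Search filter generator: turn the user's principals into a filter.
    # The "acl:(...)" syntax is made up for this sketch.
    principals = [user_id] + USER_GROUPS.get(user_id, [])
    return "acl:(" + " OR ".join(principals) + ")"


def process_query(query: str, user_id: str) -> str:
    # Query processing: the security filter is ANDed onto the user's query.
    return f"({query}) AND {generate_security_filter(user_id)}"


def last_minute_check(results: List[dict], user_id: str) -> List[dict]:
    # Result processing: drop hits whose ACL does not admit the user.
    allowed = set([user_id] + USER_GROUPS.get(user_id, []))
    return [r for r in results if allowed & set(r["acl"])]
```

The filter narrows the result set up front using indexed ACL metadata, while the last-minute check re-validates each hit, which is why the latter suits content whose permissions change frequently.
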


Autonomy IDOL System is mostly similar with a few differences. Main components in Autonomy IDOL:
  • Connector
  • Group Server
  • IDOL proxy


The Autonomy IDOL system has various connectors for indexing content, for example the Lotus Notes connector, the Documentum connector, etc. IDOL provides a powerful import filter mechanism to process the content indexed by the connectors. Index tasks are a way of adding more metadata, or manipulating existing metadata, from external systems. After passing through the import jobs and index tasks, the document is indexed into a binary index in the IDOL server.
Search:


Unlike ESP, where the QR Server adds the security filter, in IDOL applications the search front-end calls the Group Server, gets the security filter and adds it to the query sent to the IDOL server.
The Group Server maintains the user and group information for the various repositories whose content is indexed into the IDOL server. The calling search application passes the userid to the Group Server, which then generates an encrypted security string, merging the user and group info from all repositories using aliasing.
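
The difference in who attaches the security information can be sketched from the front-end's point of view. This is a hedged illustration only: fetch_security_string is a hypothetical placeholder for the Group Server call, and the "securityinfo" parameter name is my assumption about how the string is passed on the query.

```python
from urllib.parse import urlencode


def fetch_security_string(user_id: str) -> str:
    # Hypothetical placeholder for the call to the Group Server; a real
    # application receives an encrypted string, it does not build one.
    return f"ENCRYPTED-FOR-{user_id}"


def build_idol_query(text: str, user_id: str) -> str:
    # The front-end, not the IDOL server, attaches the security string.
    params = {
        "action": "query",
        "text": text,
        "securityinfo": fetch_security_string(user_id),  # assumed name
    }
    return urlencode(params)
```
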

The role of the QR Server and, to some extent, the Document Processing Pipeline in the ESP system is handled directly by the IDOL server. The IDOL server provides many configurable parameters which enable fine-tuning the system based on user needs.

Thursday, June 3, 2010

Biasing Search Results

Biasing means favoring or manipulating. Biasing search results finds use in many situations. Consider the following few cases:
  1. biasing results which have been rated by users as excellent or above average compared to others
  2. biasing results based on geographic location of the search user
  3. biasing results based on targeted audience ;-)
One of the tools Autonomy IDOL provides to enable biasing is the "BIASVAL" operator. Using BIASVAL, the relevancy of search results can be manipulated based on certain criteria. For example, content that has a country in its metadata can be biased by country. BIASVAL is specified as part of the fieldtext query that is sent to IDOL.
fieldtext=BIASVAL{US,10}:COUNTRY  ---> biases content having country metadata set to US by 10%

Biasing can be grouped and applied over multiple metadata to achieve more focused search results.

Bias is especially useful in searches run from portals where the portal UI and content are personalised for the user.
It is also very useful in cases where the ACLs are not restrictive enough to narrow the results on their own.

Adding bias and creating the fieldtext at run-time using the specified criteria adds a lot of dynamism to the static search queries.
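
Building the fieldtext at run-time could look something like the Python sketch below. The helper names and the RATING field are hypothetical, and joining several BIASVAL clauses with AND is an assumption on my part, so check which fieldtext operator fits your case before using it.

```python
from typing import List, Tuple


def biasval(field: str, value: str, percent: int) -> str:
    # Same shape as the example above: BIASVAL{value,percent}:FIELD
    return f"BIASVAL{{{value},{percent}}}:{field}"


def build_fieldtext(biases: List[Tuple[str, str, int]], joiner: str = "AND") -> str:
    # Joining clauses with AND is an assumption for this sketch.
    return f" {joiner} ".join(biasval(f, v, p) for f, v, p in biases)
```

For a US user who prefers highly rated content, build_fieldtext([("COUNTRY", "US", 10), ("RATING", "excellent", 20)]) yields a fieldtext value that can be attached to the query.
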

Thursday, April 8, 2010

Spell-checking in IDOL

Autonomy provides spell-correction functionality to identify misspelled words in queries and alert users to them. This feature can be configured to trigger only when specific conditions are met, such as:

1) return spell corrections only when the query has fewer than 5 terms in it (these 5 terms are counted after the stop words are eliminated).
IDOL cfg param: SpellCheckMaxCheckTerms=5
2) return spell corrections only when the suspect term occurs in fewer than a certain number of documents. This check lets a term become legitimate once it crosses a threshold number of document occurrences.
IDOL cfg param: SpellCheckIncorrectMaxDocOccs=1000

and for the correction:

1) return a correction only if it occurs in at least a minimum number of documents. This prevents another misspelled term from being returned as a suggestion for the original term.
IDOL cfg param: SpellCheckCorrectMinDocOccs=100
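
The way the three thresholds interact can be modelled in a few lines of Python. This is only my reading of the rules above, not IDOL's implementation: the function, the doc_occs mapping and the exact boundary behaviour (strictly below vs. at the limit) are assumptions.

```python
MAX_CHECK_TERMS = 5            # SpellCheckMaxCheckTerms
INCORRECT_MAX_DOC_OCCS = 1000  # SpellCheckIncorrectMaxDocOccs
CORRECT_MIN_DOC_OCCS = 100     # SpellCheckCorrectMinDocOccs


def should_suggest(query_terms, term, candidate, doc_occs):
    """Return True if `candidate` may be offered as a correction for `term`.

    query_terms: query terms left after stop-word removal.
    doc_occs: hypothetical mapping of term -> number of documents containing it.
    """
    # Condition 1: the query has fewer than the configured number of terms.
    if len(query_terms) >= MAX_CHECK_TERMS:
        return False
    # Condition 2: a frequent enough term is treated as legitimate.
    if doc_occs.get(term, 0) >= INCORRECT_MAX_DOC_OCCS:
        return False
    # Correction rule: the suggestion itself must be well attested.
    return doc_occs.get(candidate, 0) >= CORRECT_MIN_DOC_OCCS
```
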

Whenever IDOL returns a correction for a misspelled term, it stores the information in memory. When the IDOL server is brought down, it writes all the corrections to a file named "prx.db" under the content's "main" directory. This file is in XML format and looks like:

<PROXIMALS>
<PROXIMAL ORIG="AGOSTINI">
<PROXIMAL ORIG="AGPM">
<PROXIMAL ORIG="AIBD">
<PROXIMAL ORIG="AIDA" CORRECT="aid">
<PROXIMAL ORIG="AIDAN" CORRECT="aida">
</PROXIMALS>

ORIG - is the term identified by IDOL as mis-spelled
CORRECT - is the suggested correction for it

Entries in the file which do not have the "CORRECT" part are terms for which no correction will be returned.

The prx.db file can be edited to add or remove specific entries. Make sure the entries you change are valid XML (escape any XML special characters). Also, the term specified in "ORIG" MUST always be in upper-case. For example:

incorrect(will not load): <PROXIMAL ORIG="aidan" CORRECT="aida">
correct: <PROXIMAL ORIG="AIDAN" CORRECT="aida">

The number of entries in the file should not exceed the "SpellCheckCacheMaxSize" parameter specified in the IDOL server cfg.
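
When adding entries by hand it is easy to forget the upper-casing or the escaping, so a small helper can generate the lines. The following Python sketch uses only the standard library; the function name is my own, and the tag layout is simply copied from the sample above rather than taken from any format specification.

```python
from xml.sax.saxutils import quoteattr


def proximal_entry(orig: str, correct: str = "") -> str:
    # ORIG must be upper-case; quoteattr escapes XML special characters
    # and wraps the value in quotes.
    attrs = f"ORIG={quoteattr(orig.upper())}"
    if correct:
        attrs += f" CORRECT={quoteattr(correct)}"
    # Tag layout copied from the sample prx.db shown above.
    return f"<PROXIMAL {attrs}>"
```

Calling proximal_entry("aidan", "aida") produces the "correct" form shown earlier, and leaving out the second argument produces an entry that returns no correction.
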

Once the back-end setup described above is done, adding the parameter "spellcheck=true" to an action=query request is all it takes to get spell corrections.

Thursday, March 4, 2010

Search result summaries in Autonomy IDOL

Autonomy IDOL provides multiple ways to generate summaries for the search results displayed to the user. I will list three types and go into the details of how they work:

  • Summary from a field

  • Contextual summary

  • Conceptual summary
Summary from a field is the simplest way to generate a summary for a particular document. It is derived from a specific field of the document itself, for example the description of a PDF document, a custom field created during content authoring, or even a set of fields from the content management system.

For content which had a good summary added when it was created, this approach fits best. For content which is a mix of managed and unmanaged documents, this approach fails: documents which had no description defined when they were created will have no summary when they come back in search results. Another drawback is a potential lack of highlighting of the search terms in the summary. Since the summary is static, it may or may not contain the search terms.

Contextual summary is a dynamic summary generated by IDOL when the search results are returned for a particular query. IDOL looks up the search terms in the document and picks the sentences which have the highest relevance and also contain the keywords. The number of sentences and the number of characters in the summary are parameters to the search query. This approach almost always highlights the search terms in the summary, as the summary is picked from the location of the search terms themselves. Synonyms and stemmed versions of search terms are highlighted as well.
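
The idea can be illustrated with a toy Python version. This is deliberately naive and is not IDOL's algorithm: it ranks sentences purely by how many query terms they contain, ignores stemming and synonyms, and all names in it are my own.

```python
import re


def contextual_summary(text, query_terms, sentences=2, characters=200):
    # Split into sentences, then rank by how many query terms each contains.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    terms = {t.lower() for t in query_terms}

    def hits(sentence):
        words = set(re.findall(r"\w+", sentence.lower()))
        return len(terms & words)

    # Keep only sentences that mention a query term; ties stay in
    # document order because sorted() is stable.
    ranked = sorted((s for s in parts if hits(s)), key=hits, reverse=True)
    return " ... ".join(ranked[:sentences])[:characters]
```

Because only sentences containing a query term are kept, the summary always shows the terms in context, at the cost of saying little about the document as a whole.
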

As for the drawbacks of this approach, it fails to give the user an overview of the document, though it does show the context of the searched terms within the actual document. If the content is not cleaned up properly during the indexing process, the context can be meaningless: for example, search terms present in a table in the document, search terms enclosed in a box drawn with # characters, or search terms in a header or footer. In these cases, the contextual summary would show the dotted lines, underscores or # characters before and after the terms, making the summary not very useful. Careful processing of content during the index process helps avoid these issues.

Third type: Conceptual summary is generated by IDOL by looking at the most prominent terms in the document. IDOL assigns weights to different terms in the document based on their counts and inverse frequency, besides applying other statistical algorithms. This approach would be a fallback if #1 and #2 do not yield satisfactory results.
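
The "counts and inverse frequency" part resembles classic TF-IDF weighting, which the Python sketch below illustrates. It is a generic textbook-style calculation with a smoothed logarithm of my choosing, not what IDOL actually computes, and it leaves out the other statistical algorithms mentioned above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lower-case alphabetic tokens only; good enough for a sketch.
    return re.findall(r"[a-z]+", text.lower())


def top_terms(doc, corpus, n=3):
    # Weight = term count in doc * smoothed inverse document frequency.
    counts = Counter(tokenize(doc))
    total = len(corpus)

    def weight(term):
        containing = sum(1 for d in corpus if term in set(tokenize(d)))
        return counts[term] * math.log((1 + total) / (1 + containing))

    return sorted(counts, key=weight, reverse=True)[:n]
```

A term such as "the" that appears in every document gets a weight of zero here, so it never surfaces as a prominent term no matter how often it occurs.
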

For #2 and #3, IDOL lets you specify which fields in a document are used to generate the summary. These source fields are set in the IDOL.cfg file:

[FieldProcessing]
Number=20
0=SetSourceFields

....

[SetSourceFields]
// Specify which fields are to be used as the source for suggest, summaries, termgetbest
// If none are specified, it uses the indexfields
Property=SourceFields
PropertyFieldCSVs=*/DRETITLE,*/DRECONTENT,

[Properties]

0=SourceFields

[SourceFields]
SourceType=TRUE

----

In the above configuration the fields DRETITLE and DRECONTENT are enabled for summary extraction by IDOL. Any changes to these fields would require a reindex of the content.

Now, while querying, the following parameter defines which type of summary you get:

Summary=concept

Summary=context
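
Putting the query-side pieces together, a front-end might assemble the parameters roughly as in the Python sketch below. The function is hypothetical, and "sentences" and "characters" are my assumed names for the number-of-sentences and number-of-characters parameters mentioned earlier, so verify them against your IDOL version.

```python
from urllib.parse import urlencode


def summary_query_params(text, summary="context", sentences=2, characters=250):
    # Build the parameter string for an action=query request.
    params = {
        "action": "query",
        "text": text,
        "summary": summary,        # "context" or "concept", as above
        "sentences": sentences,    # assumed parameter name
        "characters": characters,  # assumed parameter name
    }
    return urlencode(params)
```
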