Thursday, March 4, 2010

Search result summaries in Autonomy IDOL

Autonomy IDOL provides multiple ways to generate summaries for the search results displayed to the user. I will list three types and go into the details of how they work:

  • Summary from a field

  • Contextual summary

  • Conceptual summary

Summary from a field is the simplest way to generate a summary for a particular document. It is taken from a specific field of the document itself: for example, the description of a PDF document, a custom field created during content authoring, or even a set of fields from the content management system.

This approach fits best for content that was given a good summary when it was created. It fails for content that mixes managed and unmanaged documents: documents that had no description defined at creation come back in search results without any summary. Another drawback is a potential lack of search-term highlighting in the summary. Since the summary is static, it may or may not contain the search terms.

Contextual summary is a dynamic summary generated by IDOL when the search results are returned for a particular query. IDOL looks up the search terms in the document and picks the sentences that have the highest relevance and also contain the keywords. The number of sentences and the number of characters in the summary are parameters of the search query. This approach almost always highlights the search terms in the summary, because the summary is picked from the locations of the search terms themselves. Synonyms and stemmed versions of the search terms are highlighted as well.
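To make the idea concrete, here is a toy Python sketch of a contextual summary. It is not IDOL's algorithm (which also handles relevance ranking, synonyms and stemming); it only illustrates picking the sentences that contain the query terms, capped by a sentence count and a character count. The function name and its parameters are made up for this illustration.

```python
import re

def contextual_summary(text, query_terms, max_sentences=2, max_chars=200):
    """Toy approximation: keep the sentences that contain the most query terms."""
    terms = {t.lower() for t in query_terms}
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scored = []
    for position, sentence in enumerate(sentences):
        words = set(re.findall(r"\w+", sentence.lower()))
        hits = len(terms & words)
        if hits:
            scored.append((-hits, position, sentence))
    # Most hits first, ties broken by position in the document
    best = sorted(scored)[:max_sentences]
    # Show the chosen sentences in document order
    best.sort(key=lambda item: item[1])
    summary = " ... ".join(sentence for _, _, sentence in best)
    return summary[:max_chars]
```

Because only sentences containing a query term are candidates, the output always carries something to highlight, which is the property described above. It also shows the weakness: a document with no clean sentences around the terms yields a poor summary.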

As for the drawbacks of this approach, it fails to give the user an overview of the document, although it does show the context in which the searched terms appear in the actual document. If the content is not cleaned up properly during indexing, the context can be meaningless: for example, when the search terms sit in a table, in a box drawn with # characters, or in a header or footer. In these cases the contextual summary shows the dotted lines, underscores or # characters before and after the terms, which makes the summary not very useful. Careful processing of the content during indexing helps avoid these issues.

The third type, conceptual summary, is generated by IDOL by looking at the most prominent terms in the document. IDOL assigns weights to the terms in the document based on their counts and inverse frequency, besides applying other statistical algorithms. This approach would be a fallback if summary from a field and contextual summary do not yield satisfactory results.
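The weighting by counts and inverse frequency can be illustrated with a small Python sketch. This is a plain TF-IDF stand-in, not IDOL's actual statistics: each term is weighted by its count in the document times a smoothed inverse document frequency over a sample corpus, and the sentence with the highest average term weight becomes the summary. All names here are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def term_weights(document, corpus):
    """Weight = count in the document * smoothed inverse document frequency."""
    counts = Counter(tokenize(document))
    n_docs = len(corpus)
    doc_sets = [set(tokenize(d)) for d in corpus]
    weights = {}
    for term, count in counts.items():
        df = sum(1 for s in doc_sets if term in s)
        weights[term] = count * math.log((1 + n_docs) / (1 + df))
    return weights

def conceptual_summary(document, corpus, max_sentences=1):
    """Return the sentences whose terms carry the highest average weight."""
    weights = term_weights(document, corpus)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(weights.get(t, 0.0) for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    chosen = sorted(ranked[:max_sentences])  # back to document order
    return " ".join(sentences[i] for i in chosen)
```

Unlike the contextual approach, nothing here depends on the query, so the same document always gets the same summary regardless of what the user searched for.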

For contextual and conceptual summaries, IDOL lets you specify which fields of a document the summary is extracted from. You set this in the IDOL.cfg file:

[FieldProcessing]
Number=20
0=SetSourceFields

....

[SetSourceFields]
// Specify which fields are to be used as the source for suggest, summaries, termgetbest
// If none are specified, it uses the indexfields
Property=SourceFields
PropertyFieldCSVs=*/DRETITLE,*/DRECONTENT,

[Properties]

0=SourceFields

[SourceFields]
SourceType=TRUE

----

In the above configuration, the fields DRETITLE and DRECONTENT are enabled for summary extraction by IDOL. Any change to this list of fields requires a reindex of the content.

Now, when querying, the following parameter defines which type of summary you get:

Summary=concept

Summary=context
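As a rough illustration, a query URL carrying the Summary parameter could be assembled like this in Python. The host, port and helper name are placeholders for this example. The Sentences and Characters parameter names are my assumption for the sentence and character limits mentioned earlier, so check them against the documentation for your IDOL version.

```python
from urllib.parse import quote, urlencode

def build_query_url(text, summary="Context", sentences=None, characters=None,
                    host="localhost", port=9000):
    """Build a query URL; host and port are hypothetical defaults."""
    params = {"action": "Query", "Text": text, "Summary": summary}
    # Assumed names for the summary size limits
    if sentences is not None:
        params["Sentences"] = sentences
    if characters is not None:
        params["Characters"] = characters
    # quote_via=quote encodes spaces as %20 instead of +
    return f"http://{host}:{port}/{urlencode(params, quote_via=quote)}"

print(build_query_url("wind turbine", summary="Context", sentences=2, characters=300))
```

Switching between the two summary types is then just a matter of passing summary="Concept" or summary="Context".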

