Friday, May 20, 2016

Solr search customization: building a custom filter


Solr has a lot of bells and whistles out of the box for building a robust search platform for a company. But there will always be business use cases where a few customizations are needed to achieve the desired results.

In this post, we consider a custom requirement for a name-search application that involves names of European and American origin. For example, a name of European origin such as "De Vera Michael" should be returned in the search results when someone searches for "devera".

This is not currently supported out of the box by any of the Solr analyzers and filters, but we can build our own custom plugin to meet the above requirement and add it to the platform. Solr's support for user-built custom filters is very powerful and a differentiating factor from the commercial search engines on the market.

Solr allows custom behavior to be added to both index and search operations by manipulating the index/search pipelines defined in solrconfig.xml. The customization needed to support the above requirement is to intercept a given field at index time, at the character-stream level, and modify the stream by adding the merged tokens we are looking for before the next stage in the pipeline is called. Because the change we are making is at this core level, it needs to execute before the search engine generates tokens from the character stream.

The following field-type configuration shows the custom char filter "WordConcatenateFilter" added at the front of the index-time pipeline for processing fields of type "text_en":

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">

        <charFilter class="com.rupendra.solr.filter.concatenate.WordConcatenateFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>



Solr plugins use the factory pattern: we create the filter class and its factory class, package them into a jar file, and deploy the jar into one of the directories on Solr's classpath.
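
A minimal sketch of the two classes follows. The class and package names match the configuration above, while the internals (pairwise word merging, identity offset correction) are a simplified illustration rather than the exact implementation in the Github repo linked below:

    // WordConcatenateFilterFactory.java -- Solr instantiates this from the
    // <charFilter> element in the fieldType definition.
    package com.rupendra.solr.filter.concatenate;

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.Map;

    import org.apache.lucene.analysis.CharFilter;
    import org.apache.lucene.analysis.util.CharFilterFactory;

    public class WordConcatenateFilterFactory extends CharFilterFactory {

        public WordConcatenateFilterFactory(Map<String, String> args) {
            super(args);
        }

        @Override
        public Reader create(Reader input) {
            return new WordConcatenateFilter(input);
        }
    }

    // WordConcatenateFilter.java (separate file) -- buffers the stream and
    // appends each adjacent word pair merged and lower-cased, so a field value
    // of "De Vera Michael" also yields the tokens "devera" and "veramichael".
    class WordConcatenateFilter extends CharFilter {

        private StringReader buffered;

        WordConcatenateFilter(Reader in) {
            super(in);
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            if (buffered == null) {
                // Read the entire character stream for this field value.
                StringBuilder sb = new StringBuilder();
                char[] tmp = new char[1024];
                int n;
                while ((n = input.read(tmp)) != -1) {
                    sb.append(tmp, 0, n);
                }
                String original = sb.toString();
                // Append the merged form of every adjacent word pair.
                StringBuilder out = new StringBuilder(original);
                String[] words = original.trim().split("\\s+");
                for (int i = 0; i + 1 < words.length; i++) {
                    out.append(' ')
                       .append(words[i].toLowerCase())
                       .append(words[i + 1].toLowerCase());
                }
                buffered = new StringReader(out.toString());
            }
            return buffered.read(cbuf, off, len);
        }

        // Simplification: identity mapping; a production filter would map
        // offsets in the appended text back to the original stream.
        @Override
        protected int correct(int currentOff) {
            return currentOff;
        }
    }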

Source code for the filter is available on Github: SolrCustomFilter

If we need deeper customization at the Lucene codec level, we can build a custom Lucene codec, build Solr against the customized Lucene code, and deploy it. Lucene has come a long way in the last 10+ years, so the need for this is very remote, but it is possible.

 

Search 2.0


Search engines have evolved over the past 20 years along with the rest of information technology. Internet search engines have led the way and arguably set the direction of advancement for the industry overall. Google has become synonymous with search and brought the search box to every electronic device in use today.

Search engines on the enterprise side have also evolved, though at a slower pace compared to Google, and have moved on from simple text search to multi-dimensional search features. Google has influenced enterprise search engines to deliver features like sponsored links, type-ahead, and more-like-this. We can generalize the evolution of enterprise search engines into three generations:

  • Search Gen 1 - full text search, manual indexing of data
  • Search Gen 2 - feature rich search, connectors for every data source
  • Search Gen 3 - convergence of search and big data, integration with distributed storage and distributed computation systems

Gen 1 - examples include the early stages of the Verity K2 platform and the open-source Lucene library
Gen 2 - examples include the HP IDOL platform, FAST, and the open-source Solr and Elasticsearch
Gen 3 - still evolving: Lucidworks Fusion, Cloudera Search

In the case of Gen 2 search engines, the commercial engines had an initial advantage building upon the features of Gen 1: Autonomy acquired Verity and integrated its KeyView filters and a ton of connectors into the IDOL platform. Open-source search engines competed with the commercial versions and started to match them feature for feature, largely building upon the Lucene search library. More or less all engines in Gen 2 are equally capable at the moment, and everything that can be done with one engine can be done with any other engine by tweaking configuration or, for finer-grained details, modifying source code as needed.

Convergence of Search and Big Data spaces

Hadoop was started as a distributed data platform by Doug Cutting to support the Nutch crawler, with the ambition of indexing data at internet scale. Over time Hadoop separated from search and evolved into its own ecosystem with a multitude of tools for analytics and machine learning. With the emergence of distributed computing platforms like Spark, the industry consensus is that a convergence is already happening in the big data and search space, and new platforms built with the core capabilities of search plus the power of distributed computation/storage will emerge in the near future. One such search product, from the reputed vendor Lucidworks, is Fusion.

Areas of convergence

  • Distributed storage using HDFS
  • Distributed computation
  • Event streaming

System Architecture

Gen 3 systems are still in their early stages with these initial capabilities and will certainly build upon them and add new features. Cloudera Search is tightly integrated with the Cloudera Hadoop Distribution (CDH) and is capable of reading data streams from Flume for near-real-time (NRT) indexing. The following diagram from the Cloudera website shows where search fits in CDH:




The Lucidworks Fusion platform aligns more closely with enterprise search engines, offering all the advantages of Cloudera Search along with external data sources and extensible APIs. The following diagram from the Fusion website shows the components:


Ease of Installation

Fusion is supported on both Windows and Linux platforms. The default installation sets up the following components:
  • Zookeeper - to manage all nodes
  • Solr
  • REST API services
  • Admin UI
  • Spark
Architecture diagram from the Fusion website:



Core Search Features

Pagination 

Used for iterating through the results returned by the search engine.

rows -> number of results per page
start -> offset of the first result to return

Example
&rows=5&start=0
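
For example, &rows=5&start=10 requests the third page of five results.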

Geo filtering

Used for returning results that are within a specific geographical radius from the given location.

sfield -> a field defined with the "location" field type, whose value is stored as "latitude,longitude" in the indexed document
d -> radius distance in kilometers

Example
fq={!geofilt pt=47.7189,-117.435 sfield=geoPoint d=5}

Geo sorting

Used to sort results by distance from the given location, closest first.

Example
sort=geodist(geoPoint,47.7189,-117.435) asc

Geo distance

Used to return the distance from the given location for each search result.

Example
fl=dist:geodist(geoPoint,47.7189,-117.435)
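
Putting the three geo features together, a single request might look like the following sketch (hypothetical host, collection, and match-all query; geoPoint follows the examples above):

    http://localhost:8983/solr/providers/select?q=*:*
        &fq={!geofilt pt=47.7189,-117.435 sfield=geoPoint d=5}
        &sort=geodist(geoPoint,47.7189,-117.435) asc
        &fl=*,dist:geodist(geoPoint,47.7189,-117.435)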

Result combining

Used to collapse duplicate results and show a single result.

Example
fq={!collapse field=providerGuid}

Filtering

Used to specify additional filters along with the query to search engine.

Example
fq=Masters:DEGREE

Fuzzy search

Fuzzy search allows users to enter misspelled search terms and still be able to get relevant results.

Example
q=text~
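
The trailing tilde invokes Lucene's edit-distance matching; an optional number caps the allowed edits, e.g. q=text~1.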

Phonetic search

Phonetic search allows users to enter search terms that sound like the actual word and still get relevant results. There are multiple phonetic algorithms, each geared toward a specific kind of dataset. We are using the Beider-Morse phonetic algorithm here. The algorithm is specified as a filter in the analyzer section of the field type definitions, as shown below. Later, during field mapping, the defined field type is associated with the field to which we want to apply phonetic matching.


    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
         -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>


Wildcard 

Used to support partial search terms entered by users.

Example
q=text*
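
The asterisk matches zero or more characters; a question mark matches exactly one, e.g. q=te?t.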

Faceting

Used to build the filters on the results page for the given criteria.

Example
&facet=true&facet.field=color&facet.field=department&facet.limit=5

Schema

Schema defines how fields are parsed and indexed during the indexing pipeline and how searches are performed on those fields during the search pipeline. A minimal illustration appears below.
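
The following sketch shows schema entries tying fields to the field types used earlier (the field names are hypothetical):

    <field name="providerName" type="text_en" indexed="true" stored="true"/>
    <field name="geoPoint" type="location" indexed="true" stored="true"/>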

Field Mapping

This is the critical step in deploying a search application before indexing any data. It involves comprehensive categorization of all the metadata for the content being indexed and labeling each metadata field based on the business functionality expected from the search application.

Example mapping: Field Mapping


This concludes the current post detailing the search landscape and an in-depth functional review of the Fusion/Solr platform.

Wednesday, June 16, 2010

FAST ESP vs Autonomy IDOL : Index and Search process overview

As an Enterprise Search consultant I have come across the two major search platforms in the industry today and often found more similarities than differences between them. Usually, clients choose one over the other after doing a POC in their own environments. Not much information is available on the internet comparing and contrasting these systems. This FAST ESP vs Autonomy IDOL series is my attempt to share my knowledge of these platforms and engage in discussions to gain a better understanding.

I have chosen the most basic use case in Enterprise Search: providing a secure search experience to users.
FAST ESP System has the following components:
  • Connector
  • Document Processing Pipeline
  • Security Access Module (SAM)
  • Query and Results processing Server

A specific connector is used to index content from each content source. For example, a Lotus Notes connector is used for indexing Notes content.
Each connector is assigned to a single document processing pipeline. The document processing pipeline consists of multiple stages which process the content fed from the connector. Example stages: stemming, entity extraction, short summary generation, etc.
Security Access Module has the following components:
  • User monitor  - stores the user group information
  • ACL monitor - provides the ACL information for the content
  • Search filter generator - creates the query filter using user and groups info to filter documents
  • Last minute access rights - performs a last min check on the results returned to drop unauthorized results
Some connectors, like the Lotus Notes connector or the Documentum connector, feed the User monitor with user and group information directly.
As content fed from a connector passes through the document processing pipeline, for content stores like filesystems, ACL information is pulled for each document using the ACL monitor and added as additional metadata to the document. After passing through all the stages, the document is added to the binary search index.
Search:

When a search is run from the UI, the front-end application adds the userid information to the query and passes it to the QR Server. The QR Server consists of two modules: query processing and results processing. When a search query is passed along to the QR Server, it goes through the query processing stages, which can be customized and include spellcheck, synonym expansion and, importantly, security filter generation. In the security filter generation stage, the userid and domain info passed from the search UI is sent to the Search filter generator in the SAM; the Search filter generator in turn communicates with the User monitor and returns a security filter which can then be added to the original query.
ESP also provides a last-minute security check called from a results processing stage. Though supported by all security types, this is most useful for applications handling highly secured content where a real-time security check is required.


The Autonomy IDOL system is mostly similar, with a few differences. Main components in Autonomy IDOL:
  • Connector
  • Group Server
  • IDOL proxy


The Autonomy IDOL system has various connectors for indexing content, for example the Lotus Notes connector, the Documentum connector, etc. IDOL provides a powerful import-filter mechanism to process the content indexed by the connectors. Index tasks are a way of adding more metadata, or manipulating existing metadata, from external systems. After passing through the import jobs and index tasks, the document is indexed into a binary index in the IDOL server.
Search:


Unlike ESP, where the QR Server adds the security filter, in IDOL applications the search front-end makes the call to the Group Server, gets the security filter, and adds it to the query sent to the IDOL Server.
The Group Server maintains the user and group information for the various repositories whose content is indexed into the IDOL server. The calling search application passes the userid to the Group Server, which then generates an encrypted security string merging the user and group info from all repositories using aliasing.

The role of the QR Server, and to some extent the Document Processing Pipeline, in the ESP system is handled directly by the IDOL server. The IDOL server provides many configurable parameters which enable fine-tuning the system based on user needs.

Thursday, June 3, 2010

Biasing Search Results

Biasing means favoring or manipulating. Biasing search results finds use in many situations. Consider the following cases:
  1. biasing results which have been rated by users as excellent or above average compared to others
  2. biasing results based on geographic location of the search user
  3. biasing results based on targeted audience ;-)
One of the tools Autonomy IDOL provides to enable biasing is the "BIASVAL" operator. Using BIASVAL, the relevancy of search results can be manipulated based on certain criteria. For example, content having a country value in its metadata can be biased by country. BIASVAL is specified as part of the fieldtext query that is sent to IDOL.
fieldtext=BIASVAL{US,10}:COUNTRY  ---> biases content having country metadata set to US by 10%

Biasing can be grouped and applied over multiple metadata fields to achieve more focused search results, as in the sketch below.
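
For instance, assuming a hypothetical RATING field alongside COUNTRY, two biases can be combined with the AND operator in a single fieldtext expression:

fieldtext=BIASVAL{US,10}:COUNTRY+AND+BIASVAL{excellent,5}:RATING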

Bias is especially useful in searches run from portals, where the portal UI and content are personalized for the user. It is also very useful in cases where the ACL is not restrictive enough to filter the results.

Adding bias and creating the fieldtext at run-time using the specified criteria adds a lot of dynamism to otherwise static search queries.

Thursday, April 8, 2010

Spell-checking in IDOL

Autonomy provides spell-correction functionality to identify and alert users to misspelled words in their queries. This feature can be configured to trigger only when specific conditions are met, such as:

1) return spell corrections only when the query has fewer than 5 terms in it (these terms are counted after the stop words are eliminated).
IDOL cfg param: SpellCheckMaxCheckTerms=5
2) treat a term as misspelled only while it occurs in fewer than a certain number of documents. This check lets a term become legitimized once it crosses a threshold number of document occurrences.
IDOL cfg param: SpellCheckIncorrectMaxDocOccs=1000

and for the correction:

1) return a correction only if it occurs in a minimum number of documents. This prevents another misspelled term from being returned as a suggestion for the original term.
IDOL cfg param: SpellCheckCorrectMinDocOccs=100
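
Taken together, and assuming these parameters live in the [Server] section of the IDOL server cfg, the setup with the example values above would look like:

[Server]
SpellCheckMaxCheckTerms=5
SpellCheckIncorrectMaxDocOccs=1000
SpellCheckCorrectMinDocOccs=100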

Whenever IDOL returns a correction for a misspelled term, it stores the info in memory and, when the IDOL server is brought down, writes all the corrections to a file named "prx.db" under the content's "main" directory. This file is in XML format and looks like:

<PROXIMALS>
<PROXIMAL ORIG="AGOSTINI">
<PROXIMAL ORIG="AGPM">
<PROXIMAL ORIG="AIBD">
<PROXIMAL ORIG="AIDA" CORRECT="aid">
<PROXIMAL ORIG="AIDAN" CORRECT="aida">
</PROXIMALS>

ORIG - the term identified by IDOL as misspelled
CORRECT - the suggested correction for it

Entries in the file which do not have the "CORRECT" part are terms that will not return any corrections.

This prx.db file can be edited to add or remove specific entries. Make sure the changes you make to the file are valid XML (escape any XML special characters). Also, the term specified in "ORIG" MUST always be in upper-case. For example:

incorrect (will not load): <PROXIMAL ORIG="aidan" CORRECT="aida">
correct: <PROXIMAL ORIG="AIDAN" CORRECT="aida">

The number of entries added to the file should not exceed the "SpellCheckCacheMaxSize" parameter specified in the IDOL server cfg.

After all the back-end setup mentioned above is done, all it takes is adding the parameter "spellcheck=true" to action=query to get spell corrections.
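
A sample query URL (hypothetical host and ACI port):

http://idolserver:9000/action=query&text=serch+trends&spellcheck=true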

Thursday, March 4, 2010

Search result summaries in Autonomy IDOL

Autonomy IDOL provides multiple ways to generate summaries for the search results displayed to the user. I will list three types and go into the details of how they work:

  • Summary from a field

  • Contextual summary

  • Conceptual summary
Summary from a field is the simplest way to generate a summary for a particular document. It is derived from a specific field of the document itself: for example, the description of a pdf document, a custom field created during content authoring, or even a set of fields from the content management system.

For content which has a good summary added during its creation, this approach fits best. For content which is a mix of both managed and unmanaged documents, this approach fails: documents which had no description defined when they were created will not have any summary when they come back in search results. Another drawback is the potential lack of highlighting of the search terms in the summary; since the summary is static, it may or may not contain the search terms.

Contextual summary is a dynamic summary generated by IDOL when the search results are returned for a particular query. IDOL looks up the search terms in the document and picks the sentences which have the highest relevance and also contain the keywords. The number of sentences and the number of characters in the summary are parameters to the search query. This approach almost always highlights the search terms in the summary, since the summary is picked from the location of the search terms themselves. Synonyms and stemmed versions of the search terms are highlighted as well.

Coming to the drawbacks of this approach: it fails to present the user an overview of the document, though it can show the context of the searched terms in the actual document. If the content is not massaged properly during the indexing process, the context can be meaningless: for example, search terms present in a table in the document, search terms enclosed in a box created out of # characters, or search terms in a header or footer. In these cases, the contextual summary would present the dotted lines, underscores, or # characters before and after the terms, making the summary not very useful. Careful processing of content during the indexing process helps avoid these issues.

The third type, conceptual summary, is generated by IDOL by looking at the most prominent terms in the document. IDOL assigns weights to different terms in the document based on their counts and inverse frequency, besides applying other statistical algorithms. This approach is a fallback when #1 and #2 do not yield satisfactory results.

For #2 and #3, IDOL lets you specify which fields in a document are involved in generating the summary.

IDOL lets you specify the fields from which you want the summary to be extracted in the IDOL.cfg file:

[FieldProcessing]
Number=20
0=SetSourceFields

....

[SetSourceFields]
// Specify which fields are to be used as the source for suggest, summaries, termgetbest
// If none are specified, it uses the indexfields
Property=SourceFields
PropertyFieldCSVs=*/DRETITLE,*/DRECONTENT,

[Properties]

0=SourceFields

[SourceFields]
SourceType=TRUE

----

In the above configuration the fields DRETITLE and DRECONTENT are enabled for summary extraction by IDOL. Any changes to these fields would require a reindex of the content.

Now, while querying, the following parameter defines which type of summary you get:

Summary=concept

Summary=context
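
A sample contextual-summary query (hypothetical host and ACI port; the sentences and characters parameters cap the summary size as described above):

http://idolserver:9000/action=query&text=enterprise+search&summary=context&sentences=3&characters=300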


Saturday, June 9, 2007

Welcome

Welcome to Search Trends!

This blog is created to share ideas about enterprise search. Current trends, market experiences, future roadmaps, technical aspects of various search engines like Autonomy, FAST, Google, Endeca, Omnifind, Lucene and others.