Friday, May 20, 2016

Solr search customization: building a custom filter


Solr has a lot of bells and whistles available out of the box for building a robust search platform for a company. But there will always be business use cases where a few customizations are needed to achieve the desired results.

In this post, we consider a custom requirement from a name search application that handles names of European and American origin. For example, the name "De Vera Michael", which is of European origin, should be returned in the search results when someone searches for "devera".

None of the analyzers and filters that ship with Solr support this out of the box. But we can build our own custom plugin to meet the requirement and add it to the platform. The ability for users to build custom filters is very powerful and a differentiating factor from commercial search engines in the market.

Solr lets us add custom behavior to both index and search operations by changing the analysis pipelines, which are defined per field type in the schema (schema.xml or the managed schema). To support the requirement above, we need to intercept a given field at index time, at the character stream level, and append the merged tokens we are looking for before the next stage in the pipeline is called. Because the change is made to the raw text, it has to run before the tokenizer generates tokens from the character stream, which is why it is plugged in as a character filter rather than a token filter.
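To make the idea concrete, here is a minimal sketch of the character stream rewrite using only the JDK. It is not the actual plugin code: the class name `WordConcatenateSketch`, its method names, and the rule of merging every adjacent pair of words are assumptions made for illustration. A real Lucene char filter would extend `org.apache.lucene.analysis.CharFilter` (which is itself a `java.io.Reader`) and would also need to implement `correct(int)` so that offsets still map back to the original text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical, JDK-only illustration; not the plugin's real implementation.
public class WordConcatenateSketch {

    /** Returns the text followed by every adjacent word pair joined into one word. */
    public static String appendMergedPairs(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return text; // nothing to merge
        }
        String[] words = trimmed.split("\\s+");
        StringBuilder out = new StringBuilder(text);
        for (int i = 0; i + 1 < words.length; i++) {
            // Assumed rule: merge each word with the word that follows it.
            out.append(' ').append(words[i]).append(words[i + 1]);
        }
        return out.toString();
    }

    /** Drains a character stream into a String. */
    public static String readAll(Reader input) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = input.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Wraps the field's character stream so the tokenizer also sees the merged words. */
    public static Reader wrap(Reader input) {
        return new StringReader(appendMergedPairs(readAll(input)));
    }

    public static void main(String[] args) {
        Reader rewritten = wrap(new StringReader("De Vera Michael"));
        // prints: De Vera Michael DeVera VeraMichael
        System.out.println(readAll(rewritten));
    }
}
```

With the extra "DeVera" in the stream, a whitespace tokenizer followed by a lowercase filter would produce the token "devera" at index time, which is what a search for "devera" needs to match.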

The following field-type configuration shows the custom filter, registered through its factory class "WordConcatenateFilterFactory", added as a charFilter to the index pipeline for fields of type "text_en":

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">

        <charFilter class="com.rupendra.solr.filter.concatenate.WordConcatenateFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>



Solr plugins use the Factory pattern, so we need to create both the Filter class and the FilterFactory class, package them into a jar file, and deploy the jar into one of the directories on Solr's classpath.

Source code for the filter is available on GitHub: SolrCustomFilter

If we need a deeper customization at the Lucene codec level, we can write a custom Lucene codec, build Solr against the customized Lucene code, and deploy it. Lucene has come a long way in the last 10+ years, so the need for this is very remote, but it is possible.

 
