Friday, May 20, 2016

Solr search customization: building a custom filter


Solr has a lot of bells and whistles out of the box for building a robust search platform for a company. But there will always be business use cases where a few customizations are needed to achieve the desired results.

In this post, we consider a custom requirement for a name search application that involves names of European and American origin. For example, the name "De Vera Michael", which is of European origin, should be returned in search results when someone searches for "devera".

This is currently not supported out of the box by any of the Solr analyzers and filters, but we can build our own custom plugin to meet the above requirement and add it to the platform. Solr's support for user-built custom filters is very powerful and a differentiating factor compared to commercial search engines in the market.

Solr allows us to add custom behavior to both index and search operations by manipulating the request pipelines defined in solrconfig.xml and the analysis chains defined in the schema. The customization needed to support the above requirement is to intercept the character stream for a given field at index time and modify it, adding the merged tokens we are looking for before the next stage in the pipeline is called. Because the change operates at this core level, it must execute before the search engine generates tokens from the character stream.

The following field type configuration shows the custom char filter "WordConcatenateFilter" added to the index-time pipeline for processing fields of type "text_en":

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">

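        <!-- Custom char filter that appends the concatenated form of multi-word
             names (e.g. "devera" for "De Vera") to the character stream before
             tokenization -->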
        <charFilter class="com.rupendra.solr.filter.concatenate.WordConcatenateFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>



Solr plugins use the factory pattern: we need to create the Filter class and the FilterFactory class, package them into a jar file, and deploy the jar into one of the directories on Solr's classpath.
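
The sketch below illustrates what the two classes could look like. It is a minimal, illustrative version and not the exact code in the repository linked below; the offset handling in particular is simplified:

    package com.rupendra.solr.filter.concatenate;

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.Map;

    import org.apache.lucene.analysis.CharFilter;
    import org.apache.lucene.analysis.util.CharFilterFactory;

    // Factory that Solr instantiates from the <charFilter> element in the schema.
    public class WordConcatenateFilterFactory extends CharFilterFactory {

        public WordConcatenateFilterFactory(Map<String, String> args) {
            super(args);
        }

        @Override
        public Reader create(Reader input) {
            return new WordConcatenateFilter(input);
        }
    }

    // Char filter that appends a whitespace-free copy of the input so the
    // tokenizer sees both "De Vera Michael" and "DeVeraMichael".
    final class WordConcatenateFilter extends CharFilter {

        private StringReader rewritten;

        WordConcatenateFilter(Reader in) {
            super(in);
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            if (rewritten == null) {
                // Buffer the whole field value, then append the merged form.
                StringBuilder sb = new StringBuilder();
                char[] buf = new char[1024];
                for (int n = input.read(buf); n != -1; n = input.read(buf)) {
                    sb.append(buf, 0, n);
                }
                String original = sb.toString();
                String merged = original.replaceAll("\\s+", "");
                rewritten = new StringReader(original + " " + merged);
            }
            return rewritten.read(cbuf, off, len);
        }

        @Override
        protected int correct(int currentOff) {
            // Simplified: offsets inside the appended text do not map back to
            // the original input, which can skew highlighting for merged tokens.
            return currentOff;
        }
    }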

Source code for the filter is available on Github: SolrCustomFilter

If we need deeper customization at the Lucene codec level, we can build a custom Lucene codec, build Solr against the customized Lucene code, and deploy it. Lucene has come a long way in the last 10+ years, so the need for this is very remote, but it is possible.

 

Search 2.0


Search engines have evolved over the past 20 years along with the rest of information technology. Internet search engines have led the way and arguably set the direction of advancement for information technology overall. Google has become synonymous with search and brought the search box to every electronic device in use today.

Search engines on the enterprise side have also evolved, though at a slower pace compared to Google, moving on from simple text search to multi-dimensional search features. Google has influenced enterprise search engines to deliver features like sponsored links, type-ahead, and "more like this" recommendations. We can generalize the evolution of enterprise search engines into three generations:

  • Search Gen 1 - full text search, manual indexing of data
  • Search Gen 2 - feature rich search, connectors for every data source
  • Search Gen 3 - convergence of search and big data, integration with distributed storage and distributed computation systems

Gen 1 examples: the early Verity K2 platform, the open-source Lucene library
Gen 2 examples: HP IDOL platform, FAST, open-source Solr and Elasticsearch
Gen 3 examples (still evolving): Lucidworks Fusion, Cloudera Search

In Gen 2, commercial engines had an initial advantage, building upon features from Gen 1: Autonomy acquired Verity and integrated its KeyView filters and large set of connectors into the IDOL platform. Open-source search engines competed with the commercial versions and started to match them feature for feature, largely building upon the Lucene search library. More or less all engines in Gen 2 are equally capable at the moment, and anything that can be done with one engine can be done with any other by tweaking configuration or, for finer-grained details, modifying source code as needed.

Convergence of Search and Big Data spaces

Hadoop started as a distributed data platform built by Doug Cutting to support the Nutch crawler, with the ambition of indexing data at internet scale. Over time, Hadoop separated from search and evolved into its own ecosystem with a multitude of tools for analytics and machine learning. With the emergence of distributed computing platforms like Spark, the industry consensus is that a convergence is already happening between the big data and search spaces, and new platforms built with the core capabilities of search plus the power of distributed computation and storage will emerge in the near future. One such product from a reputed vendor, Lucidworks, is Fusion.

Areas of convergence

  • Distributed storage using HDFS
  • Distributed computation
  • Event streaming

System Architecture

Gen 3 systems are still in the early stages with these initial capabilities and will certainly build upon them with new features. Cloudera Search is tightly integrated with the Cloudera Hadoop Distribution (CDH) and can read data streams from Flume for near real-time (NRT) indexing. The following diagram from the Cloudera website shows where search fits in CDH:

[Diagram: Cloudera Search within CDH]

The Lucidworks Fusion platform aligns more closely with enterprise search engines, offering all the advantages of Cloudera Search along with external data source connectors and extensible APIs. The following diagram from the Fusion website shows the components:

[Diagram: Lucidworks Fusion platform components]

Ease of Installation

Fusion is supported on both Windows and Linux platforms. The default installation sets up the following components:
  • ZooKeeper - to manage all nodes
  • Solr
  • REST API services
  • Admin UI
  • Spark

The following architecture diagram from the Fusion website shows how these components fit together:

[Architecture diagram from the Fusion website]

Core Search Features

Pagination 

Used for iterating through the results returned by the search engine.

rows -> number of results per page
start -> offset of the first result to return (0-based)

Example
&rows=5&start=0
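
For example, assuming a hypothetical collection named providers, the third page at five results per page is fetched with start = (page - 1) * rows = 10:
http://localhost:8983/solr/providers/select?q=*:*&rows=5&start=10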

Geo filtering

Used for returning results that fall within a specific radius of a given location.

pt -> the center point, given as "latitude,longitude"
sfield -> a spatial "location" field whose value is stored as "latitude,longitude" in the indexed document
d -> radius distance in kilometers

Example
fq={!geofilt pt=47.7189,-117.435 sfield=geoPoint d=5}
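
The sfield must be defined with a spatial field type in the schema. A minimal sketch, assuming the classic LatLonType that ships with Solr, a hypothetical geoPoint field, and an existing tdouble type for the split latitude/longitude values:

    <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
    <field name="geoPoint" type="location" indexed="true" stored="true"/>
    <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>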

Geo sorting

Used to sort results by distance from a given location, closest first.

Example
sort=geodist(geoPoint,47.7189,-117.435) asc

Geo distance

Used to return the distance from the given location for each search result.

Example
fl=dist:geodist(geoPoint,47.7189,-117.435)
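
These geo features compose in a single request; for example, to keep only results within 5 km, sort them nearest first, and return each result's distance (field names as in the examples above):
fq={!geofilt pt=47.7189,-117.435 sfield=geoPoint d=5}&sort=geodist(geoPoint,47.7189,-117.435) asc&fl=*,dist:geodist(geoPoint,47.7189,-117.435)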

Result combining

Used to collapse duplicate results and show a single result.

Example
fq={!collapse field=providerGuid}
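
The collapsed duplicates can still be returned alongside each winning document via Solr's ExpandComponent, for example:
fq={!collapse field=providerGuid}&expand=true&expand.rows=3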

Filtering

Used to specify additional filters, in field:value syntax, along with the query to the search engine.

Example
fq=DEGREE:Masters
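
Multiple filters are combined by repeating the parameter, and range syntax is also supported; the experienceYears field here is hypothetical:
fq=DEGREE:Masters&fq=experienceYears:[5 TO *]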

Fuzzy search

Fuzzy search allows users to enter misspelled search terms and still get relevant results back. Appending ~ to a term enables edit-distance-based fuzzy matching.

Example
q=text~
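
An optional maximum edit distance (0-2) can follow the tilde; for example, to match "michael" when the user types "micheal":
q=micheal~1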

Phonetic search

Phonetic search allows users to enter search terms that sound like the actual word and still get relevant results back. There are multiple phonetic algorithms, each geared toward a specific kind of dataset. We are using the Beider-Morse phonetic algorithm here. The algorithm is specified as a filter in the analyzer section of the field type definition, as shown below. Later, during field mapping, the defined field type is associated with the field to which we want to apply phonetic matching.


    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
         -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        -->
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>


Wildcard 

Used to support partial-term searches entered by users.

Example
q=text*

Faceting

Used to build the filter options on the results page for the given criteria; facet.field is repeated once per field to facet on.

Example
&facet=true&facet.field=color&facet.field=department&facet.limit=5

Schema

The schema defines how fields are parsed and indexed during the indexing pipeline and how searches are performed on those fields during the search pipeline.
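
A small illustrative fragment tying fields to the field types defined earlier in this post (the field names are hypothetical):

    <field name="providerName" type="text_en" indexed="true" stored="true"/>
    <field name="geoPoint" type="location" indexed="true" stored="true"/>
    <field name="providerGuid" type="string" indexed="true" stored="true"/>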

Field Mapping

This is the critical step in deploying a search application before indexing any data. It involves comprehensively categorizing all the metadata for the content being indexed and labeling each piece of metadata based on the business functionality expected from the search application.

Example mapping: Field Mapping


This concludes the current post detailing the search landscape and an in-depth functional review of the Fusion/Solr platform.