Search engines have evolved over the past 20 years along with the rest of information technology. Internet search engines have led the way and arguably set the direction for the industry as a whole. Google has become synonymous with search and has brought the search box to nearly every electronic device in use today.
Enterprise search engines have also evolved, though at a slower pace than Google, moving from simple text search to multi-dimensional search features. Google has influenced enterprise search engines to deliver features like sponsored links, type-ahead and "more like this". We can generalize the evolution of enterprise search engines into three generations:
- Search Gen 1 - full text search, manual indexing of data
- Search Gen 2 - feature rich search, connectors for every data source
- Search Gen 3 - convergence of search and big data, integration with distributed storage and distributed computation systems
Examples of each generation:
- Gen 1 - the early Verity K2 platform, the open-source Lucene library
- Gen 2 - HP IDOL platform, FAST, open-source Solr, Elasticsearch
- Gen 3 - still evolving: Lucidworks Fusion, Cloudera Search
Among Gen 2 search engines, commercial products had an initial advantage because they built upon Gen 1 features: Autonomy, for example, acquired Verity and integrated its KeyView filters and large set of connectors into the IDOL platform. Open-source search engines competed with the commercial products and began to match them feature for feature, largely building upon the Lucene search library. At the moment all Gen 2 engines are more or less equally capable: what can be done with one engine can generally be done with any other by tweaking configuration or, for finer-grained behavior, modifying source code.
Convergence of Search and Big Data spaces
Hadoop was started by Doug Cutting as a distributed data platform to support the Nutch crawler, with the ambition of indexing data at internet scale. Over time Hadoop separated from search and evolved into its own ecosystem with a multitude of tools for analytics and machine learning. With the emergence of distributed computing platforms like Spark, there is a growing view in the industry that the big data and search spaces are already converging, and that new platforms combining core search capabilities with distributed computation and storage will emerge in the near future. One such search product is Fusion, from Lucidworks.
Areas of convergence
Distributed storage using HDFS
Distributed computation
Event streaming
System Architecture
Gen 3 systems are still in their early stages. They offer these initial capabilities today and can be expected to build upon them with new features. Cloudera Search is tightly integrated with the Cloudera Hadoop distribution (CDH) and can read data streams from Flume for near real-time (NRT) indexing. The following diagram from the Cloudera website shows where search fits in CDH:
The Lucidworks Fusion platform aligns more closely with enterprise search engines, offering the advantages of Cloudera Search along with support for external data sources and extensible APIs. The following diagram from the Fusion website shows the components:
Ease of Installation
Fusion is supported on both Windows and Linux. The default installation sets up the following components:
- Zookeeper - to manage all nodes
- Solr
- REST API services
- Admin UI
- Spark
Core Search Features
Pagination
Used to iterate through the results returned by the search engine.
rows -> number of results per page
start -> zero-based offset of the first result to return
Example
&rows=5&start=0
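As a minimal sketch of how a results page might map onto these two parameters, the Python below converts a 1-based page number into rows and start. The helper names page_params and page_query are hypothetical and are not part of Solr or Fusion.

```python
from urllib.parse import urlencode


def page_params(page: int, rows: int = 5) -> dict:
    """Translate a 1-based page number into rows/start parameters."""
    if page < 1 or rows < 1:
        raise ValueError("page and rows must be positive")
    # start is a 0-based offset into the full result list
    return {"rows": rows, "start": (page - 1) * rows}


def page_query(q: str, page: int, rows: int = 5) -> str:
    """Build the query-string portion of a paged search request."""
    return urlencode({"q": q, **page_params(page, rows)})


if __name__ == "__main__":
    print(page_query("text", 1))
    print(page_query("text", 3))
```

The first page corresponds to the example above (rows=5, start=0). Each later page advances start by rows.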
Geo filtering
Used to return only results that lie within a given radius of a given location.
pt -> center point, given as "latitude,longitude"
sfield -> spatial field, defined with a "location" field type, whose value is stored as "latitude,longitude" in the indexed document
d -> radius distance, in kilometers by default (convert from miles if needed)
Example
fq={!geofilt pt=47.7189,-117.435 sfield=geoPoint d=5}
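The filter string is easy to get wrong by hand, so here is a small sketch of building it in Python. It assumes Solr's default distance unit of kilometers for d. The helper names geofilt and miles_to_km are hypothetical.

```python
from urllib.parse import urlencode

KM_PER_MILE = 1.609344


def miles_to_km(miles: float) -> float:
    """Convert miles to kilometers (assumed here to be the unit of d)."""
    return miles * KM_PER_MILE


def geofilt(lat: float, lon: float, sfield: str, d_km: float) -> str:
    """Return a geofilt filter query for a circle around (lat, lon)."""
    # Doubled braces produce literal { and } in an f-string
    return f"{{!geofilt pt={lat},{lon} sfield={sfield} d={d_km}}}"


if __name__ == "__main__":
    fq = geofilt(47.7189, -117.435, "geoPoint", 5)
    print(fq)
    # urlencode escapes the braces, spaces and commas for use in a URL
    print(urlencode({"q": "*:*", "fq": fq}))
```

Keeping the unit conversion in one place makes it harder to pass a radius in miles by accident.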
Geo sorting
Used to sort results by distance from the given location, closest first.
Example
sort=geodist(geoPoint,47.7189,-117.435) asc
Geo distance
Used to return distance from the given location for each search result.
Example
fl=dist:geodist(geoPoint,47.7189,-117.435)
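Both geodist examples rest on a great-circle distance between two latitude/longitude points. The sketch below illustrates that idea with the haversine formula. It is not Solr's implementation. The earth radius constant is an approximation, and the function names and sample documents are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean earth radius (assumption)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against floating-point values slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def sort_by_distance(docs: list, lat: float, lon: float) -> list:
    """Return (distance_km, doc) pairs ordered closest first."""
    pairs = [(haversine_km(lat, lon, d["lat"], d["lon"]), d) for d in docs]
    return sorted(pairs, key=lambda pair: pair[0])


if __name__ == "__main__":
    docs = [
        {"id": "far", "lat": 48.7189, "lon": -117.435},
        {"id": "near", "lat": 47.7289, "lon": -117.435},
    ]
    for dist, doc in sort_by_distance(docs, 47.7189, -117.435):
        print(doc["id"], round(dist, 2))
```

Sorting by the computed value mirrors the sort example, and returning it alongside each document mirrors the fl example.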
Result combining
Used to collapse results that share the same value in a given field, so that only a single result is shown for each value.
Example
fq={!collapse field=providerGuid}
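To make the effect of collapsing concrete, here is a rough Python simulation over an in-memory result list. It assumes the highest-scoring document is kept for each field value and that documents without the field are dropped, which to my understanding matches the collapse parser's defaults. The function name and sample data are hypothetical.

```python
def collapse(docs: list, field: str) -> list:
    """Keep the highest-scoring document for each distinct value of field."""
    best = {}
    for doc in docs:
        key = doc.get(field)
        if key is None:
            # Assumption: documents missing the collapse field are ignored
            continue
        if key not in best or doc["score"] > best[key]["score"]:
            best[key] = doc
    # Return the surviving documents in relevance order
    return sorted(best.values(), key=lambda d: d["score"], reverse=True)


if __name__ == "__main__":
    results = [
        {"id": 1, "providerGuid": "a", "score": 1.0},
        {"id": 2, "providerGuid": "a", "score": 2.0},
        {"id": 3, "providerGuid": "b", "score": 1.5},
    ]
    print([d["id"] for d in collapse(results, "providerGuid")])
```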
Filtering
Used to specify additional filters that are applied along with the query sent to the search engine.
Example
fq=Masters:DEGREE
Fuzzy search
Fuzzy search lets users enter misspelled search terms and still get relevant results.
Example
q=text~
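Fuzzy matching is based on edit distance: a term matches if it can be turned into the query term with a small number of single-character edits. The sketch below uses plain Levenshtein distance with a threshold of 2 to illustrate the idea. It is a simplification and not Lucene's implementation, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def fuzzy_match(query: str, term: str, max_edits: int = 2) -> bool:
    """True if term is within max_edits edits of query."""
    return edit_distance(query, term) <= max_edits


if __name__ == "__main__":
    for candidate in ["test", "texts", "context"]:
        print(candidate, edit_distance("text", candidate), fuzzy_match("text", candidate))
```

A lower threshold can be requested in the query itself, for example q=text~1.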
Phonetic search
Phonetic search lets users enter search terms that sound like the intended word and still get relevant results. Several phonetic algorithms are available, each geared towards a particular kind of data. We are using the Beider-Morse phonetic algorithm here (in Solr, solr.BeiderMorseFilterFactory). The algorithm is specified as a filter within the analyzer of a field type definition. Later, during field mapping, that field type is assigned to each field that needs sounds-like matching. The text_en field type below shows the kind of analyzer chain in which such a filter is placed.
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal -->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
<filter class="solr.EnglishMinimalStemFilterFactory"/>
-->
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
<filter class="solr.EnglishMinimalStemFilterFactory"/>
-->
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
Wildcard
Used to match partial terms entered by users.
Example
q=text*
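A trailing wildcard behaves like a prefix match against the indexed terms. The sketch below mimics that with Python's fnmatch module over a hypothetical in-memory term list. It is only an illustration of the matching rule, since fnmatch also supports bracket patterns that have no equivalent in the query syntax.

```python
from fnmatch import fnmatchcase


def wildcard_terms(terms: list, pattern: str) -> list:
    """Return the terms matching a pattern with * (any run) and ? (one char)."""
    return [t for t in terms if fnmatchcase(t, pattern)]


if __name__ == "__main__":
    indexed = ["text", "textbook", "texture", "context", "test"]
    print(wildcard_terms(indexed, "text*"))
    print(wildcard_terms(indexed, "te?t"))
```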
Faceting
Used to build the filters shown on the results page for the given fields. To facet on several fields, repeat the facet.field parameter once per field.
Example
&facet=true&facet.field=color&facet.field=department&facet.limit=5
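Solr expects one facet.field parameter per field, so the parameters are best built from a list of pairs and not from a dictionary, which cannot hold a repeated key. A minimal sketch follows, with a hypothetical helper name.

```python
from urllib.parse import urlencode


def facet_params(fields: list, limit: int = 5) -> str:
    """Build facet parameters, repeating facet.field once per field."""
    params = [("facet", "true")]
    params += [("facet.field", name) for name in fields]
    params.append(("facet.limit", limit))
    return urlencode(params)


if __name__ == "__main__":
    print(facet_params(["color", "department"]))
```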
Schema
The schema defines how fields are parsed and indexed in the indexing pipeline and how searches are performed on those fields in the search pipeline. Example
Field Mapping
This is the critical step in deploying a search application and must be completed before any data is indexed. It involves comprehensively categorizing all the metadata for the content being indexed and labeling each metadata field according to the business functionality expected from the search application.
Example mapping: Field Mapping
This concludes the current post, which covered the search landscape and an in-depth functional review of the Fusion/Solr platform.