I finally had to give in and implement autosuggest (aka autocomplete, aka predictive text, aka look-ahead) on our free text search box, a la Google. The requirement was simple: implement autosuggest based on free text queries users had previously typed in and which returned some results (it would not be a great UX to suggest a term that returns 0 results). Implementing this required: a. persisting the user-typed queries somewhere, and b. writing an autosuggest component on top of a. and gluing it to the UI to return a suggestion list to the user.
Part I: Setting up the solr index
Since we already use solr 1.4 and I am not a big fan of making the database do anything more than persist/retrieve data, I knew I would rather write my autosuggest component on top of solr.
As far as I know, there are 3 ways (apart from wildcard search) of achieving autosuggest with solr 1.4:
1. Use EdgeNGrams
2. Use shingles and prefix query.
3. Use the new Terms component (in solr 1.4, the terms component does not have regex support, which means it can only do a "begins with" match).
After a bit of contemplation and research, I decided to go with EdgeNGrams. Basically, the decision was made by process of elimination:
a. shingles & prefix query: I know AOL does this way but unfortunately there's very little documentation online around shingles and how you go about using it; I figured; I would be groping in the dark if I went ahead with this.
b. Terms component: Even after applying the regex patch from JIRA, I felt there was a lot more sanitizing of input that needed to be done (things like: if the user types in multiple terms, say "joan of", I need to make sure I replace the whitespace with \s, etc.). I somehow felt it would be easy to break this if a user typed in slightly odd search terms.
Having decided to go the EdgeNGrams route, all I needed was two fields in the schema file. Here's what the solr field type definition for the "suggest" field looks like:
<fieldtype name="suggest" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory"
minGramSize="2" maxGramSize="50" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
The suggest field holds the NGrams, whereas the query field stores the query verbatim (apart from lowercasing it). Note that the query-time analyzer does not apply the EdgeNGram filter: the user's prefix, once lowercased, is matched directly against the indexed grams. This allows us to return the query back to the user in the suggestion list based on what they typed.
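Just to make the query side concrete, here's a rough sketch (not our actual code) of how a suggestion lookup against this field could be issued from C#; the solr URL and the idea of returning only the stored query field via fl are assumptions for illustration:

using System;
using System.Net;
using System.Web;

public static class SuggestLookup
{
    // Returns solr's raw JSON response for a given user prefix.
    public static string FetchSuggestionsJson(string prefix)
    {
        // The query-time analyzer lowercases input, so lowercasing here simply mirrors it;
        // fl=query asks solr to return only the stored, verbatim query text.
        // NOTE: real input needs escaping (whitespace, Lucene special characters) - see Part III.
        string url = "http://localhost:8983/solr/select?wt=json&rows=10&fl=query" +
                     "&q=suggest:" + HttpUtility.UrlEncode(prefix.ToLowerInvariant());

        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }
}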
Part II: Building the index
With Part I of the problem solved, we now have a solr index that generates the NGrams for each query it indexes and also stores the query that will be returned as part of the suggestion list.
Now we need to figure out how to build this index, i.e. how we get the user-typed queries into it. Instead of indexing each and every query the moment the user types it in, I implemented a simple lazy writer (in fact, two levels of lazy writer). Every time the user types in a query which returns results, I store it in a cached dictionary. I hook on to CacheItemRemovedCallback and persist the aggregated queries into a database table; the table also holds a DateModified timestamp column which is set to getdate() for newly added items. We then have an offline task which runs nightly and indexes all the newest queries into the solr index. Part II solved: we now have the solr index built and ready to serve "suggestions".
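For illustration, here's a minimal sketch of that lazy writer idea in an ASP.NET app. The class and method names (QueryAggregator, FlushToDatabase), the cache key and the 15-minute expiry are made up for the example, but the CacheItemRemovedCallback hook is the mechanism described above:

using System;
using System.Collections.Generic;
using System.Web;
using System.Web.Caching;

public static class QueryAggregator
{
    private const string CacheKey = "PendingSuggestQueries";
    private static readonly object SyncRoot = new object();

    // Called whenever a user query returns results.
    public static void Record(string query)
    {
        lock (SyncRoot)
        {
            var pending = HttpRuntime.Cache[CacheKey] as Dictionary<string, int>;
            if (pending == null)
            {
                pending = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
                // Flush the aggregated queries when the item falls out of the cache.
                HttpRuntime.Cache.Insert(
                    CacheKey, pending, null,
                    DateTime.UtcNow.AddMinutes(15), Cache.NoSlidingExpiration,
                    CacheItemPriority.Default, OnRemoved);
            }

            int count;
            pending.TryGetValue(query, out count);
            pending[query] = count + 1;
        }
    }

    private static void OnRemoved(string key, object value, CacheItemRemovedReason reason)
    {
        var pending = (Dictionary<string, int>)value;
        // Second level of laziness: write to the db table (DateModified = getdate());
        // the nightly task then picks up the newest rows and pushes them into solr.
        FlushToDatabase(pending);
    }

    private static void FlushToDatabase(Dictionary<string, int> pending)
    {
        // Placeholder: upsert each query (and its hit count) into the queries table.
    }
}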
Part III: Gluing it to the presentation layer
There are two bits to the presentation layer:
a. Hooking into the solr index to get the suggestion list back as JSON.
b. The presentation itself, aka the UX.
Since we already use jQuery on our project, I decided to go with the Autocomplete plugin for the UX. The plugin again lacks in-depth documentation, but it wasn't very hard to get it working. Instead of calling the solr index directly from the autocomplete plugin (which would have the serious side effect of exposing our solr endpoint to the public and also adding a solr dependency to the UI layer), I decided to write a WCF REST shim around the solr index. Adding a shim/decorator on top of solr also gave me a place to intercept the calls to solr, sanitize the user input (basically escaping the Lucene special characters) and do some heavy-duty caching. That's where I hit a problem: WCF services in .NET 3.5 don't have caching support built in; you need to install the REST Starter Kit to get it, but once I had it installed I was able to get caching working with my WCF REST service.
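To give a feel for the shim, here's an illustrative sketch, not our production service: the contract name, URI template and helper methods are made up, but the idea is a thin WCF REST contract sitting between the UI and solr, escaping Lucene's special characters before the query ever reaches the index (output caching via the REST Starter Kit is layered on top in the real service):

using System;
using System.ServiceModel;
using System.ServiceModel.Web;
using System.Text;

// Thin REST contract the jQuery Autocomplete plugin calls instead of solr itself.
[ServiceContract]
public interface ISuggestService
{
    // e.g. GET .../SuggestService.svc/suggest?term=joan
    [OperationContract]
    [WebGet(UriTemplate = "suggest?term={term}", ResponseFormat = WebMessageFormat.Json)]
    string[] Suggest(string term);
}

public class SuggestService : ISuggestService
{
    public string[] Suggest(string term)
    {
        string sanitized = EscapeLuceneSpecialCharacters(term ?? string.Empty);
        // Query solr (e.g. along the lines of the sketch in Part I) and map the
        // response into a plain string array the autocomplete plugin can consume.
        return QuerySolr(sanitized);
    }

    // Escapes the characters Lucene's query parser treats as special.
    private static string EscapeLuceneSpecialCharacters(string input)
    {
        const string special = @"+-&|!(){}[]^""~*?:\";
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (special.IndexOf(c) >= 0) sb.Append('\\');
            sb.Append(c);
        }
        return sb.ToString();
    }

    private static string[] QuerySolr(string term)
    {
        // Placeholder for the actual solr call + JSON parsing.
        return new string[0];
    }
}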
Afterthoughts
We haven't rolled out this implementation yet, so I don't know how it's going to go in terms of performance, but at least during my limited testing I have been able to get quite fast response times. By the way, SOLR-1316 seems quite promising too, and I've been watching that thread with some interest: it talks about implementing the terms component as a ternary search tree, so that should be quite fast.