I finally had to give in and implement autosuggest (aka autocomplete, aka predictive text, aka look-ahead) on our free-text search box, à la Google. The requirement was simple: suggest previously typed free-text queries that returned at least one result (suggesting a term that returns 0 results would not be great UX). Implementing this required: a. persisting user-typed queries somewhere, and b. writing an autosuggest component on top of a. and gluing it to the UI to return a suggestion list to the user.
Part I: Setting up solr index
Since we already use solr 1.4 and I am not a big fan of making the database do anything more than persist/retrieve data, I knew I would rather write my autosuggest component on top of solr.
As far as I know, there are 3 ways (apart from wild card search) of achieving autosuggest using solr 1.4:
1. Use EdgeNGrams
2. Use shingles and prefix query.
3. Use the new Terms component (in solr 1.4, the terms component does not have the regex support, which means it can only do a "begins with" match).
After a bit of contemplation and research, I decided to go with EdgeNGrams. Basically, the decision was made by process of elimination:
a. shingles & prefix query: I know AOL does it this way, but unfortunately there's very little documentation online about shingles and how to go about using them; I figured I would be groping in the dark if I went ahead with this.
b. Terms component: Even after applying the regex patch from JIRA, I felt there was a lot more sanitizing of input to be done (for example, if the user types in multiple terms like "joan of", I need to ensure that I replace the whitespace with \s, etc.). I somehow felt it would be easy to break this if a user typed in slightly odd search terms.
Having decided on the EdgeNGrams route, all I needed was two fields in the schema file. Here's what the solr field type definition for the "suggest" field looks like:
<fieldType name="suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="2" maxGramSize="50" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
The suggest field holds the NGrams, whereas the query field stores the query verbatim (apart from lowercasing it). This allows us to return the original query to the user in the suggestion list based on what they typed.
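To make the analyzer chain a bit more concrete, here is a rough Python sketch of what the index-time side does to a single query: keep the whole string as one token, lowercase it, then emit every leading prefix between minGramSize and maxGramSize. This is only an illustration of the idea, not solr's actual code, and the function names are made up for the example.

```python
def edge_ngrams(query, min_gram=2, max_gram=50):
    """Mimic the index-time chain: keyword tokenizer, lowercase, front edge n-grams."""
    token = query.lower()  # the keyword tokenizer keeps the whole input as one token
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def matches(prefix, indexed_query):
    """The query-time chain only lowercases, so a match is an exact hit on one gram."""
    return prefix.lower() in edge_ngrams(indexed_query)


if __name__ == "__main__":
    for gram in edge_ngrams("Joan of Arc"):
        print(repr(gram))
    print(matches("Joan o", "joan of arc"))  # prefix of the whole query
    print(matches("of", "joan of arc"))      # not a prefix, so no match
```

Because the tokenizer never splits on whitespace, only prefixes of the whole query match; typing a word from the middle of a stored query will not bring it up.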
Part II: Building the index
With Part I solved, we have a solr index that generates the NGrams for each query it indexes and also stores the query to be returned in the suggestion list.
Now we need to figure out how to build this index, i.e. how to get the user-typed queries into it. Instead of indexing each and every query the moment the user types it in, I implemented a simple lazy writer (in fact, two levels of lazy writer). Every time a user types in a query that returns results, I store it in a cached dictionary. I hook on to CacheItemRemovedCallback and persist the aggregated queries into a database table; the table also has a DateModified timestamp column which is set to getdate() for newly added items. We then have an offline task which runs nightly and indexes all the newest queries into the solr index. Part II solved: now we have the solr index built and ready to serve "suggestions".
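The first level of the lazy writer boils down to "aggregate in memory, persist in batches". Here is a minimal Python sketch of that idea. It is not the actual implementation: the real thing flushes from an ASP.NET cache eviction callback, whereas this sketch uses a simple size threshold as a stand-in, and the class name, the flush callback and the lowercase/trim normalization are all assumptions made for the example.

```python
from collections import Counter


class QueryBuffer:
    """First-level lazy writer: aggregate queries in memory, flush in batches."""

    def __init__(self, flush, max_distinct=1000):
        self._counts = Counter()
        self._flush = flush              # hypothetical callback persisting {query: count}
        self._max_distinct = max_distinct

    def record(self, query, result_count):
        if result_count <= 0:            # never keep a query that returned nothing
            return
        # Assumption: normalize so "Joan of Arc" and "joan of arc" aggregate together.
        self._counts[query.strip().lower()] += 1
        if len(self._counts) >= self._max_distinct:
            self.flush()

    def flush(self):
        if self._counts:
            batch, self._counts = dict(self._counts), Counter()
            self._flush(batch)           # e.g. upsert rows and stamp DateModified


if __name__ == "__main__":
    buf = QueryBuffer(flush=print, max_distinct=2)
    buf.record("Joan of Arc", 12)
    buf.record("joan of arc", 3)
    buf.record("no such thing", 0)       # ignored: zero results
    buf.record("napoleon", 5)            # second distinct query triggers the flush
```

The nightly job is then the second level: it only has to pick up the rows whose timestamp is newer than its last run.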
Part III: gluing it with the presentation layer
There are two bits to the presentation layer:
a. Hooking the solr index to get back the suggestion list as JSON.
b. The presentation itself aka UX.
Since we already use jQuery on our project, I decided to go with the Autocomplete plugin for the UX. The plugin again lacks in-depth documentation, but it wasn't very hard to get it working. Instead of calling the solr index directly from the autocomplete plugin (which would have the serious side effect of exposing our solr endpoint to the public and also adding a solr dependency to the UI layer), I decided to write a WCF REST shim around the solr index. Adding a shim/decorator on top of solr also gave me a place to intercept the calls to solr, sanitize the user input (basically escaping the Lucene special characters) and do some heavy-duty caching. That's where I had a problem: WCF services in 3.5 don't have caching support built in, and you need to install the REST Starter Kit to get it. Once I had that installed, I was able to get caching working with my WCF REST service.
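For what it's worth, the "escaping the Lucene special characters" part of the shim is tiny. Below is a hedged Python sketch of the idea (the real shim is a WCF service, so this is purely illustrative and the function name is made up). It backslash-escapes every character that has special meaning in the Lucene query syntax; the two-character operators && and || are handled by escaping each & and | individually, which is an assumption of this sketch rather than something the original code is known to do.

```python
import re

# Characters with special meaning in the Lucene query syntax:
# + - && || ! ( ) { } [ ] ^ " ~ * ? : \
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\&|])')


def escape_lucene(text):
    """Prefix every Lucene special character in text with a backslash."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)


if __name__ == "__main__":
    for raw in ["joan of arc", "c++", "title:foo", "(1+1)"]:
        print(raw, "->", escape_lucene(raw))
```

Doing this in the shim means the UI never has to know anything about query syntax.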
After thoughts
We haven't rolled out this implementation yet, so I don't know how it's going to perform, but at least during my limited testing I have been able to get quite fast response times. By the way, SOLR-1316 seems quite promising too and I've been watching that thread with quite some interest: it talks about implementing the terms component as a ternary search tree, so that should be quite fast.
Tuesday, March 30, 2010
Tuesday, March 23, 2010
solr and indexing Chinese text
So, I've been kinda bugged with trying to get our search functionality up to the mark for Chinese (I also have the other J and K in CJK to worry about, but more on that later). As some of you would know, the biggest issue with the CJK languages is that there are no word boundaries, unlike in the Latin languages. With no word boundaries, the question is: how do you tokenize the text? There are some brute-force ways of tokenizing (tokenize on uni-grams) and some middle-of-the-road approaches (tokenize on bi-grams, which is what lucene's built-in CJKAnalyzer does). We started with the CJKAnalyzer but unfortunately figured out that it just doesn't cut it, especially because CJKAnalyzer does not let you search on single-character tokens (as it only tokenizes on bi-grams). So, I decided to bite the bullet and upgrade solr to 1.4 (I guess I should post my experiences on that too some time), which also means that we now have lucene 2.9 to play with. One of the contrib packages with Lucene 2.9 is the smartCn analyzer (http://issues.apache.org/jira/browse/LUCENE-1882), and I decided to give it a shot. smartCn uses a built-in dictionary and a Hidden Markov model (I have no clue what that means) to "smartly" tokenize the Chinese string. The smartCn analyzer did solve the problem of single-character searches and might get us through the next release, but I feel it still doesn't solve the problem to my comfort level (multiple-character searches return fewer results using smartCn than with the CJKAnalyzer). I guess one way to solve it is to seed the built-in dictionary with Chinese terms which are relevant to our site; unfortunately, the built-in dictionary seems to be serialized into some binary format from a java class, which I am having a hard time deciphering. I need to dig a little deeper into this and maybe shoot an email to the contributor of this package to see if there is some easy way of seeding the built-in dictionary.
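The single-character problem with bi-gram tokenizing is easy to see in a toy example. The Python sketch below is not the CJKAnalyzer itself, just an illustration of overlapping bi-grams versus uni-grams, with made-up helper names and a sample string chosen for the example. A one-character query can only ever produce a one-character token, and no such token exists in a bi-gram index.

```python
def unigrams(text):
    """Brute force: every character is its own token."""
    return list(text)


def bigrams(text):
    """Overlapping two-character tokens; a lone character stays as a single token."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def hits(query_tokens, indexed_tokens):
    """True when every query token is present among the indexed tokens."""
    return all(tok in indexed_tokens for tok in query_tokens)


if __name__ == "__main__":
    doc = "中华人民共和国"
    bigram_index = set(bigrams(doc))
    unigram_index = set(unigrams(doc))
    print(hits(bigrams("人民"), bigram_index))    # two-character query against bi-grams
    print(hits(bigrams("华"), bigram_index))      # single character against bi-grams
    print(hits(unigrams("华"), unigram_index))    # single character against uni-grams
```

The uni-gram index does find the single character, but this toy version ignores token positions, which is exactly why uni-grams alone tend to match far too much.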
Thursday, March 11, 2010
oops, something when wrong...
So, there's lotsa buzz about Google Reader Play and I was tempted enough to try it. But when I when (sic) there, all I got was an "oops, something when wrong..retrying" message:
Well, maybe that's what beta means these days...