Wednesday, August 18, 2010

View On Black flickr bookmarklet

I’ll admit, I’m a fan of flickr and upload all of my pictures over there. One thing that I don’t like about flickr, though, is the default white background, which unfortunately can make the photos look a little dull. So I would generally include a link in the description to http://www.bighugelabs.com/onblack.php so that the viewer has the option of viewing the photo on black. Unfortunately, this meant that every time I uploaded a photo, I had to manually copy the bighugelabs link into the description. Since I am no fan of redoing the same stuff over and over again, I decided to automate the entire process and ended up writing a simple javascript bookmarklet. Now all I need to do is navigate to my photo details page and click on the bookmarklet, and voila, the link to the “onblack” version is automatically added to the description. Below is the bookmarklet just in case you are also a fan of “onblack”:

javascript:var%20evt%20=%20document.createEvent("MouseEvents");%20evt.initMouseEvent("click",%20true,%20true,%20window,%200,%200,%200,%200,%200,%20false,%20false,%20false,%20false,%200,%20null);%20var%20cb%20=%20document.getElementById("meta").getElementsByTagName("div")[0];cb.dispatchEvent(evt);var%20reg=/\/CHANGE-THIS\/(\d+)\//;var%20y=window.location.href.match(reg)[1];document.getElementById("meta").getElementsByTagName("div")[0].getElementsByTagName("textarea")[0].value='<a%20href="http://www.bighugelabs.com/onblack.php?size=large&id='%20+%20y%20+%20'">View%20On%20Black</a>';document.getElementById("meta").getElementsByTagName("button")[0].click();void(null);

CHANGE-THIS should be replaced with your flickr NSID or your flickr friendly URL, i.e. whatever you see in the address bar of your flickr photostream. For example, if your flickr photo URL is http://www.flickr.com/photos/xyz, you should replace CHANGE-THIS with xyz. Do keep in mind that this bookmarklet will replace whatever is in the description field with just a link to the “OnBlack” version with the text “View On Black”. Also, I have only tried this on Firefox, where it worked in almost all the scenarios that I tried.

Sunday, May 16, 2010

Contextual spelling suggestions using solr SpellCheckComponent

One of the problems that I faced while trying to implement spelling suggestions (or "Did you mean") in solr was with phrase queries. If a user types in multiple terms and, say, one or more of the terms are mistyped, solr provides suggestions on the individual terms (using edit distances) in isolation, then collates the top suggestions and returns the result as spellcheck.collation. That's all dandy if the user just types in one search term or if your searches are OR searches, but if you do an AND search (or some other form of AND using dismax), there might be instances when the spellchecker's collation actually returns 0 results, resulting in a bad user experience. Let me try to explain this with an example:
Assume that the user types in "chicen fayita sadwich" and you have just two documents in your index:

Doc 1 ==> chicken fajita
Doc 2 ==> veg sandwich
Now, since solr/lucene treats the terms in isolation, what you get in spellcheck.collation is "chicken fajita sandwich". Unfortunately, if you have an AND search and the user does click on this "Did you mean" link, it results in the query (chicken AND fajita AND sandwich), which returns zero results.
So, how do you solve this? One way is to use Shingles for creating your "spelling corpus"; the only trouble with Shingles is that the length of a suggestion is bound by the maxShingleSize, so if you set your maxShingleSize to, say, 4, you only get suggestions of up to 4 terms. A quick illustration of what shingling produces is below.
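To make the shingle idea concrete, here is a minimal lucene sketch of what a shingle filter would feed into the spelling corpus (this assumes lucene 2.9 with the contrib analyzers jar, which is where ShingleFilter lives; on the Solr side you would wire up the equivalent with solr.ShingleFilterFactory in the field's analyzer chain):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class ShingleDemo {
        public static void main(String[] args) throws Exception {
            // maxShingleSize = 3, so shingles of up to 3 terms are produced
            TokenStream ts = new ShingleFilter(
                    new WhitespaceTokenizer(new StringReader("chicken fajita sandwich")), 3);
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            while (ts.incrementToken()) {
                System.out.println(term.term());
            }
            // emits the single terms plus "chicken fajita", "fajita sandwich"
            // and "chicken fajita sandwich"
        }
    }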
Another, cleaner albeit slightly slower approach is to extend solr.SpellCheckComponent and fire another solr query on the collation itself; if the collation's results are greater than 0 (or some other threshold), you return the collation, else you try the next set of suggestions (or just blank out the collation in case you don't want to fire multiple queries). This is what I did (though I am not 100% sure if this is the right way of firing solr queries): extend SpellCheckComponent and hook into the processRequest method to get a handle on the ResponseBuilder object (from which you can then get the SolrQueryRequest and SolrIndexSearcher objects). Then override the toNamedList method: get the collation string, fire another solr query using the searcher and check the result count; if suggestionHits > originalHits * MULTIPLIER, leave the collation as is, else blank it out. The guts of this are in the overridden toNamedList method, which is below in case somebody is interested:
    // builder (the ResponseBuilder captured in processRequest), log,
    // MIN_THRESHOLD and MULTIPLIER are fields on this subclass
    @Override
    protected NamedList toNamedList(SpellingResult spellingResult, String origQuery,
            boolean extendedResults, boolean collate)
    {
        NamedList result = super.toNamedList(spellingResult, origQuery,
                extendedResults, collate);
        if (collate) {
            String collation = (String) result.get("collation");
            if (collation != null && collation.length() > 0 && builder != null) {
                try {
                    // number of hits for the user's original query
                    int hits = builder.getResults().docList.matches();
                    if (hits > MIN_THRESHOLD) {
                        // the original query already returns enough results,
                        // so no suggestion is needed
                        result.remove("collation");
                        return result;
                    }
                    // fire a query for the collation itself and count its hits
                    SolrIndexSearcher searcher = builder.req.getSearcher();
                    QParser qp = QParser.getParser(collation, "dismax", builder.req);
                    Query q = qp.getQuery();
                    TopDocs docs = searcher.search(q, 1);
                    int suggestionHits = docs.totalHits;
                    log.info("current hits:" + hits);
                    log.info("suggestion hits:" + suggestionHits);
                    if (suggestionHits <= hits * MULTIPLIER) {
                        // the suggestion is no better than the original query;
                        // remove the collation
                        result.remove("collation");
                    }
                } catch (IOException e) {
                    log.error(e.toString());
                } catch (ParseException e) {
                    log.error(e.toString());
                }
            }
        }
        return result;
    }

Tuesday, March 30, 2010

Implementing Autosuggest in ASP.Net using a WCF REST service, jQuery and solr

I finally had to give in and implement autosuggest (aka autocomplete, aka predictive text, aka look-ahead) on our free text search box ala Google. The requirement was simple: implement autosuggest based on previously typed free text queries which returned some results (as it would not be a great UX to suggest a term which returns 0 results). Implementing this required: a. persisting user typed queries somewhere, and b. writing an autosuggest component based on a. and gluing it to the UI to return the suggestion list to the user.
Part I: Setting up solr index
Since we already use solr 1.4 and I am not a big fan of making the database do anything more than just persist/retrieve data, I knew I would rather write my autosuggest component utilizing solr.
As far as I know, there are 3 ways (apart from wild card search) of achieving autosuggest using solr 1.4:
1. Use EdgeNGrams
2. Use shingles and prefix query.
3. Use the new Terms component (in solr 1.4, the terms component does not have regex support, which means it can only do a "begins with" match).
After a bit of contemplation and research, I decided to go with EdgeNGrams. Basically, the decision was made by the process of elimination:
a. shingles & prefix query: I know AOL does it this way but unfortunately there's very little documentation online around shingles and how you go about using them; I figured I would be groping in the dark if I went ahead with this.
b. Terms component: Even after applying the regex patch from JIRA, I felt there was a lot more sanitizing of input that needed to be done (things like: if the user types in multiple terms like "joan of", I need to ensure that I replace the whitespace with \s etc.). I felt it would be easy to break this in case a user types in slightly odd search terms.
Having decided on going the EdgeNGrams route, all I needed was two fields in the schema file. Here's how the solr fieldType definition for the "suggest" field looks:

<fieldtype name="suggest" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory"
minGramSize="2" maxGramSize="50" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>


The suggest field holds the NGrams, whereas the query field stores the query verbatim (apart from lowercasing it). This allows us to return the query back to the user in the suggestion list based on what they typed.
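To make that concrete, a suggestion lookup then becomes a plain search on the suggest field that returns the stored query field (the host and field names here are just an assumed local setup):

http://localhost:8983/solr/select?q=suggest:chic&fl=query&rows=10&wt=json

Anything previously indexed that begins with "chic" (say, "chicken fajita") comes back in the query field, ready to be shown in the dropdown.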
Part II: Building the index
Part I of the problem solved: now we have a solr index which generates the NGrams from the queries it indexes and also stores each query verbatim to be returned as part of the suggestion list.
Now, we need to figure out how to build this index, i.e. how we get the user typed queries into it. Instead of indexing each and every query the moment the user types it in, I implemented a simple lazy writer (in fact, two levels of lazy writer). Every time the user types in a query which returns results, I store it in a cached dictionary. I hook on to CacheItemRemovedCallback and persist the aggregated queries into a database table; the table also holds a DateModified timestamp column which is updated to getdate() for newly added items. We then have an offline task which runs on a nightly basis and indexes all the newest queries into the solr index (a rough sketch of the aggregation piece is below). Part II solved: now we have the solr index built and ready to serve "suggestions".
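Since CacheItemRemovedCallback and getdate() are ASP.Net/SQL Server specifics, here is a rough, hedged sketch of the same aggregate-then-flush idea in Java; the class name, the 10-minute flush interval and the DAO call are all made up for illustration:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class QueryAggregator {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();
        private final ScheduledExecutorService flusher =
                Executors.newSingleThreadScheduledExecutor();

        public QueryAggregator() {
            // first level of laziness: flush the in-memory aggregates to the
            // database every 10 minutes instead of on every request
            flusher.scheduleAtFixedRate(new Runnable() {
                public void run() { flushToDb(); }
            }, 10, 10, TimeUnit.MINUTES);
        }

        // called for every user query that returned at least one result
        public synchronized void record(String query) {
            Integer c = counts.get(query);
            counts.put(query, c == null ? 1 : c + 1);
        }

        private void flushToDb() {
            Map<String, Integer> snapshot;
            synchronized (this) {
                snapshot = new HashMap<String, Integer>(counts);
                counts.clear();
            }
            // second level of laziness: upsert each (query, count) row and bump
            // its DateModified so the nightly job only picks up new/changed rows
            // queryDao.upsert(snapshot); // hypothetical DAO call
        }
    }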
Part III: gluing it with the presentation layer
There are two bits to the presentation layer:
a. Hooking the solr index to get back the suggestion list as JSON.
b. The presentation itself aka UX.

Since we already use jQuery on our project, I decided to go with the Autocomplete plugin for the UX. The plugin again lacks in-depth documentation but it wasn't very hard to get it working. Instead of calling the solr index directly from the autocomplete plugin (which would have the serious after-effect of exposing our solr endpoint to the public and also adding a solr dependency to the UI layer), I decided to write a WCF REST shim around the solr index. Adding a shim/decorator on top of solr also gave me a place to intercept the calls to solr, sanitize the user input (basically escaping the Lucene special characters; a sketch of that is below) and do some heavy duty caching. That's where I had a problem: WCF services in 3.5 don't have caching support built in; you need to install the REST Starter Kit to get the caching support, but once I had it installed I was able to get caching working with my WCF REST service.
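Our shim is C#, but the escaping logic is simple enough to sketch; here is a hedged Java version that backslash-escapes Lucene's documented query syntax characters (on the Java side, lucene's own QueryParser.escape does the same job):

    public static String escapeLuceneQuery(String input) {
        // Lucene's special characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \
        final String specials = "+-&|!(){}[]^\"~*?:\\";
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (specials.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }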

Afterthoughts
We haven't rolled out this implementation yet, so I don't know how it's gonna go in terms of performance, but at least during my limited testing I have been able to get quite fast response times. By the way, SOLR-1316 seems quite promising too and I've been watching that thread with quite some interest: it talks about implementing the terms component as a ternary search tree, so that should be quite fast.

Tuesday, March 23, 2010

solr and indexing Chinese text

So, I've been kinda bugged with trying to get our search functionality up to the mark for Chinese (I also have the other J and K in CJK to worry about, but more on that later). As some of you would know, the biggest issue with the CJK languages is that there are no word boundaries, unlike the Latin languages. With the word boundaries missing, the question is: how do you tokenize the text? There are some brute force ways of tokenizing (tokenize on uni-grams) and some middle-of-the-road approaches (tokenize on bi-grams, which is what lucene's built-in CJKAnalyzer does). We started with the CJKAnalyzer but unfortunately figured out that it just doesn't cut it, especially because CJKAnalyzer does not let you search on single character tokens (as it only tokenizes on bi-grams). So, I decided to bite the bullet and upgrade solr to 1.4 (I should post my experiences on that some time too), which also means that we now have lucene 2.9 to play with. One of the contrib packages with lucene 2.9 is the smartCn analyzer (http://issues.apache.org/jira/browse/LUCENE-1882), and I decided to give it a shot. smartCn uses a built-in dictionary and a Hidden Markov Model (I have no clue what that means) to "smartly" tokenize the Chinese string. The smartCn analyzer did solve the problem of single character searches and might get us through the next release, but I still feel it doesn't solve the problem to my comfort level (multiple character searches return fewer results using smartCn than the CJKAnalyzer). I guess one way to solve it is to seed the built-in dictionary with Chinese terms which are relevant to our site; unfortunately, the built-in dictionary seems to be serialized into some binary format from a java class which I am having a hard time deciphering. I need to dig a little deeper into this and maybe shoot an email to the contributor of this package to see if there is some easy way of seeding the built-in dictionary.
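In case you want to see the difference for yourself, here is a minimal sketch that prints what each analyzer does to the same Chinese string (this assumes lucene 2.9 with the cjk and smartcn contrib jars on the classpath; the sample sentence is the one from the smartCn docs and means "I bought props and clothing"):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class CnTokenizeDemo {
        static void dump(String label, Analyzer analyzer, String text) throws Exception {
            TokenStream ts = analyzer.tokenStream("f", new StringReader(text));
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            System.out.print(label + ": ");
            while (ts.incrementToken()) {
                System.out.print("[" + term.term() + "] ");
            }
            System.out.println();
        }

        public static void main(String[] args) throws Exception {
            String text = "我购买了道具和服装";
            dump("CJK (bi-grams)", new CJKAnalyzer(), text);        // overlapping 2-char tokens
            dump("smartCn", new SmartChineseAnalyzer(true), text);  // dictionary-based words
        }
    }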

Thursday, March 11, 2010

oops, something when wrong...

So, there's lotsa buzz about Google Reader Play and I was tempted enough to try it. But when I when (sic) there, all I got was an "oops, something when wrong..retrying" message:
[screenshot of the error message]
Well, maybe that's what beta means these days...