Sunday, May 16, 2010

Contextual spelling suggestions using Solr SpellCheckComponent

One of the problems I faced while implementing spelling suggestions (or "Did you mean") in Solr was with queries containing multiple terms. If a user types in several terms and one or more of them are mistyped, Solr suggests corrections for each term in isolation (using edit distance), then combines the top suggestion for each term and returns the result as spellcheck.collation. That is all dandy if the user types just one search term or if your searches are OR searches. With an AND search (or some other form of AND using dismax), however, there are cases where the spellchecker's collation itself returns 0 results, which makes for a bad user experience. Let me try to explain this with an example:
Assume that the user types in "chicen fayita sadwich" and you have just two documents in your index:

Doc 1 ==> chicken fajita
Doc 2 ==> veg sandwich
Now, since Solr/Lucene treat the terms in isolation, what you get in spellcheck.collation is:
"chicken fajita sandwich". Unfortunately, if you run an AND search and the user clicks this "Did you mean" link, the resulting query (chicken AND fajita AND sandwich) returns zero results.
So, how do you solve this? One way is to use shingles to build your "spelling corpus". The only trouble with shingles is that the length of a suggestion is bounded by maxShingleSize: if you set maxShingleSize to, say, 4, you only get suggestions of up to 4 terms.
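To see where the limit comes from, here is a rough sketch of shingle generation in plain Java. It is not Lucene's ShingleFilter, only an illustration of the idea. The four-word document "grilled chicken fajita wrap" and the ShingleDemo class are assumptions invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative word-shingle generator (not Lucene's ShingleFilter). */
final class ShingleDemo {

    /** Returns every run of 2..maxShingleSize adjacent words, joined by a space. */
    static List<String> shingles(String text, int maxShingleSize) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder(words[start]);
            // Grow the shingle one word at a time until the size limit or the end of the text.
            for (int size = 2; size <= maxShingleSize && start + size <= words.length; size++) {
                sb.append(' ').append(words[start + size - 1]);
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical document, used only to show the size limit.
        String doc = "grilled chicken fajita wrap";
        // With a limit of 3 the full four-word phrase never appears as a shingle.
        System.out.println(shingles(doc, 3));
        // With a limit of 4 it does.
        System.out.println(shingles(doc, 4).contains(doc));
    }
}
```

Any query longer than the configured limit cannot be matched by a single shingle, which is the bound described above.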
Another cleaner, albeit slightly slower, approach is to extend solr.SpellCheckComponent and fire another Solr query on the collation itself. If that query returns more than 0 results (or exceeds some other threshold), you return the collation; otherwise you try the next set of suggestions (or just blank out the collation if you don't want to fire multiple queries). This is what I did (though I am not 100% sure this is the right way of firing Solr queries): extend solr.SpellCheckComponent and hook into the process method to get a handle on the ResponseBuilder object, from which you can then get the SolrQueryRequest and SolrIndexSearcher objects. Then override the toNamedList method: get the collation string, fire another Solr query using the SolrIndexSearcher, and check the result count. If the original query already returned more than MIN_THRESHOLD hits, the collation is dropped outright. Otherwise, if the suggestion's hit count is greater than the original query's hit count times MULTIPLIER, the collation is left as is; if not, it is removed. The guts of this is the overridden toNamedList method, which is below in case somebody is interested:
    protected NamedList toNamedList(SpellingResult spellingResult, String origQuery,
            boolean extendedResults, boolean collate)
    {
        NamedList result = super.toNamedList(spellingResult, origQuery,
                extendedResults, collate);
        if(collate){
            String collation = (String) result.get("collation");
            if(collation!=null && collation.length() > 0 && builder!=null){
                //fire a query and get the results
                try {
                    //only add spelling suggestion in case results are less than some threshold
                    int hits = builder.getResults().docList.matches();
                    if(hits>MIN_THRESHOLD){
                        result.remove("collation");
                        //result.add("collation", "");
                        return result;
                    }
                    SolrIndexSearcher searcher = builder.req.getSearcher();
                    QParser qp = QParser.getParser(collation, "dismax", builder.req);
                    NamedList params = new NamedList();
                    params.add("rows", 0);
                    params.add("omitHeader","true");
                    SolrParams localParams = SolrParams.toSolrParams(params);
                    qp.setLocalParams(localParams);
                    Query q = qp.getQuery();
                    TopDocs docs = searcher.search(q, 1);
                    int suggestionHits = docs.totalHits;
                    //try to get hits for this query
                    log.info("current hits:" + hits);
                    log.info("total number of hits:" + suggestionHits);
                    if(suggestionHits <= hits*MULTIPLIER){
                        //remove the collation
                        result.remove("collation");
                        //result.add("collation", "");
                    }
                } catch (IOException e) {
                    log.error(e.toString());
                } catch (ParseException e) {
                    log.error(e.toString());
                }
            }
        }
        return result;
    }
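The decision rule in the method above can be pulled out of Solr and tried on its own. The sketch below is a simplified restatement, not a replacement for the component. CollationFilter, its parameters, and the hitCounter function (which stands in for the SolrIndexSearcher query) are hypothetical names. The post does not give values for MIN_THRESHOLD and MULTIPLIER, so the 5 and 1.0 used here are placeholders.

```java
import java.util.function.ToIntFunction;

/** Stand-alone restatement of the keep-or-drop rule for a collation. */
final class CollationFilter {

    /**
     * Returns the collation if it should be shown to the user, or null if it should be dropped.
     * hitCounter is a stand-in for running the collation as a query and counting matches.
     */
    static String filter(String collation, int originalHits, int minThreshold,
                         double multiplier, ToIntFunction<String> hitCounter) {
        if (collation == null || collation.isEmpty()) return null;
        // The original query already did well enough: no suggestion needed.
        if (originalHits > minThreshold) return null;
        int suggestionHits = hitCounter.applyAsInt(collation);
        // The suggestion must beat the original query by the given factor.
        if (suggestionHits <= originalHits * multiplier) return null;
        return collation;
    }

    public static void main(String[] args) {
        // Fake index: only "chicken fajita" matches a document under AND semantics.
        ToIntFunction<String> hitCounter = q -> q.equals("chicken fajita") ? 1 : 0;

        // Placeholder threshold (5) and multiplier (1.0); the real values are not given in the post.
        System.out.println(filter("chicken fajita sandwich", 0, 5, 1.0, hitCounter)); // prints null
        System.out.println(filter("chicken fajita", 0, 5, 1.0, hitCounter));          // prints chicken fajita
    }
}
```

Keeping the rule behind a function like this also makes it easy to try the next-best suggestion when the first collation is rejected, which is the multi-query variant mentioned earlier.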