Sunday, July 27, 2008

my thoughts on solr

Recently, we had a requirement to provide some advanced search capabilities for a public ASP.NET web site; one of the most important features was support for faceted navigation on a set of hierarchical fields. After researching the search providers currently on the market (Google search for enterprise, Mercado, Endeca, Omniture, etc.), we decided to go with solr (yes, price was one of the major deciding factors: solr is free!), and I have to admit I haven't been disappointed with it. Since this is a Windows/.NET shop, I decided to use solrsharp as the bridge.
Installing Jetty+solr was a breeze, but I ran into issues installing Jetty as a service (so that we wouldn't have to run solr in console mode). I also read somewhere that Jetty is not great at handling unicode characters, and since ours is a multilingual site, this was a big negative. So Jetty gave way to Tomcat. Installing Tomcat as a service was easy on Win32, but getting it to run on 64-bit Windows as a 64-bit application required patching the Tomcat files from the svn repo; after I patched Tomcat with the required executables, it installed just fine on a 64-bit machine.
Since solr supports faceted navigation out of the box (along with complex boolean queries), we had absolutely no issues meeting the requirements. I did run into issues with solrsharp not handling unicode characters (had to patch it) and not supporting NOT searches (had to patch again), but once I patched solrsharp, things have been going great.
Currently, I use the Standard request handler (and it looks like solrsharp doesn't support specifying a different request handler at query time). It works fine for most cases, but sometimes my search results are quite off, especially when searching on multiple terms. For example, a search on "Joan of Arc" (w/o the quotes) should ideally return documents where Joan, of and Arc appear in close proximity first, then other documents; but the result set returned by solr seems to be based more on word collisions (the number of times a word appears in a doc). Also, with the standard request handler you can't easily assign different weights to different fields in the document, so all fields are treated the same (you might want to give the Title field more weight than the description). The bad news is that the standard request handler can't handle all this complex stuff; the good news is that there's another request handler just for this: Dismax. So I just need to patch solrsharp so that I can tell it which request handler to use (or it can pick it up from some config file), and we should be able to use Dismax as our request handler.
All in all, it's been a breeze using solr: it's quite fast, supports clustering, runs just fine on Windows and, above all, is free & open source!
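To make the facet and Dismax parts a bit more concrete, here is a rough sketch of the HTTP parameters involved. It is plain Python rather than solrsharp, and it only builds the request URL. The host and port, the field names (title, description, category) and the boost values are made up for illustration, and it assumes a request handler named "dismax" is registered in solrconfig.xml (newer solr versions can select the parser with defType=dismax instead of qt).

```python
from urllib.parse import urlencode

# Assumed location of a local solr instance; adjust host, port and path.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def facet_params(query, facet_fields, filters=()):
    """Parameters for a faceted search, with optional drill-down filters."""
    params = [("q", query), ("facet", "true"), ("facet.mincount", "1")]
    # One facet.field entry per field we want value counts for.
    params += [("facet.field", field) for field in facet_fields]
    # Each fq narrows the result set, e.g. after the user clicks a facet value.
    params += [("fq", fq) for fq in filters]
    return params


def dismax_params(query, field_boosts, phrase_boosts, phrase_slop=2):
    """Parameters for a Dismax search with per-field weights.

    qf lists the fields to search and their boosts; pf boosts documents
    where the query terms appear close together as a phrase; ps is the
    slop (how far apart the terms may be) allowed for that phrase match.
    """
    def boosts(mapping):
        return " ".join(f"{field}^{boost}" for field, boost in mapping.items())

    return [
        ("q", query),
        ("qt", "dismax"),  # assumes a handler named "dismax" in solrconfig.xml
        ("qf", boosts(field_boosts)),
        ("pf", boosts(phrase_boosts)),
        ("ps", str(phrase_slop)),
    ]


def build_url(params, base_url=SOLR_SELECT_URL):
    """Encode the parameters into a GET URL for the select endpoint."""
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Title matches count three times as much as description matches,
    # and titles containing the terms close together get an extra boost.
    print(build_url(dismax_params(
        "Joan of Arc",
        field_boosts={"title": 3, "description": 1},
        phrase_boosts={"title": 5},
    )))
    # Facet counts on a hypothetical "category" field, filtered to one value.
    print(build_url(facet_params("arc", ["category"], ['category:"Books"'])))
```

The pf/ps pair is what should address the "Joan of Arc" problem: documents where the terms sit close together get an extra boost on top of the per-field weights in qf.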
Moral of the story: if you are planning to provide search functionality in your next site/application, use solr!

6 comments:

  1. Could you please mention the Tomcat files you patched to get Tomcat working on a 64-bit machine?

    regards

  2. This should help:
    http://blog.datajelly.com/company/blog/35-running-tomcat-as-a-64-bit-windows-service.html

  3. Can you suggest a good web hosting provider that supports Windows, Java and .NET?

  4. Depends on what kind of site you are looking to host, the traffic the site will attract, etc. For starters, godaddy.com is quite good.

  5. Hello, I'm looking to do a very similar thing (ASP.NET front end for solr). I'd be really grateful if you could share some source code.

    My understanding of solr is that it doesn't do web crawling or Word doc or PDF indexing. Are these things you tackled? If so, I'd like to understand how.

    Thanks in advance!

  6. sagey, for actual web crawling you'd use Nutch, not SOLR.

    And since version 1.4 SOLR has been able to index PDF/Doc/Flash/kitchen sink by using Tika.
