Sunday, March 22, 2009

solr and locale sensitive sort

Another of those meaningless exercises:
============================
We've been using solr on our ASP.NET project for quite a while now and have been quite pleased with it. Everything went great until we had to support a character-based language (Chinese, to be more specific). Based on my research, I decided to use CJKAnalyzer for analyzing and tokenizing, and the results returned were generally acceptable. The only issue was that the results came back in a seemingly random order when I asked solr to sort them alphabetically. After a couple of hours of hunting through the solr user groups and looking at the solr source, I realized that solr doesn't provide any support for locale-sensitive sorting. Since finding the problem is half the solution, all I had to do was extend solr's built-in StrField, add support for another attribute (locale) in schema.xml, and pass that locale to a new instance of SortField. Voila, solr was happily returning the results sorted by pinyin. In case you are interested, below are the discussion around the problem and the JIRA issue for solr (which has the new locale-sensitive custom Java class):

http://www.nabble.com/CJKAnalyzer-and-Chinese-Text-sort-td22374195.html

https://issues.apache.org/jira/browse/SOLR-1073
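
For anyone wondering what "locale sensitive" means in practice, here is a minimal sketch using the JDK's java.text.Collator, which compares strings according to a locale's collation rules instead of raw character codes. This is not the class attached to the JIRA issue; the class name LocaleSort, the helper sortForLocale and the sample city names are made up for illustration, and whether the zh_CN ordering comes out as pinyin depends on the collation data shipped with your JDK.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of solr or of the SOLR-1073 patch.
public class LocaleSort {

    // Returns a copy of the values sorted by the collation rules of the given locale.
    static List<String> sortForLocale(List<String> values, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(collator); // Collator is a Comparator, so it can drive the sort directly
        return sorted;
    }

    public static void main(String[] args) {
        // Sample data (assumed for illustration): Shanghai, Beijing, Guangzhou.
        List<String> cities = Arrays.asList("上海", "北京", "广州");

        // Plain String ordering compares UTF-16 code units, which is what an
        // untyped string sort effectively gives you.
        List<String> byCodeUnit = new ArrayList<>(cities);
        byCodeUnit.sort(null); // null comparator means natural String ordering
        System.out.println("natural order:   " + byCodeUnit);

        // Locale-aware ordering for Simplified Chinese.
        System.out.println("zh_CN collation: " + sortForLocale(cities, Locale.CHINA));
    }
}
```

The same effect is easy to see with English text: natural String ordering puts "Banana" before "apple" because uppercase letters have lower character codes, while a Collator for Locale.ENGLISH orders them apple, Banana, cherry.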

1 comment:

  1. Maybe you can try ShuzhenAnalyzer, which is an analyzer for Chinese; its dictionary can hold an unlimited number of Chinese words. The download website is: http://www.shuzhen.net
