Tuesday, March 23, 2010
solr and indexing Chinese text
So, I've been kind of bugged trying to get our search functionality up to the mark for Chinese (I also have the other J and K in CJK to worry about, but more on that later). As some of you would know, the biggest issue with the CJK languages is that, unlike the Latin languages, there are no word boundaries. With no word boundaries, the question becomes: how do you tokenize the text? There are brute-force approaches (tokenize on uni-grams) and middle-of-the-road approaches (tokenize on bi-grams, which is what Lucene's built-in CJKAnalyzer does).

We started with the CJKAnalyzer but unfortunately figured out that it just doesn't cut it, especially because the CJKAnalyzer does not let you search on single-character tokens (as it only emits bi-grams). So, I decided to bite the bullet and upgrade Solr to 1.4 (I guess I should post my experiences on that some time too), which also means we now have Lucene 2.9 to play with. One of the contrib packages shipped with Lucene 2.9 is the smartcn analyzer (http://issues.apache.org/jira/browse/LUCENE-1882), and I decided to give it a shot. Smartcn uses a built-in dictionary and a Hidden Markov Model (I have no clue what that means) to "smartly" tokenize a Chinese string.

The smartcn analyzer did solve the problem of single-character searches and might get us through the next release, but I still feel it doesn't solve the problem to my comfort level (multiple-character searches return fewer results with smartcn than with the CJKAnalyzer). I guess one way to address that is to seed the built-in dictionary with Chinese terms relevant to our site; unfortunately, the built-in dictionary seems to be serialized into some binary format from a Java class, which I am having a hard time deciphering. I need to dig a little deeper into this, and maybe shoot an email to the contributor of the package to see if there is an easy way of seeding the built-in dictionary.
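To make the bi-gram problem concrete, here's a minimal sketch in plain Java (not the actual CJKAnalyzer code, just the same sliding-window idea) of how bi-gram tokenization chops up a run of CJK characters, and why a single-character query can never match any token:

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {

    // Slide a two-character window over the text, the way a bi-gram
    // tokenizer does for CJK runs: "中华人民" -> [中华, 华人, 人民].
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中华人民"));
        // A one-character query like "中" produces no bi-gram at all,
        // so it can never match a token in the index -- which is the
        // single-character search problem described above.
        System.out.println(bigrams("中"));
    }
}
```

(This assumes BMP characters, where one `char` is one Chinese character; good enough to illustrate the point.)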
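For anyone wanting to try smartcn themselves, here's roughly what the schema.xml field type might look like once the smartcn contrib jar is on Solr's classpath. This is a sketch under my assumptions: the field type name `text_zh` is made up, and you'd want to double-check the jar name against your Lucene 2.9 distribution.

```xml
<!-- Hypothetical field type wiring in the smartcn contrib analyzer
     (Lucene 2.9 / Solr 1.4). Assumes the lucene smartcn contrib jar
     has been dropped into Solr's lib directory. -->
<fieldType name="text_zh" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>
```

Since the analyzer is specified as a single class (rather than a tokenizer/filter chain), the same analysis is applied at both index and query time.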