Saturday, June 04, 2011

Dice Coefficient–A naive Similarity engine using Set Theory

One of the ways to keep a user engaged on your site is to build some sort of recommendation engine – wherein the application automatically recommends similar “content” based on the user browsing history. One classic example of this kind of recommendation engine would be the one at amazon, which shows you a list of recommendations based on your browsing history. Another approach of doing a recommendation engine would be to suggest the user a list of “similar” content based on the current item that he’s viewing, for e.g., if you are on a online music store, and currently viewing “Dark side of the moon” album by Pink Floyd, then perhaps the application can show you list of albums similar to this for e.g.”Wish You Were Here” by Pink Floyd & “Meddle” by Led Zeppelin etc. The similarity between two items in these cases is generally calculated based on some attributes which define the items. Obviously, similarity is a relative term – so to find which items are more similar to a given item, we need to score them – the one with the higher score is “more similar” than the one with the lower score for a given item. Hence, there is a need to calculate a similarity score between two items.

We had a similar problem to solve on our site - build a similarity engine based on the categories that the items are associated with. Categories are nothing more than “tags”; so to speak; in the current web 2.0 semantics. One very naive approach to calculate similarity score between two items X, Y; which have tags Tx & Ty respectively would be to find the number of tags which are common in both, so the similarity score between (X,Y) would be -
Sim(X,Y) = |{Tx} ∩ {Ty}|
For e.g. if
{Tx} = {“music”, “rock”, “pink floyd”, “cult”} for X=”Dark Side of the moon”, where Tx = list of tags for that album.
and
{Ty} = {"music","led zeppelin","cult","rock"} for Y="Meddle", where Ty = list of tags for Meddle.
Then Sim(X,Y) = 3.

We can similarly, calculate the similarity between all the items N in our collection (if you notice it is a O(N^2) operation- more on optimizing this later) and then easily find the Top-K most similar items for a given item as generally, you would only be showing top 10 or so similar items. Below is the sample C# code snippet -

`        static int Similarity(IList<string> doc1Tags, IList<string> doc2Tags)        {            HashSet<string> tx = new HashSet<string>(doc1Tags);            HashSet<string> ty = new HashSet<string>(doc2Tags);            tx.IntersectWith(ty);            return tx.Count;        }`

One big drawback with our above similarity engine is that it is heavily biased towards items with more number of tags – as items with more tags are more likely to have higher number of common elements. The other drawback is that there is no correlation between the similarity scores of two different items, it can be any integer, making it impossible to set any kind of threshold or find do cross-similarity analysis for e.g. you might want to limit the top-k similar documents to only documents which are really similar to reduce the noise documents or you might want to know if Doc A is more similar to Doc B than Doc C is to Doc D.

A better way of finding similarity would be to length normalize the similarity score, this way a. The score is not biased towards document with more tags & b.  The similarity score is always between 0 & 1, so it’s easier to set the thresholds if required. So how do we length normalize, our scores? Below is one way of doing it, by dividing the intersection with |Tx| and |Ty|, where |Tx| = length of Tags for X. This gives us:

Sim(X,Y) = (2*|Tx ∩ Ty|)/(|Tx|+|Ty|)

This is what the Dice Coefficient is, a way of finding similarity measure between two sets. So lets change our earlier code to find Dice Coefficient:

`        static float Similarity(IList<string> doc1Tags, IList<string> doc2Tags)        {            HashSet<string> tx = new HashSet<string>(doc1Tags);            HashSet<string> ty = new HashSet<string>(doc2Tags);            int lenTx = tx.Count;            int lenTy = ty.Count;            tx.IntersectWith(ty);            float diceCoeff = (float)(2*tx.Count)/(float)(lenTx + lenTy);            return diceCoeff;        }`

This is pretty much what we did on our project apart from few minor adjustments and optimizing the entire stuff so that we don’t iterate N^2 times (hint – Sim(X,Y) = Sim(Y,X). Also, I believe the F# sets are better equipped for this instead of the HashSet<T> class that is part of the BCL as F# sets are immutable i.e. when you do an intersection with another set, you get a completely new set instead of modifying the first set in-place – this is important when you are calculating similarity in a loop as you don’t want the original set to be touched. The other thing is to use a Heap structure for storing the Top-K documents for a given document instead of storing all the similar documents and then only picking the top K similar documents. Also, the loop for calculating similarity for a given document with other n documents; is a great candidate for parallelization using the TPL – pity we are still on 3.5! Have fun!

1 comment:

1. The time complexity is not O(N^2), is O(N).

Creating HashSets takes O(N), as well as the IntersectsWith.