Thursday, June 09, 2011

Cantor Pairing Function and Reversal

Update - In case you have to pair very large non-negative integers, do read my post on Elegant Pairing Function.

In my last post on Dice Coefficients I talked about how a nested NxN loop for finding similarity can be changed to a reducing inner loop since Similarity Score between X,Y is reversible i.e. Sim{X,Y} = Sim{Y,X}. This means if you have 26 sets in the universe (from A-Z), you would hit the same pair twice while calculating the Dice Coefficient between them. If we can somehow, store the similarity already calculated in the first iteration between two sets (say A,B), then we don’t have to re-calculate the similarity between (B,A). Hence, our nested loop can be rewritten as-
for (int i = 0; i < universe.Length; i++)
{
    for (int j = i + 1; j < universe.Length; j++)
    {
        // calc similarity between set[i] & set[j];
        // since j starts at i + 1, each unordered pair is visited exactly once
    }
}


By the way, for a very large value of N (universe.Length), the asymptotic complexity of the above algorithm is still O(N^2) – it performs N*(N-1)/2 comparisons – but at least we have cut the number of iterations roughly in half.
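To see the saving concretely, here is a minimal sketch (the `CountPairs` helper is mine, not from the original code) that counts how many times the inner body of the reduced loop runs:

```csharp
using System;

class PairLoopDemo
{
    // Counts the iterations of the reduced inner loop for a universe of n sets.
    static int CountPairs(int n)
    {
        int count = 0;
        for (int i = 0; i < n; i++)
        {
            for (int j = i + 1; j < n; j++)
            {
                count++; // each unordered pair {i, j} is hit exactly once
            }
        }
        return count;
    }

    static void Main()
    {
        // For 26 sets (A-Z), the naive NxN loop would run 26 * 26 = 676 times;
        // the reduced loop runs 26 * 25 / 2 = 325 times.
        Console.WriteLine(CountPairs(26));
    }
}
```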


Now that we have a similarity score between two sets, we need a mechanism to store this score and tag it with the pair for which it was calculated. It would also be nice if the scores could be stored in a hashtable (since retrieval is on average O(1)), with the key uniquely identifying the pair for which the score was calculated. In short, we need some way to uniquely encode two docIds into a single number – enter the "Cantor Pairing Function". A pairing function uniquely and reversibly encodes two natural numbers into a single number. Calculating the "Cantor pair" is quite easy, but the documentation on the reversal is a little convoluted. Anyway, below is the C# code for generating the unique number and then reversing it to get back the original numbers (for x, y ≥ 0).


static int CantorPair(short x, short y)
{
    // widen to long so the intermediate product (x + y) * (x + y + 1)
    // cannot overflow Int32 for large shorts; the final result still fits in an int
    return (int)((long)(x + y) * (x + y + 1) / 2 + y);
}

static short[] Reverse(int z)
{
    short[] pair = new short[2];
    // t is the index of the diagonal containing z; 8L * z keeps the
    // intermediate value from overflowing Int32 for large z
    int t = (int)Math.Floor((-1D + Math.Sqrt(1D + 8L * z)) / 2D);
    // long intermediates for the same overflow reason as in CantorPair
    int x = (int)((long)t * (t + 3) / 2 - z);
    int y = z - (int)((long)t * (t + 1) / 2);
    pair[0] = (short)x;
    pair[1] = (short)y;
    return pair;
}
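A quick round trip with the two functions above, plus an illustrative `Dictionary` cache for the similarity scores (the cache is my addition, not part of the original snippet):

```csharp
using System;
using System.Collections.Generic;

class CantorDemo
{
    static int CantorPair(short x, short y)
    {
        return (int)((long)(x + y) * (x + y + 1) / 2 + y);
    }

    static short[] Reverse(int z)
    {
        short[] pair = new short[2];
        int t = (int)Math.Floor((-1D + Math.Sqrt(1D + 8L * z)) / 2D);
        pair[0] = (short)((long)t * (t + 3) / 2 - z);
        pair[1] = (short)(z - (long)t * (t + 1) / 2);
        return pair;
    }

    static void Main()
    {
        int key = CantorPair(3, 4);      // (3 + 4) * (3 + 4 + 1) / 2 + 4 = 32
        short[] pair = Reverse(key);     // recovers {3, 4}
        Console.WriteLine("{0} -> ({1}, {2})", key, pair[0], pair[1]);

        // the encoded key can now index a similarity cache
        var simCache = new Dictionary<int, float>();
        simCache[CantorPair(0, 1)] = 0.75f;   // Sim(doc0, doc1)
    }
}
```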


As you can see, CantorPair() returns an int for two shorts to avoid overflowing the result; if you are trying to generate the pair number for two ints, you would need to return a long (and widen the intermediates accordingly).
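For completeness, here is a sketch of such an int overload (my own, following the same formula; the pair of two non-negative Int32 values just fits in a signed 64-bit long):

```csharp
using System;

class CantorPairLong
{
    // Hypothetical int overload of the pairing function from the post.
    static long CantorPair(int x, int y)
    {
        long sum = (long)x + y;      // widen before multiplying to avoid overflow
        return sum * (sum + 1) / 2 + y;
    }

    static void Main()
    {
        Console.WriteLine(CantorPair(3, 4));   // same result as the short version: 32
    }
}
```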

Saturday, June 04, 2011

Dice Coefficient–A naive Similarity engine using Set Theory

One of the ways to keep a user engaged on your site is to build some sort of recommendation engine, wherein the application automatically recommends similar "content" based on the user's browsing history. One classic example of this kind of recommendation engine is the one at Amazon, which shows you a list of recommendations based on your browsing history. Another approach is to suggest a list of "similar" content based on the item the user is currently viewing. For example, if you are on an online music store, currently viewing the album "Dark Side of the Moon" by Pink Floyd, then perhaps the application can show you a list of similar albums, say "Wish You Were Here" by Pink Floyd and "Houses of the Holy" by Led Zeppelin. The similarity between two items in these cases is generally calculated from some attributes which define the items. Obviously, similarity is a relative term – so to find which items are more similar to a given item, we need to score them: the one with the higher score is "more similar" than the one with the lower score. Hence, we need a way to calculate a similarity score between two items.

We had a similar problem to solve on our site: build a similarity engine based on the categories that the items are associated with. Categories are nothing more than "tags", so to speak, in the current Web 2.0 sense. One very naive approach to calculating a similarity score between two items X and Y, which have tags Tx and Ty respectively, is to count the tags common to both, so the similarity score between (X,Y) would be –
Sim(X,Y) = |{Tx} ∩ {Ty}|
For e.g. if
{Tx} = {"music", "rock", "pink floyd", "cult"} for X = "Dark Side of the Moon", where Tx = list of tags for that album,
and
{Ty} = {"music", "led zeppelin", "cult", "rock"} for Y = "Houses of the Holy", where Ty = list of tags for that album,
then Sim(X,Y) = 3 (the common tags being "music", "rock" and "cult").

We can similarly calculate the similarity between all N items in our collection (notice this is an O(N^2) operation – more on optimizing this later) and then easily find the Top-K most similar items for a given item, since generally you would only show the top 10 or so similar items. Below is a sample C# code snippet –

static int Similarity(IList<string> doc1Tags, IList<string> doc2Tags)
{
    HashSet<string> tx = new HashSet<string>(doc1Tags);
    HashSet<string> ty = new HashSet<string>(doc2Tags);
    tx.IntersectWith(ty);   // tx now holds only the common tags
    return tx.Count;
}
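A quick sanity check of the snippet above, using the tag sets from the earlier example (the driver code is mine, for illustration only):

```csharp
using System;
using System.Collections.Generic;

class SimilarityDemo
{
    static int Similarity(IList<string> doc1Tags, IList<string> doc2Tags)
    {
        HashSet<string> tx = new HashSet<string>(doc1Tags);
        HashSet<string> ty = new HashSet<string>(doc2Tags);
        tx.IntersectWith(ty);
        return tx.Count;
    }

    static void Main()
    {
        var tx = new List<string> { "music", "rock", "pink floyd", "cult" };
        var ty = new List<string> { "music", "led zeppelin", "cult", "rock" };
        // common tags: "music", "rock", "cult"
        Console.WriteLine(Similarity(tx, ty)); // prints 3
    }
}
```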


One big drawback of the above similarity engine is that it is heavily biased towards items with more tags, since items with more tags are more likely to have a higher number of common elements. The other drawback is that there is no correlation between the similarity scores of two different pairs – the score can be any integer – which makes it impossible to set a threshold or do any cross-similarity analysis. For example, you might want to limit the top-K similar documents to only those that are really similar, to cut down on noise, or you might want to know whether Doc A is more similar to Doc B than Doc C is to Doc D.

A better way of finding similarity is to length-normalize the similarity score. That way, (a) the score is not biased towards documents with more tags, and (b) the score always lies between 0 and 1, so it is easier to set thresholds if required. So how do we length-normalize our scores? One way is to divide twice the size of the intersection by |Tx| + |Ty|, where |Tx| = the number of tags for X. This gives us:

Sim(X,Y) = (2*|Tx ∩ Ty|)/(|Tx|+|Ty|)


This is exactly the Dice Coefficient, a way of measuring similarity between two sets. For the earlier tag sets, this gives Sim(X,Y) = (2 × 3)/(4 + 4) = 0.75. So let's change our earlier code to compute the Dice Coefficient:



static float Similarity(IList<string> doc1Tags, IList<string> doc2Tags)
{
    HashSet<string> tx = new HashSet<string>(doc1Tags);
    HashSet<string> ty = new HashSet<string>(doc2Tags);
    int lenTx = tx.Count;   // remember the set sizes before intersecting
    int lenTy = ty.Count;
    tx.IntersectWith(ty);   // tx now holds only the common tags
    float diceCoeff = (float)(2 * tx.Count) / (lenTx + lenTy);
    return diceCoeff;
}
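Running the normalized version on the same example tag sets (driver code added by me for illustration):

```csharp
using System;
using System.Collections.Generic;

class DiceDemo
{
    static float Similarity(IList<string> doc1Tags, IList<string> doc2Tags)
    {
        HashSet<string> tx = new HashSet<string>(doc1Tags);
        HashSet<string> ty = new HashSet<string>(doc2Tags);
        int lenTx = tx.Count;
        int lenTy = ty.Count;
        tx.IntersectWith(ty);
        return (float)(2 * tx.Count) / (lenTx + lenTy);
    }

    static void Main()
    {
        // 3 common tags, 4 tags on each side: Dice = (2 * 3) / (4 + 4) = 0.75
        var tx = new List<string> { "music", "rock", "pink floyd", "cult" };
        var ty = new List<string> { "music", "led zeppelin", "cult", "rock" };
        Console.WriteLine(Similarity(tx, ty)); // prints 0.75
    }
}
```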


This is pretty much what we did on our project, apart from a few minor adjustments and optimizing the whole thing so that we don't iterate N^2 times (hint – Sim(X,Y) = Sim(Y,X)). Also, I believe F# sets are better equipped for this than the HashSet<T> class in the BCL: F# sets are immutable, i.e. when you intersect with another set you get a completely new set instead of modifying the first set in place. This matters when you are calculating similarity in a loop, as you don't want the original set to be touched. The other improvement is to use a heap structure for storing the Top-K documents for a given document, instead of storing all the similar documents and only then picking the top K. Also, the loop that calculates the similarity of a given document against the other N documents is a great candidate for parallelization using the TPL – pity we are still on 3.5! Have fun!
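To make the Top-K idea concrete, here is a rough sketch (entirely my own, not our project's code): a real implementation would use a min-heap, but since .NET 3.5 has no built-in priority queue, this version keeps a small sorted list bounded at K entries.

```csharp
using System;
using System.Collections.Generic;

class TopKDemo
{
    // Keeps only the K best (score, docId) entries seen so far.
    static void AddToTopK(List<KeyValuePair<float, int>> topK, int k, float score, int docId)
    {
        topK.Add(new KeyValuePair<float, int>(score, docId));
        topK.Sort((a, b) => b.Key.CompareTo(a.Key));   // descending by score
        if (topK.Count > k)
        {
            topK.RemoveAt(topK.Count - 1);             // drop the current worst entry
        }
    }

    static void Main()
    {
        var topK = new List<KeyValuePair<float, int>>();
        float[] scores = { 0.2f, 0.9f, 0.5f, 0.75f, 0.1f };
        for (int docId = 0; docId < scores.Length; docId++)
        {
            AddToTopK(topK, 3, scores[docId], docId);
        }
        // topK now holds docs 1, 3, 2 with scores 0.9, 0.75, 0.5
        foreach (var entry in topK)
        {
            Console.WriteLine("doc {0}: {1}", entry.Value, entry.Key);
        }
    }
}
```

Since the list never grows past K + 1 elements, the repeated sort stays cheap; for a large K, a proper heap would be the better choice.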