**Update**- In case you have to pair very large non-negative integers, do read my post on Elegant Pairing Function.

In my last post on Dice Coefficients I talked about how a nested NxN loop for finding similarity can be changed to a reducing inner loop since Similarity Score between X,Y is reversible i.e. Sim{X,Y} = Sim{Y,X}. This means if you have 26 sets in the universe (from A-Z), you would hit the same pair twice while calculating the Dice Coefficient between them. If we can somehow, store the similarity already calculated in the first iteration between two sets (say A,B), then we don’t have to re-calculate the similarity between (B,A). Hence, our nested loop can be rewritten as-

for (int i = 0; i < universe.length; i++) { for (int j = i + 1; j < universe.length; j++) { //calc similarity between set[i] & set[j] } }

By the way, for a very large value of N (universe.length), the average complexity of the above algorithm is still ~ O(N^2) but at least we have reduced the number of iterations by nearly half.

Now that we have a similarity score between two sets, we would need a mechanism to store this score and somehow tag it with the Pair for which the similarity is calculated. Also, it would be nice if the scores can be stored in a hashtable (as retrieval is on average O(1)) and make our hashtable key uniquely identify the pair for which the score is calculated. In short, we need some way to uniquely encode two docIds into a single number – enter "Cantor Pairing Function”. Pairing functions is a reversible process to uniquely encode two natural numbers into a single number. Calculating the “Cantor Pair” is quite easy but the documentation on the reversible process is a little convoluted. Anyway, below is the C# code for generating the unique number and then reversing it to get back the original numbers (for x,y>0).

static int CantorPair(short x, short y) { return ((x + y) * (x + y + 1)) / 2 + y; } static short[] Reverse(int z) { short[] pair = new short[2]; int t = (int)Math.Floor((-1D + Math.Sqrt(1D + 8 * z))/2D); int x = t * (t + 3) / 2 - z; int y = z - t * (t + 1) / 2; pair[0] = (short)x; pair[1] = (short)y; return pair; }

As you can see the CantorPair() returns back an int for two shorts to avoid any number overflows, if you are trying to generate the pair number for two ints, you would need to use long.

Thank you very much. I found your explanation very useful.

ReplyDeleteExcellent example, thanks!

ReplyDelete