Algorithms for content similarity detection

This article discusses content similarity detection algorithms. The task is to, given a large number of documents, find which documents are most similar to each other (see Wikipedia articles 'string metric' and 'similarity measure'). Here, the documents can be anything from natural-language documents, to individual code files, to the set of all users who gave a particular Netflix movie a high rating.

Note that it is possible to find exact solutions to these problems, by using exact string metrics and calculating the distance between every pair of documents. However, in practice this is extremely computationally expensive. This is doubly true because these algorithms are often used on very large datasets, for example, comparing every web page on the internet for near-duplicates. The computational complexity of comparing every pair of documents in a dataset is "n choose 2", which is O(n²), which is unacceptable except for the absolutely smallest datasets.

So instead we use approximation algorithms — algorithms that don't guarantee an exact answer, but rather provide a guaranteed limit on how inaccurate they can be, and in exchange they provide much better performance. (One downside seems to be that approximation algorithms are more complex and harder to understand? To my eyes, at least.)

Set similarity / bag similarity

One approach to this is to break each document into a set, and compare the similarity between each set. This is called the Jaccard similarity coefficient.

Now, sets are unordered, while documents are usually ordered. We can retain some amount of comparison-of-orderliness of documents by using shingling — we create the set by extracting overlapping chunks of the document. Shingling is very similar to (the same as?) the concept of n-grams.

Using Jaccard similarity coefficients doesn't automatically make the problem easier, as it still suggests comparing every pair of documents, which is O(n²). However, it's possible to provide an estimate of Jaccard similarity by using algorithms like MinHash or SimHash [2].

Published implementations of the MinHash algorithm includes:

datasketch, by Erkang (Eric) Zhu
pincecone.io, which does similarity searches using the software-as-a-service model
MinHash is an optional algorithm implemented inside Apache Lucene (more)
MinHash is one of the algorithms that Apache Hivemall implements

Published implementations of Jaccard similarity:

SimString [github]

Similarity without using sets

Cosine similarity is one common alternative.

Locality sensitive hashing

I don't yet understand locality sensitive hashing, but apparently it's often used as a separate step, after running the MinHash algorithm.

modules in Perl that might be useful

MinHash
- Text::SpeedyFx (if you're confused somewhat, read the source code to minhash_cmp)
Jaccard similarity
- Set::Similarity::Jaccard. Other related packages by the same author:
  - Set::Similarity::BV::Jaccard, an implementation using fast bit vectors
  - Bag::Similarity::Jaccard
- Set::Jaccard::SimilarityCoefficient
cosine similarity
- besides the above packages which sometimes include cosine similarity, there's also Set::Similarity::CosinePP
shingling
other
- SimString::Wrapper, a wrapper for SimString