academics > computer_science
document updated 6 months ago, on May 13, 2022

Content similarity detection

This article discusses content similarity detection. The goal is to, given a large number of documents, try to find which documents are most similar to each other. Here, the documents can be anything from natural-language documents, to individual code files, to the set of all users who gave a particular Netflix movie a high rating.

Note that it is possible to find exact solutions to these problems, by using exact string metrics and calculating the distance between every pair of documents. However, in practice this is extremely computationally expensive. This is doubly true because these algorithms are often used on very large datasets, for example, comparing every web page on the internet for duplicates. The computational complexity of comparing every pair of documents in a dataset is "n choose 2", which is O(n2), which is unacceptable except for the smallest of datasets.

So instead we use approximation algorithms — algorithms that don't guarantee an exact answer, but rather provide a guaranteed limit on how inaccurate they can be, and in exchange they provide much better performance. (unfortunately approximation algorithms tend to be more complex and sometimes harder to understand)

Set similarity

One approach to this is to break each document into a set, and compare the similarity between each set. This is called the Jaccard similarity coefficient.

Now, sets are unordered, while documents are usually ordered. We can retain some amount of comparison-of-orderliness of documents by using shingling — we create the set by extracting overlapping chunks of the document. Shingling is very similar to (the same as?) the concept of n-grams.

Using Jaccard similarity coefficients doesn't automatically make the problem easier, as it still suggests comparing every pair of documents, which is O(n2). However, it's possible to provide an estimate of Jaccard similarity by using algorithms like MinHash (or SimHash [2] ?)

Published implementations of the MinHash algorithm includes:

Locality sensitive hashing

I don't yet understand locality sensitive hashing, but apparently it's often used as a separate step, after running the MinHash algorithm.