Document updated Feb 1, 2023

Algorithms for content similarity detection

This article discusses content similarity detection algorithms. The task: given a large number of documents, find which documents are most similar to each other (see the Wikipedia articles 'string metric' and 'similarity measure'). Here, a "document" can be anything from a natural-language document, to an individual code file, to the set of all users who gave a particular Netflix movie a high rating.

Note that it is possible to find exact solutions to these problems, by using exact string metrics and calculating the distance between every pair of documents. In practice, however, this is extremely computationally expensive, doubly so because these algorithms are often used on very large datasets, for example comparing every web page on the internet for near-duplicates. Comparing every pair of documents in a dataset requires "n choose 2" = n(n-1)/2 comparisons, which is O(n²), which is unacceptable for all but the smallest datasets.
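
To make the cost concrete, here is a sketch of the brute-force approach in Python. The distance function is assumed to be supplied by the caller (for example, Levenshtein edit distance); everything else is just bookkeeping:

    from itertools import combinations

    def all_pairs_distances(documents, distance):
        # Compares every pair of documents: n*(n-1)/2 calls to 'distance'
        # for n documents, which is what makes this approach O(n²).
        results = {}
        for (i, a), (j, b) in combinations(enumerate(documents), 2):
            results[(i, j)] = distance(a, b)
        return results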

So instead we use approximation algorithms — algorithms that don't guarantee an exact answer, but rather provide a guaranteed limit on how inaccurate they can be, and in exchange they provide much better performance. (One downside seems to be that approximation algorithms are more complex and harder to understand? To my eyes, at least.)

Set similarity / bag similarity

One approach is to break each document down into a set of features and compare the sets pairwise. The standard measure of overlap between two sets is the Jaccard similarity coefficient: the size of the sets' intersection divided by the size of their union.
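
A minimal Python sketch of the coefficient itself (the helper name jaccard() is just for illustration):

    def jaccard(a, b):
        # a and b are sets; the coefficient ranges from 0.0 (disjoint)
        # to 1.0 (identical sets).
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

For example, jaccard({'the', 'cat', 'sat'}, {'the', 'cat', 'ran'}) is 2/4 = 0.5.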

Now, sets are unordered, while documents are usually ordered. We can retain some of the document's ordering information by shingling: the set is built from overlapping chunks of the document, so each element captures a little local order. Shingling is essentially the same idea as n-grams.
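
For instance, a character-level shingling sketch in Python (the chunk length k=4 is an arbitrary choice here; word-level shingles are also common):

    def shingles(text, k=4):
        # Returns the set of all overlapping substrings of length k.
        return {text[i:i+k] for i in range(len(text) - k + 1)}

For example, shingles('abcdef', 3) gives {'abc', 'bcd', 'cde', 'def'}.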

Using Jaccard similarity coefficients doesn't automatically make the problem easier, since it still implies comparing every pair of documents, which is O(n²). However, it's possible to estimate Jaccard similarity much more cheaply by using algorithms like MinHash or SimHash [2].
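
A rough sketch of the MinHash idea in Python, working on shingle sets like the ones above. The salted-MD5 "hash family" here is a simplification for illustration; real implementations usually choose their hash functions more carefully and run much faster:

    import hashlib

    def minhash_signature(shingle_set, num_hashes=100):
        # One value per hash function: the minimum hash of any set element
        # under that function. Assumes shingle_set is non-empty.
        sig = []
        for i in range(num_hashes):
            sig.append(min(
                int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set
            ))
        return sig

    def estimate_jaccard(sig_a, sig_b):
        # The fraction of positions where two signatures agree is an
        # estimate of the Jaccard similarity of the underlying sets.
        matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
        return matches / len(sig_a)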

Published implementations of the MinHash algorithm include:

Published implementations of Jaccard similarity:

Similarity without using sets

Cosine similarity is one common alternative. It treats each document as a vector (for instance, of word counts) and measures the angle between vectors rather than the overlap between sets.
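
A minimal sketch over bag-of-words vectors (plain word counts here; real systems usually apply TF-IDF weighting first):

    import math
    from collections import Counter

    def cosine_similarity(text_a, text_b):
        # Represent each document as a vector of word counts, then compute
        # the cosine of the angle between the two vectors.
        va, vb = Counter(text_a.split()), Counter(text_b.split())
        dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
        norm_a = math.sqrt(sum(c * c for c in va.values()))
        norm_b = math.sqrt(sum(c * c for c in vb.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)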

Locality sensitive hashing

I don't yet understand locality sensitive hashing, but apparently it's often used as a separate step, after running the MinHash algorithm.
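
The commonly described trick is "banding": split each MinHash signature into bands of a few rows, hash each band into a bucket, and treat any two documents that collide in at least one band as a candidate pair to be checked exactly. A rough sketch of that idea (assuming signatures like the minhash_signature() sketch above; the bands/rows split is arbitrary):

    from collections import defaultdict

    def lsh_candidate_pairs(signatures, bands=20, rows=5):
        # signatures: dict mapping document id -> MinHash signature
        # (a list whose length must equal bands * rows).
        buckets = defaultdict(set)
        for doc_id, sig in signatures.items():
            for b in range(bands):
                band = tuple(sig[b * rows:(b + 1) * rows])
                buckets[(b, band)].add(doc_id)
        # Any two documents sharing a bucket in any band become candidates
        # for a full similarity check; everything else is skipped.
        candidates = set()
        for docs in buckets.values():
            if len(docs) > 1:
                docs = sorted(docs)
                for i in range(len(docs)):
                    for j in range(i + 1, len(docs)):
                        candidates.add((docs[i], docs[j]))
        return candidates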

Modules in Perl that might be useful