document updated 1 year, 7 months ago, on Feb 1, 2023

This article discusses content similarity detection algorithms. The task is to, given a large number of documents, find which documents are most similar to each other (see Wikipedia articles 'string metric' and 'similarity measure'). Here, the documents can be anything from natural-language documents, to individual code files, to the set of all users who gave a particular Netflix movie a high rating.

Note that it *is* possible to find exact solutions to these problems, by using exact string metrics and calculating the distance between every pair of documents. However, in practice this is **extremely** computationally expensive. This is doubly true because these algorithms are often used on very large datasets, for example, comparing every web page on the internet for near-duplicates. The computational complexity of comparing every pair of documents in a dataset is "n choose 2", which is O(n^{2}), which is unacceptable except for the absolutely smallest datasets.

So instead we use approximation algorithms — algorithms that don't guarantee an exact answer, but rather provide a guaranteed limit on how inaccurate they can be, and in exchange they provide much better performance. (One downside seems to be that approximation algorithms are more complex and harder to understand? To my eyes, at least.)

One approach to this is to break each document into a set, and compare the similarity between each set. This is called the Jaccard similarity coefficient.

Now, sets are unordered, while documents are usually ordered. We can retain some amount of comparison-of-orderliness of documents by using shingling — we create the set by extracting overlapping chunks of the document. Shingling is very similar to (the same as?) the concept of n-grams.

Using Jaccard similarity coefficients doesn't automatically make the problem easier, as it still suggests comparing every pair of documents, which is O(n^{2}). However, it's possible to provide an ** estimate** of Jaccard similarity by using algorithms like MinHash or SimHash [2].

Published implementations of the MinHash algorithm includes:

- datasketch, by Erkang (Eric) Zhu
- pincecone.io, which does similarity searches using the software-as-a-service model
- MinHash is an optional algorithm implemented inside Apache Lucene (more)
- MinHash is one of the algorithms that Apache Hivemall implements

Published implementations of Jaccard similarity:

Cosine similarity is one common alternative.

I don't yet understand locality sensitive hashing, but apparently it's often used as a separate step, after running the MinHash algorithm.

- MinHash
- Text::SpeedyFx (if you're confused somewhat, read the source code to minhash_cmp)

- Jaccard similarity
- Set::Similarity::Jaccard. Other related packages by the same author:
- Set::Similarity::BV::Jaccard, an implementation using fast bit vectors
- Bag::Similarity::Jaccard

- Set::Jaccard::SimilarityCoefficient

- Set::Similarity::Jaccard. Other related packages by the same author:
- cosine similarity
- besides the above packages which sometimes include cosine similarity, there's also Set::Similarity::CosinePP

- shingling
- other
- SimString::Wrapper, a wrapper for SimString