document updated 3 months ago, on May 13, 2022

This article discusses content similarity detection. The goal is to, given a large number of documents, try to find which documents are most similar to each other. Here, the documents can be anything from natural-language documents, to individual code files, to the set of all users who gave a particular Netflix movie a high rating.

Note that it *is* possible to find exact solutions to these problems, by using exact string metrics and calculating the distance between every pair of documents. However, in practice this is **extremely** computationally expensive. This is doubly true because these algorithms are often used on very large datasets, for example, comparing every web page on the internet for duplicates. The computational complexity of comparing every pair of documents in a dataset is "n choose 2", which is O(n^{2}), which is unacceptable except for the smallest of datasets.

So instead we use approximation algorithms — algorithms that don't guarantee an exact answer, but rather provide a guaranteed limit on how inaccurate they can be, and in exchange they provide much better performance. (unfortunately approximation algorithms tend to be more complex and sometimes harder to understand)

One approach to this is to break each document into a set, and compare the similarity between each set. This is called the Jaccard similarity coefficient.

Now, sets are unordered, while documents are usually ordered. We can retain some amount of comparison-of-orderliness of documents by using shingling — we create the set by extracting overlapping chunks of the document. Shingling is very similar to (the same as?) the concept of n-grams.

Using Jaccard similarity coefficients doesn't automatically make the problem easier, as it still suggests comparing every pair of documents, which is O(n^{2}). However, it's possible to provide an *estimate* of Jaccard similarity by using algorithms like MinHash (or SimHash [2] ?)

Published implementations of the MinHash algorithm includes:

- datasketch, by Erkang (Eric) Zhu
- pincecone.io, which does similarity searches using the software-as-a-service model
- MinHash is an optional algorithm implemented inside Apache Lucene (more)
- MinHash is one of the algorithms that Apache Hivemall implements

I don't yet understand locality sensitive hashing, but apparently it's often used as a separate step, after running the MinHash algorithm.