I want to do cluster analysis on text files. (Any time you can calculate some metric between elements in a set, you can perform clustering on that set.)
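
To make the "any metric will do" point concrete, here is a minimal sketch in Python (not one of the Perl modules listed below). The names `jaccard_distance` and `cluster`, the word-set distance, and the 0.5 threshold are all assumptions chosen for illustration; any other distance function could be passed in instead.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Distance between two texts, based on their sets of lowercase words."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def cluster(items, distance, threshold):
    """Single-linkage clustering: join any two items within the threshold."""
    parent = list(range(len(items)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pair is compared: O(n^2) distance calls, which is the slow part.
    for i, j in combinations(range(len(items)), 2):
        if distance(items[i], items[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


texts = [
    "the quick brown fox",
    "the quick brown dog",
    "lorem ipsum dolor sit amet",
    "lorem ipsum dolor sit",
]
clusters = cluster(texts, jaccard_distance, threshold=0.5)
```

The clustering code never looks inside the items; it only calls `distance`, so swapping in an edit distance or any other metric needs no other change.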

- String::Cluster::Hobohm [github]
- Clusterize
- Text::SenseClusters [sourceforge]
- [not Perl] ssdeep, which uses context triggered piecewise hashing, a fuzzy-hashing scheme in the same spirit as locality-sensitive hashing

- Text::LevenshteinXS
- Text::WagnerFischer
- String::Similarity (the algorithm is by Myers, 1986)
- Text::JaroWinkler
- ... lots more ...
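
For reference, the dynamic program that the Wagner-Fischer name refers to can be sketched in a few lines. This is a plain Python illustration of the textbook algorithm, not the code of any module above; the function name `levenshtein` and the unit costs for insert, delete and substitute are assumptions.

```python
def levenshtein(a, b):
    """Edit distance via the Wagner-Fischer dynamic program, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

The cost is O(len(a) * len(b)) per pair, which is why comparing every pair of large text files gets expensive quickly.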

- Category:String similarity measures
- Approximate string matching
- Locality-sensitive hashing is one way to approximate a nearest-neighbor search

- plagiarism detection
- bioinformatics, genetic sequence analysis (which includes clustering)
- nearest neighbor search

How do we do text-file clustering, but FASTER?
Random ideas:

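- **use a spatial index**: the DBSCAN article notes that this results in an O(n log n) runtime
- more generally, any technique that speeds up the nearest neighbor search
- **MinHash**: use an algorithm that approximates the similarity between documents (MinHash estimates the Jaccard similarity of their shingle sets, which can stand in for the more expensive edit distance)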
- use an algorithm that clusters suboptimally
- locality-sensitive hashing
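
A rough sketch of how the MinHash and locality-sensitive hashing ideas fit together, in Python. Everything here is an assumption made for illustration: the helper names, the 5-character shingles, the 64 hash functions (built from salted `hashlib.blake2b`), and the 16 bands. The point is only that documents get compared through short signatures, and that banding proposes candidate pairs without looking at every pair.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=5):
    """Set of overlapping k-character substrings of the text."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents that agree on any whole band become candidates."""
    rows = len(signatures[0]) // bands
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            for x in range(len(ids)):
                for y in range(x + 1, len(ids)):
                    pairs.add((ids[x], ids[y]))
    return pairs


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "completely unrelated text about something else",
]
signatures = [minhash_signature(shingles(d)) for d in docs]
candidates = candidate_pairs(signatures)
```

Only the pairs in `candidates` would then need an exact (and slow) distance computation, so the clustering is approximate: similar documents that never share a band are missed.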