paperlined.org

Document updated on May 12, 2012

I was using this to try to cluster text files with an average size of 10 KB, to find files that are closely related. Clustering 10 files took about 1 minute. When I tried it with 100 files, I gave up. I didn't realize clustering consumed so much CPU.

Some ways to make this process faster that might be worth exploring (caveat: I don't know much about clustering):

do feature extraction to reduce the size of the strings being compared
descriptions of DBSCAN note that it can achieve O(n log n) average runtime if a spatial index is used
locality-sensitive hashing
MinHash
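
To give a feel for the last idea, here is a minimal MinHash sketch. It is written in Python purely for illustration and is not the module discussed above. The function names, the 5-character shingle size, and the 128 hash functions are all assumptions I picked for the example, not tuned values. Each file is reduced to a short fixed-length signature, and the fraction of matching signature positions estimates the Jaccard similarity of the two files' shingle sets.

```python
import hashlib
import random

# Large Mersenne prime used as the modulus for the hash family (an assumption
# of this sketch; any sufficiently large prime would do).
PRIME = (1 << 61) - 1


def shingles(text, k=5):
    """Return the set of overlapping k-character substrings of text."""
    text = " ".join(text.split())  # collapse runs of whitespace
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash64(s):
    """Stable 64-bit hash of a string (Python's built-in hash() is salted)."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(num_hashes=128, seed=0):
    """Generate (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum hash value over all shingles."""
    base = [_hash64(s) % PRIME for s in shingle_set]
    return [min((a * h + b) % PRIME for h in base) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

With signatures computed once per file, comparing two files costs a pass over 128 integers instead of a comparison of two 10 KB strings. Every pair still has to be compared, though, so this alone does not remove the quadratic growth in the number of files. Avoiding that would need something like locality-sensitive hashing on top of the signatures, which this sketch does not attempt.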