document updated 18 years ago, on Dec 13, 2007

Be able to classify sites where they fit a neat definition. The most obvious question is "is this a search engine?", since search-engine referers are qualitatively different from non-search-engine referers: links on ordinary sites are selected by hand by a human, while search-engine result links are not.
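As a rough illustration of the search-engine check, here is a minimal Python sketch that classifies a referer by its hostname. The function name is_search_engine and the domain list are assumptions made up for this example; the list is deliberately incomplete, and a real classifier would need a maintained list (this one would miss country-specific domains such as google.co.uk, for instance).

```python
from urllib.parse import urlsplit

# Hypothetical, incomplete list used only for illustration.
SEARCH_ENGINE_DOMAINS = {"google.com", "bing.com", "duckduckgo.com", "search.yahoo.com"}

def is_search_engine(referer: str) -> bool:
    """Return True if the referer's host is a listed domain or a subdomain of one."""
    host = (urlsplit(referer).hostname or "").lower()
    # Require an exact match or a "." boundary so "notgoogle.com" does not match.
    return any(host == d or host.endswith("." + d) for d in SEARCH_ENGINE_DOMAINS)
```

Matching on the hostname rather than the whole URL string avoids false positives from pages that merely mention a search engine in their path or query.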
Be able to detect and remove duplicates wherever possible.
- Often, this is exactly the same as converting every URL to its canonical form
- In a few cases, there is NO canonical URL (e.g. the URL contains a junk-data field, but that field has constraints on it, so it can't simply be set to "1" or something similar, and there's no quick way to determine the "lowest" valid value for it). In these cases, we do need to implement what might be considered the more naive/direct duplicate-detection method.
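
The two approaches above might be sketched in Python as follows. Both functions are illustrative assumptions, not the method these notes settle on: canonicalize applies a few common normalizations (lowercase scheme and host, drop default ports and fragments, sort query parameters), and dedupe_by_content is one possible reading of the "direct" method, namely comparing a hash of each fetched page body when the URLs themselves cannot be made comparable.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

DEFAULT_PORTS = {("http", 80), ("https", 443)}

def canonicalize(url: str) -> str:
    """Normalize a URL so that trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    if parts.port and (scheme, parts.port) not in DEFAULT_PORTS:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters so that parameter order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is dropped: it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))

def dedupe_by_content(pages):
    """pages: iterable of (url, body_bytes). Keep the first URL seen per distinct body."""
    first_url_by_digest = {}
    for url, body in pages:
        digest = hashlib.sha256(body).hexdigest()
        first_url_by_digest.setdefault(digest, url)
    return list(first_url_by_digest.values())
```

For example, canonicalize("HTTP://Example.com:80/a?b=2&a=1#frag") gives "http://example.com/a?a=1&b=2". The content-hash fallback only catches byte-identical bodies and requires fetching every page, which is why it is the more expensive option reserved for URLs with no canonical form.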