document updated 18 years ago, on Dec 13, 2007
- Be able to classify sites where they fit a neat definition. The most obvious question is "is this a search engine?", since search-engine referers are qualitatively different from non-search-engine referers: links on ordinary sites are hand-selected by a human, while search-engine results are not.
- Be able to detect and remove duplicates, wherever possible.
- Often, this is exactly the same as converting every URL to its canonical form
- In a few cases, there is NO canonical URL (e.g. the URL contains a junk-data field, but
this field has constraints on it, so it can't simply be set to "1" or something like it,
and there's no way to quickly determine the "lowest" value of the random field). In these
cases, we do need to implement what might be considered the more naive/direct duplicate-detection method.
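
As a rough illustration of the duplicate-removal goal above, here is a minimal Python sketch. It is only one possible reading of these notes, not the actual implementation: the function names, the `JUNK_PARAMS` set, and the `sid` parameter are all hypothetical. The idea is to compare URLs by a key with the junk field dropped, while keeping the original URL (junk value included) so it still works when fetched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical junk-data query parameters (assumption; not named in these notes).
JUNK_PARAMS = {"sid"}


def comparison_key(url):
    """Return a string that is equal for URLs we treat as duplicates.

    The key is used for comparison only. It is not guaranteed to be a
    fetchable URL, because the junk field is dropped rather than set
    to a valid value.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # .hostname is already lower-cased
    if host.startswith("www."):
        host = host[len("www."):]
    if parts.port is not None:
        host = "%s:%d" % (host, parts.port)
    # Keep every query field except the junk ones, in their original order.
    query = [(name, value)
             for name, value in parse_qsl(parts.query, keep_blank_values=True)
             if name not in JUNK_PARAMS]
    # Assumption: fragments never matter for referer comparison, so drop them.
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/",
                       urlencode(query), ""))


def remove_duplicates(urls):
    """Keep the first URL seen for each comparison key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = comparison_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

For example, `remove_duplicates` would treat `http://www.example.com/page?id=7&sid=abc123` and `http://example.com/page?id=7&sid=zz9` as the same page and keep only the first. Where a true canonical URL exists, the key and the canonical URL can be the same thing; the separate key is only needed for the junk-field case.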
Canonical form
- adding or removing the www. prefix