to-do:
  - convert the metadata to .xml format... the idea is to, OVER TIME,
    separate the code from the data as much as possible... this isn't even
    100% possible currently, but perhaps it will be eventually?
  - move as many domains as possible over to the getkey/mapkey system,
    rather than using more regexps than are necessary
  - move to a paradigm where the tool TELLS ME when it needs extra
    information... it highlights all the possible problems (e.g. possible
    extra affiliate ID; possible extraneous www.; unrecognized host; etc.)
    so that I don't have to constantly double-check it... it knows much
    quicker than I do
  - do more pre-processing on the host rules, so that the
    metadata-specification file can be simpler while the matching still
    runs quickly
      - at the very least, allow ALL of one host's data to be located in
        ONE spot
  - (possibly?) allow the tool to UPDATE the metadata itself, after I
    directly answer questions from it...
      - ESPECIALLY when there are %per_page_key's... it should be able to
        FIGURE OUT what each key is, automatically!, and then cache its
        result
      - it should be able to FIGURE OUT whether www. redirects to the base
        host or not (and vice versa)
  - begin including metadata on DIFFERENT TYPES of pages... right now, it's
    really FHG-centric... start including metadata on TGPs, TGP-thumbs,
    and FHG-thumbs.
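As a rough illustration of the "tool TELLS ME" item in the list above, here is a minimal sketch of what such a check could look like. It is written in Python rather than the module's Perl, and everything in it is hypothetical: the `KNOWN_HOSTS` table, the `AFFILIATE_HINTS` set, and the `diagnose` function are made up for the example and are not part of the existing tool.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical per-host metadata: the query keys each known host is
# expected to use. The real metadata lives in the Perl module.
KNOWN_HOSTS = {
    "example.com": {"expected_keys": {"id"}},
}

# Hypothetical parameter names that often carry an affiliate ID.
AFFILIATE_HINTS = {"aff", "affid", "ref", "partner"}


def diagnose(url):
    """Return a list of warnings describing possible problems with one URL."""
    warnings = []
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    bare = host[4:] if host.startswith("www.") else host

    if host in KNOWN_HOSTS:
        rules = KNOWN_HOSTS[host]
    elif bare in KNOWN_HOSTS:
        # Only the www-less form is known, so flag the prefix.
        warnings.append("possible extraneous www.")
        rules = KNOWN_HOSTS[bare]
    else:
        # Nothing else can be checked without metadata for this host.
        return ["unrecognized host: " + host]

    for key, _value in parse_qsl(parts.query, keep_blank_values=True):
        if key not in rules["expected_keys"] and key.lower() in AFFILIATE_HINTS:
            warnings.append("possible extra affiliate ID: " + key)
    return warnings
```

The point of returning a list of warnings instead of failing on the first one is that every suspicious URL can be reviewed in one pass, which is what the item asks for.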
(TGP and TGP-thumb metadata are already included in the main module, but
again, they need to start being separated out.)

Anyway, this CAN be a more gradual transition:
  iteration 1: code and data tightly integrated; data formatted purely for
               speed, with no regard for readability/maintainability
  iteration 2: adds a preprocessor, so that data is stored in a format
               that's readable/maintainable (though still in perl) and
               converted to a format that's speedy; at least SOME pressure
               to separate code and data whenever possible
  iteration 3: convert data from perl to XML format; increasing pressure to
               move hard-coded exceptions to pure-data
  iteration 4: ALL hard-coded exceptions have been moved to pure-data; XML
               format is writeable as well as readable, allowing the code
               to move its proprietary cache data into

Other TODO items (unrelated to the "multiple great migrations" outlined
above):
  - remember ALL FHG URLs we run across... at the very least, this can be
    useful for historical analysis (if not for making our own collection of
    stuff)
      - possibly even cache the thumbnails?
  - START BEING ABLE TO RUN GOOGLE DATA (as well as historical TGP data)
    through this in BATCH MODE. It should continue to support the real-time
    mode as well, but since the metadata here is increasingly becoming the
    most valuable part of this, it would be nice to 1) leverage it for
    other things, 2) make it more mature by using it on a wider variety of
    data.
  - RECORD ALL TGP->FHG relationships... while this takes up extra space,
    and isn't TERRIBLY valuable... it's still actually useful data, and
    since we're trying to get the most-bang-for-the-byte-downloaded, we
    should record this too

Priority/schedule:
  - start recording all FHG URLs (we can record their thumbnails at a later
    time perhaps)
  - transition from iter#1 to iter#2
  - have it highlight the metadata improvements that it's almost certain
    need to be made
  - get the batch-mode version working (either with historical data, or
    Google data, or both)

What are the goals here?
  - make a large (but static) archive of plain FHG URLs available? (doesn't
    lend itself to a "community", but they're much simpler to digest
    anyway)
  - make the metadata available (lends itself to community-sharing... on
    the other hand, the number of people who can write or digest these
    things is pretty small... then again, they're EXTREMELY useful)
  -