Index of /dev/src/pl/roseland/VERSION2
      Name                          Last modified       Size  Description

[DIR] Parent Directory              03-Oct-2007 13:45      -
[DIR] story_contents/               23-Dec-2008 12:12      -
[TXT] do_searches                   27-Nov-2007 02:07     2k
[DIR] searches/                     04-Oct-2007 01:36      -
[TXT] report.html                   04-Oct-2007 01:33   128k
[TXT] blacklist_story_URLs.txt      04-Oct-2007 01:33    25k
[TXT] step4_generate_index.pl       04-Oct-2007 01:33     5k
[TXT] TODO.txt                      04-Oct-2007 01:07     1k
[TXT] picks.txt                     04-Oct-2007 00:45     1k
[TXT] step3_fetch_story_contents.pl 03-Oct-2007 17:53    10k
[TXT] finallist_story_URLs.txt      03-Oct-2007 17:14   163k
[TXT] step2_weed_out_story_URLs.pl  03-Oct-2007 17:14     7k
[TXT] step1_collect_story_URLs.pl   03-Oct-2007 17:02    10k
[DIR] story_URLs/                   03-Oct-2007 13:01      -
[TXT] whitelist_story_URLs.txt      03-Oct-2007 02:35     2k
[DIR] utils/                        03-Oct-2007 02:12      -
[TXT] punted.txt                    02-Oct-2007 04:38     3k
[DIR] bak/                          02-Oct-2007 03:55      -

The process generally goes (each script is followed by the output it produces):

    step1_collect_story_URLs
        story_URLs/*
    step2_weed_out_story_URLs
        finallist_story_URLs.txt
    step3_fetch_story_contents
        story_contents/*
    step4_generate_index
        report.html  OR  searches/*

The format of story_URLs/*, whitelist_story_URLs.txt, and finallist_story_URLs.txt is:
    
    - one entry per line
    - within each line, fields are delimited by tabs
    - first field is the primary canonical URL
    - second field is the cache URL, if one exists
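
As an illustration of the line format above, here is a minimal sketch of a reader for one entry. It is written in Python rather than the Perl the step scripts use, and the names StoryURL and parse_url_line are made up for this example; they do not come from the scripts themselves. It assumes a missing or empty second field means there is no cache URL.

```python
from typing import NamedTuple, Optional


class StoryURL(NamedTuple):
    """One entry from story_URLs/*, the whitelist, or the final list (hypothetical type)."""
    canonical_url: str
    cache_url: Optional[str]


def parse_url_line(line: str) -> StoryURL:
    """Split one tab-delimited entry into its canonical URL and optional cache URL."""
    fields = line.rstrip("\r\n").split("\t")
    canonical = fields[0]
    # Assumption: an absent or empty second field means "no cache URL exists".
    cache = fields[1] if len(fields) > 1 and fields[1] else None
    return StoryURL(canonical, cache)


def parse_url_lines(text: str) -> list:
    """Parse a whole file's text, one entry per line, skipping blank lines."""
    return [parse_url_line(line) for line in text.splitlines() if line.strip()]
```

For example, parse_url_line("http://example.com/a\thttp://cache.example/a") gives an entry with both URLs set, while a line with only one field gives cache_url = None.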

The format of story_contents/* is:

    - the top line of each file is the same as that file's story_URLs/* entry, with two modifications:
        - if the cache URL had to be used to fetch the story, that URL is preceded by a ">"
        - a third field is added: the date the story was published (in perl format)
    - the remainder of the file is the HTML contents of the story  (with all the excess trimmed off, so that the excess doesn't match any regexes)
    - the remainder should have a <title> somewhere in its body
        - optionally, the body can also contain a <subtitle>
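
The story_contents/* layout described above could be read back with something like the following sketch. It is Python, not the Perl of the step scripts, and Story, parse_story and _tag_text are hypothetical names for this example. It assumes the <title> and <subtitle> elements have matching closing tags, and it keeps the date as the raw string from the header line, since the exact date format is not spelled out here.

```python
import re
from typing import NamedTuple, Optional


class Story(NamedTuple):
    """Parsed view of one story_contents/* file (hypothetical type)."""
    canonical_url: str
    cache_url: Optional[str]
    used_cache: bool          # True if the cache URL was marked with ">"
    published: Optional[str]  # raw third field, left unparsed
    title: Optional[str]
    subtitle: Optional[str]
    html: str


def _tag_text(tag: str, html: str) -> Optional[str]:
    # Assumption: the element is written as <tag>...</tag> with no attributes.
    m = re.search(r"<%s>(.*?)</%s>" % (tag, tag), html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None


def parse_story(text: str) -> Story:
    """Split a file into its tab-delimited header line and its HTML body."""
    header, _, body = text.partition("\n")
    fields = header.rstrip("\r").split("\t")
    fields += [""] * (3 - len(fields))  # pad so short header lines still unpack
    canonical, cache, published = fields[:3]
    used_cache = cache.startswith(">")
    if used_cache:
        cache = cache[1:]  # drop the ">" marker
    return Story(canonical, cache or None, used_cache, published or None,
                 _tag_text("title", body), _tag_text("subtitle", body), body)
```

A file whose header has ">" in front of the cache URL comes back with used_cache set to True and the marker stripped from cache_url. A body without a <subtitle> gives subtitle = None.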