Index of /dev/src/pl/roseland/VERSION2
Name                             Last modified      Size  Description
Parent Directory                 03-Oct-2007 13:45  -
story_contents/                  23-Dec-2008 12:12  -
do_searches                      27-Nov-2007 02:07  2k
searches/                        04-Oct-2007 01:36  -
report.html                      04-Oct-2007 01:33  128k
blacklist_story_URLs.txt         04-Oct-2007 01:33  25k
step4_generate_index.pl          04-Oct-2007 01:33  5k
TODO.txt                         04-Oct-2007 01:07  1k
picks.txt                        04-Oct-2007 00:45  1k
step3_fetch_story_contents.pl    03-Oct-2007 17:53  10k
finallist_story_URLs.txt         03-Oct-2007 17:14  163k
step2_weed_out_story_URLs.pl     03-Oct-2007 17:14  7k
step1_collect_story_URLs.pl      03-Oct-2007 17:02  10k
story_URLs/                      03-Oct-2007 13:01  -
whitelist_story_URLs.txt         03-Oct-2007 02:35  2k
utils/                           03-Oct-2007 02:12  -
punted.txt                       02-Oct-2007 04:38  3k
bak/                             02-Oct-2007 03:55  -
The process generally goes (each step is followed by the output it produces):
step1_collect_story_URLs
  -> story_URLs/*
step2_weed_out_story_URLs
  -> finallist_story_URLs.txt
step3_fetch_story_contents
  -> story_contents/*
step4_generate_index
  -> report.html OR searches/*
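The step order above can be written down as data. The following is a minimal sketch in Python (not the author's Perl), assuming that a step counts as done once at least one of its outputs exists; the names STAGES, output_exists, and next_stage are hypothetical and do not appear in the original scripts.

```python
import glob
import os
from typing import List, Optional, Tuple

# Each step paired with the output(s) it is described as producing.
# Step 4 produces report.html OR searches/*, so either one counts.
STAGES: List[Tuple[str, List[str]]] = [
    ("step1_collect_story_URLs", ["story_URLs/*"]),
    ("step2_weed_out_story_URLs", ["finallist_story_URLs.txt"]),
    ("step3_fetch_story_contents", ["story_contents/*"]),
    ("step4_generate_index", ["report.html", "searches/*"]),
]


def output_exists(base_dir: str, patterns: List[str]) -> bool:
    """Return True if any pattern matches at least one path under base_dir."""
    return any(glob.glob(os.path.join(base_dir, p)) for p in patterns)


def next_stage(base_dir: str) -> Optional[str]:
    """Return the first step whose output is missing, or None if all exist.

    Assumption: existence of output is the only completeness check made here.
    """
    for script, patterns in STAGES:
        if not output_exists(base_dir, patterns):
            return script
    return None
```

Calling next_stage(".") in the working directory would name the step to run next under that assumption; it does not check whether an existing output is stale.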
The format of story_URLs/*, whitelist_story_URLs.txt, and finallist_story_URLs.txt is:
- one entry per line
- within each line, fields are delimited by tabs
- first field is the primary canonical URL
- second field is the cache URL, if one exists
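As an illustration of this line format, here is a minimal reader sketched in Python (not the author's Perl). The names StoryURL, parse_url_line, and parse_url_list are hypothetical, and skipping blank lines is an assumption rather than something the format description states.

```python
from typing import List, NamedTuple, Optional


class StoryURL(NamedTuple):
    primary: str          # first field: the primary canonical URL
    cache: Optional[str]  # second field: the cache URL, or None if absent


def parse_url_line(line: str) -> Optional[StoryURL]:
    """Parse one tab-delimited entry; return None for a blank line."""
    line = line.rstrip("\r\n")
    if not line.strip():
        return None  # assumption: blank lines carry no entry
    fields = line.split("\t")
    primary = fields[0]
    # The cache URL is optional, so treat a missing or empty field as None.
    cache = fields[1] if len(fields) > 1 and fields[1] else None
    return StoryURL(primary, cache)


def parse_url_list(text: str) -> List[StoryURL]:
    """Parse a whole file's text, one entry per line."""
    entries = []
    for line in text.splitlines():
        entry = parse_url_line(line)
        if entry is not None:
            entries.append(entry)
    return entries
```

For example, parse_url_list(open("finallist_story_URLs.txt").read()) would give one StoryURL per non-blank line.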
The format of story_contents/* is:
- the top line of each file is the same as that file's story_URLs/* entry, with two modifications:
  - if the cache URL was the one that had to be used, it is preceded by a ">"
  - the third field is the date the story was published (in Perl format)
- the remainder of the file is the HTML contents of the story (with all the excess trimmed off, so that the excess can't match any regexes)
- the remainder should have a <title> somewhere in its body
- optionally, the body can also contain a <subtitle>
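The story_contents/* layout described above can be read back with a short parser. This is a sketch in Python (not the author's Perl) under several assumptions: the ">" marker sits at the start of the second field, <title> and <subtitle> have closing tags, and the date is kept as a raw string because its exact "Perl format" is not specified here. The names Story, parse_story, and _tag_text are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Story(NamedTuple):
    primary_url: str
    cache_url: Optional[str]
    used_cache: bool          # True if the cache URL was marked with ">"
    published: Optional[str]  # third field, left unparsed
    title: Optional[str]
    subtitle: Optional[str]
    body: str                 # everything after the top line


def _tag_text(tag: str, html: str) -> Optional[str]:
    """Return the stripped text inside the first <tag>...</tag>, if any."""
    pattern = r"<%s\b[^>]*>(.*?)</%s\s*>" % (tag, tag)
    m = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None


def parse_story(text: str) -> Story:
    """Split a story file into its header fields and HTML body."""
    header, _, body = text.partition("\n")
    fields = header.rstrip("\r").split("\t")
    primary = fields[0]
    cache = fields[1] if len(fields) > 1 and fields[1] else None
    used_cache = False
    if cache is not None and cache.startswith(">"):
        used_cache = True
        cache = cache[1:]  # drop the marker, keep the URL
    published = fields[2] if len(fields) > 2 and fields[2] else None
    return Story(primary, cache, used_cache, published,
                 _tag_text("title", body), _tag_text("subtitle", body), body)
```

A file with no <subtitle> simply yields subtitle=None, which matches the tag being optional.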