document updated Apr 18, 2007
- get from download.wikimedia.org, of course
- one-pass: various tools let the flat dump file be parsed and searched in the same pass, obviating the need to load it into a database application first (a minimal sketch follows this list)
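For the one-pass point above, here's a minimal sketch in Python (not part of the original notes): it stream-decompresses the .bz2 dump and scans it with an incremental XML parser, so nothing is ever loaded into a database or held in memory all at once. The file name and search term are placeholders.

    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "pages-articles.xml.bz2"   # placeholder path
    SEARCH = "hippopotamus"           # placeholder search term

    def local(tag):
        # Strip the XML namespace: "{http://...}page" -> "page"
        return tag.rsplit("}", 1)[-1]

    with bz2.open(DUMP, "rb") as f:
        context = ET.iterparse(f, events=("start", "end"))
        _, root = next(context)           # grab the root so we can clear it as we go
        title = None
        for event, elem in context:
            if event != "end":
                continue
            name = local(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                if elem.text and SEARCH in elem.text:
                    print(title)
            elif name == "page":
                root.clear()              # drop the finished <page>, keeping memory flat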
Good tools
Advice
- pages-articles.xml currently (as of Dec 2006) expands to ~8GB. The original archive takes up ~1.5GB, meaning you need ~9.5GB of free space to extract it.
- it takes ~1 hour to download (at ~500 KB/s) over my cable modem
- it takes <1 hour to bzip2 -d at low priority
- At least on my Core 2 Duo machine, files read in far faster (measuring total read+decompress time, or equivalently the uncompressed bitrate) when they're gzipped, versus bzip2-compressed or uncompressed. (However, my SQL parser routine is currently really slow, so bzip2 isn't noticeably slower once parsing is included; see the throughput sketch after this list.)
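A rough way to reproduce that gzip-vs-bzip2 measurement (again a sketch, not the original test): time how fast each variant of the same dump delivers uncompressed bytes. The file names are placeholders.

    import bz2, gzip, time

    def throughput(path, opener):
        # Read the whole file in 1 MiB chunks of *uncompressed* data and
        # report MB/s of uncompressed output (read + decompress combined).
        start = time.time()
        total = 0
        with opener(path, "rb") as f:
            while True:
                chunk = f.read(1 << 20)
                if not chunk:
                    break
                total += len(chunk)
        return total / (time.time() - start) / 1e6

    print("gzip :", throughput("pages-articles.xml.gz", gzip.open))
    print("bzip2:", throughput("pages-articles.xml.bz2", bz2.open))
    print("plain:", throughput("pages-articles.xml", open))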
Compressed-file indexing
Compressed files have two benefits: 1) they save disk space; 2) when there's plenty of CPU to spare, they can be read faster, because disk I/O is usually the tighter constraint.
However, random access may erode that second benefit, especially when block-based compression forces you to read a fair bit of extra data before and after the region you actually want. So if you access a file randomly on a regular basis, and you have the spare disk space, it may be simpler to leave it uncompressed and index it (a sketch of that approach follows).
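A rough sketch of that index-it-uncompressed idea, assuming the dump's usual one-tag-per-line layout for <page> and <title> (paths are placeholders, and the title extraction is deliberately simplistic rather than a real MediaWiki parser): record each article's byte offset once, then seek() straight to it later.

    import re

    DUMP = "pages-articles.xml"   # placeholder: the uncompressed dump
    index = {}                    # article title -> byte offset of its <page>

    # Pass 1: build the offset index.
    with open(DUMP, "rb") as f:
        offset = 0
        page_start = None
        for line in f:
            if b"<page>" in line:
                page_start = offset
            elif b"<title>" in line and page_start is not None:
                m = re.search(rb"<title>(.*?)</title>", line)
                if m:
                    index[m.group(1).decode("utf-8")] = page_start
            offset += len(line)

    # Later: jump straight to one article without rescanning the whole file.
    def read_page(title):
        with open(DUMP, "rb") as f:
            f.seek(index[title])
            chunk = []
            for line in f:
                chunk.append(line)
                if b"</page>" in line:
                    break
        return b"".join(chunk).decode("utf-8")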