document updated Apr 18, 2007
- get from download.wikimedia.org, of course
- one-pass: various tools let the flat dump file be parsed and searched in the same pass, obviating the need to load it into a database application first (a minimal sketch follows this list)
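For the one-pass point above, here's a minimal sketch in Python (not part of the original notes): it stream-decompresses the .bz2 dump and scans it with an incremental XML parser, so nothing is ever loaded into a database or held in memory all at once. The file name and search term are placeholders.

    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "pages-articles.xml.bz2"   # placeholder path
    SEARCH = "hippopotamus"           # placeholder search term

    def local(tag):
        # Strip the XML namespace: "{http://...}page" -> "page"
        return tag.rsplit("}", 1)[-1]

    with bz2.open(DUMP, "rb") as f:
        context = ET.iterparse(f, events=("start", "end"))
        _, root = next(context)           # grab the root so we can clear it as we go
        title = None
        for event, elem in context:
            if event != "end":
                continue
            name = local(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                if elem.text and SEARCH in elem.text:
                    print(title)
            elif name == "page":
                root.clear()              # drop the finished <page>, keeping memory flat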
Good tools
Advice
- pages-articles.xml currently (as of Dec 2006) expands to ~8GB. The original archive takes up ~1.5GB, meaning you need ~9.5GB of free space to extract it.
- it takes ~1 hour to download (at ~500 KB/s) over my cable modem
- it takes <1 hour to bzip2 -d at low priority
- At least on my Core 2 Duo machine, files read in far faster (measuring total read+decompress time, or equivalently the uncompressed bitrate) when they're gzipped, versus bzip2-compressed or uncompressed. (However, my SQL parser routine is currently really slow, so bzip2 isn't noticeably slower once parsing is included; see the throughput sketch after this list.)
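A rough way to reproduce that gzip-vs-bzip2 measurement (again a sketch, not the original test): time how fast each variant of the same dump delivers uncompressed bytes. The file names are placeholders.

    import bz2, gzip, time

    def throughput(path, opener):
        # Read the whole file in 1 MiB chunks of *uncompressed* data and
        # report MB/s of uncompressed output (read + decompress combined).
        start = time.time()
        total = 0
        with opener(path, "rb") as f:
            while True:
                chunk = f.read(1 << 20)
                if not chunk:
                    break
                total += len(chunk)
        return total / (time.time() - start) / 1e6

    print("gzip :", throughput("pages-articles.xml.gz", gzip.open))
    print("bzip2:", throughput("pages-articles.xml.bz2", bz2.open))
    print("plain:", throughput("pages-articles.xml", open))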
Compressed-file indexing
Compressed files have two benefits: 1) they save disk space; 2) when there's plenty of CPU to spare, they can be read faster, because disk I/O is usually the tighter constraint.
However, random access may erode that second benefit, especially when block-based compression forces you to read a fair bit of extra data before and after the region you actually want. So if you access a file randomly on a regular basis, and you have the spare disk space, it may be simpler to leave it uncompressed and index it (a sketch of that approach follows).
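A rough sketch of that index-it-uncompressed idea, assuming the dump's usual one-tag-per-line layout for <page> and <title> (paths are placeholders, and the title extraction is deliberately simplistic rather than a real MediaWiki parser): record each article's byte offset once, then seek() straight to it later.

    import re

    DUMP = "pages-articles.xml"   # placeholder: the uncompressed dump
    index = {}                    # article title -> byte offset of its <page>

    # Pass 1: build the offset index.
    with open(DUMP, "rb") as f:
        offset = 0
        page_start = None
        for line in f:
            if b"<page>" in line:
                page_start = offset
            elif b"<title>" in line and page_start is not None:
                m = re.search(rb"<title>(.*?)</title>", line)
                if m:
                    index[m.group(1).decode("utf-8")] = page_start
            offset += len(line)

    # Later: jump straight to one article without rescanning the whole file.
    def read_page(title):
        with open(DUMP, "rb") as f:
            f.seek(index[title])
            chunk = []
            for line in f:
                chunk.append(line)
                if b"</page>" in line:
                    break
        return b"".join(chunk).decode("utf-8")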