Last updated: Apr 19, 2007
SQL files
- currently, the fastest/easiest way I've found to parse SQL files (sketched in code after this list) is to:
- leave $INPUT_RECORD_SEPARATOR at "\n" (this reads roughly 1 MB of data at a time, since that's about the length of the longest lines in the dump)
- ignore any line that doesn't start with "INSERT INTO"
- remove the "INSERT INTO `table` VALUES " preamble from lines that do have it
- a nifty trick: the remainder of the line can now be parsed with perl's eval(), which is both simple and efficient
- eval() returns one long flat list of fields, not broken out into rows, so you then have to break it into rows manually
- granted, eval() is exceedingly dangerous, so if you have any doubts about the trustworthiness of whoever created the dump (or the path it took to get to you), you should either use a completely different parser, or use the Safe module to ensure that the eval() can do nothing but return data. The :base_core opcode set seems to work fine, and doesn't slow the parsing noticeably.
- when I do the above, reading from a .gz file (which, without any parsing, I can stream at ~2200 kB/s), I'm able to parse enwiki-20070402-page.sql (metadata for the full set of pages, no page contents) at ~1400 kB/s, or about 8 minutes total, which totally blows away any other parsing method I've tried
- It's likely that using Parse::Flex would be a little faster than eval, though it would be significantly more complex than using Safe.pm.
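- to make the above concrete, here's a rough Perl sketch of the approach (a sketch, not a polished tool): it pipes the dump through gzip, skips non-INSERT lines, strips the preamble, and reval()s the remainder inside a Safe compartment limited to :base_core. The hard-coded column count and the page_id/page_namespace/page_title layout are my assumptions about the `page` table's schema; adjust them for whatever table you're parsing.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Safe;

    my $file     = 'enwiki-20070402-page.sql.gz';
    my $NUM_COLS = 11;    # columns per row in the `page` table (assumption -- check the CREATE TABLE)

    my $cpt = Safe->new;
    $cpt->permit_only(':base_core');    # restrict the eval to basic, side-effect-free ops

    open my $fh, '-|', "gzip -dc $file" or die "can't open $file: $!";

    while (my $line = <$fh>) {                  # $/ is still "\n", so this reads one INSERT per pass
        next unless $line =~ /^INSERT INTO/;    # skip CREATE TABLE statements, comments, etc.

        # strip the "INSERT INTO `table` VALUES " preamble and the trailing ";"
        $line =~ s/^INSERT INTO `\w+` VALUES //;
        $line =~ s/;\s*$//;

        # what's left is (v,v,...),(v,v,...),... which happens to be valid Perl list
        # syntax, so reval() in list context returns one long flat list of fields
        my @fields = $cpt->reval($line);
        die "eval failed: $@" if $@;

        # break the flat list back into rows of $NUM_COLS fields each
        while (my @row = splice @fields, 0, $NUM_COLS) {
            my ($page_id, $namespace, $title) = @row[0, 1, 2];
            print "$page_id\t$namespace:$title\n";
        }
    }
    close $fh;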
XML files
- My current preferred method of accessing the data is via XML::Smart.
- Though hand-parsing with regexps is safe too, since <'s and such are escaped (e.g. as &lt;) within the page contents, if any. (Just remember, after extracting the data, to unescape &amp;, &lt;, &gt;, &quot; and &#039;.)
- $INPUT_RECORD_SEPARATOR can be set to "</page>", which makes it easier to parse the file page-by-page. That string should never appear within the article text, since < and > are escaped there.
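- here's a rough Perl sketch of that record-separator trick (hand-parsing with regexps rather than XML::Smart); the fields pulled out (title and text) and the entity list in unescape() are illustrative assumptions, not an exhaustive treatment:

    #!/usr/bin/perl
    # e.g.:  gzip -dc pages-articles.xml.gz | perl this_script.pl
    use strict;
    use warnings;

    $/ = "</page>";    # read one <page>...</page> block per <>

    sub unescape {
        my ($s) = @_;
        $s =~ s/&lt;/</g;
        $s =~ s/&gt;/>/g;
        $s =~ s/&quot;/"/g;
        $s =~ s/&#0?39;/'/g;
        $s =~ s/&amp;/&/g;    # do &amp; last so "&amp;lt;" doesn't get double-unescaped
        return $s;
    }

    while (my $page = <STDIN>) {
        last unless $page =~ /<page>/;    # the trailing chunk after the last </page> has no page in it

        my ($title) = $page =~ m{<title>(.*?)</title>}s;
        my ($text)  = $page =~ m{<text[^>]*>(.*?)</text>}s;    # <text> carries attributes (xml:space, etc.)
        next unless defined $title;

        $title = unescape($title);
        $text  = defined $text ? unescape($text) : '';

        printf "%s (%d bytes of wikitext)\n", $title, length $text;
    }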