Wikimedia blog

News from inside the Wikimedia Foundation

Posts Tagged ‘snapshots’

XML dumps resumed

Folks who use XML dumps of our projects will know that the dump process has been stalled while we investigated bug 23264. We have been running individual project dumps manually and asking people to inspect them carefully. We have just started the automated dumps up again, and various code fixes should be checked in shortly. Thanks to all for your assistance and your patience.

If you are working with the XML dumps of the English language Wikipedia containing all page revisions (pages-meta-history), please note the following issues with the two completed runs.

The January 30 run is missing the text for a large number of old revisions of articles, primarily revisions created between January 1, 2005 and May 14, 2005. This was due to bug 20757, which was subsequently fixed. If you are doing analysis using the text data, you can retrieve the missing text by extracting it from an earlier file; see the archives.
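For anyone who wants to pull the affected revisions out of an earlier file, here is a minimal sketch of one way to do it. It assumes the earlier dump is a bz2-compressed MediaWiki XML export with the usual page/revision layout and timestamps like 2005-03-01T00:00:00Z; the function name and the file name in the usage comment are hypothetical, and the upper bound is treated as exclusive so that all of May 14 is included.

```python
import bz2
import xml.etree.ElementTree as ET
from datetime import datetime


def local(tag):
    """Strip the XML namespace: '{ns}revision' -> 'revision'."""
    return tag.rsplit("}", 1)[-1]


def revisions_in_range(stream, start, end):
    """Yield (title, rev_id, timestamp, text) for start <= timestamp < end.

    `stream` is any binary file-like object holding export XML. The file
    is parsed incrementally, so the whole dump is never held in memory.
    """
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            # <title> closes before the page's revisions are seen.
            title = elem.text
        elif name == "revision":
            # Only direct children, so the contributor's nested <id>
            # does not overwrite the revision's own <id>.
            fields = {local(child.tag): child.text for child in elem}
            ts = datetime.strptime(fields["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
            if start <= ts < end:
                yield title, fields.get("id"), ts, fields.get("text") or ""
            elem.clear()  # free the revision text once handled
        elif name == "page":
            elem.clear()


# Usage (hypothetical file name):
# with bz2.open("enwiki-earlier-pages-meta-history.xml.bz2", "rb") as f:
#     for title, rev_id, ts, text in revisions_in_range(
#             f, datetime(2005, 1, 1), datetime(2005, 5, 15)):
#         print(title, rev_id, ts, len(text))
```

Matching on revision id rather than on title alone is probably the safer way to merge the recovered text back in, since pages can be renamed between runs.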

The March 12 run is incomplete; it is missing about the last third of the revisions, due to early termination during the compression step.
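If you are not sure whether a file you already downloaded is one of the truncated ones, a rough check is to decompress it to the end and look at the last few bytes. The sketch below assumes a bz2-compressed dump whose XML should finish with a closing </mediawiki> tag; the function name is hypothetical and this is only a sanity check, not a full validation of the dump.

```python
import bz2


def dump_looks_complete(path, closing_tag=b"</mediawiki>", chunk_size=1 << 20):
    """Return True if the bz2 file decompresses fully and ends with closing_tag."""
    tail = b""
    try:
        with bz2.open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                # Keep only the last few bytes; enough to hold the tag.
                tail = (tail + chunk)[-64:]
    except (EOFError, OSError):
        # EOFError: compressed stream ended before its end-of-stream marker.
        # OSError: the data is not valid bz2.
        return False
    return tail.rstrip().endswith(closing_tag)
```

Note that this reads the entire file, so on a full history dump it will take a while.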

The stubs files and the current page dumps appear to be fine, so statistical or other analyses that only use these files should not be impacted. The MySQL table dumps are also unaffected.

We apologize for the inconvenience and are working on getting out a set of complete full history dumps with all revision text intact.

Wikimedia XML data sets released on Amazon Public Data Sets

For our community members who do analysis on Wikimedia project data, I’m happy to announce the release of our XML snapshots within Amazon Public Data Sets.

http://aws.typepad.com/aws/2009/09/new-public-data-set-wikipedia-xml-data.html

To those curious about why this happened … earlier this year I was approached multiple times by researchers and community members who wanted to parse our data but were frustrated at the cost and time of doing it on their own infrastructure.

Many of them were already familiar with academic computational clusters and were wondering if there were any similar solutions available for them to do large-scale processing. The Toolserver was one option, but sometimes it didn’t provide the level of flexibility that they needed and/or their projects were too computationally intensive to run alongside other tasks.

Instead I mentioned that I had already been thinking of pushing our data sets to the Amazon cloud, as they had a wealth of infrastructure in place and it was infrastructure that our users were already familiar with.

Fast forward to now: we have our first release ready to be worked on, and I’m sure that we will hear back about new and exciting discoveries that our communities make.

This isn’t the first Wikipedia data set to exist in Public Data Sets, but it will be the first that Wikimedia is committing to supporting on a regular release cycle. Amazon will be picking it up every month and retaining copies for at least three months.

I’m excited to see the stats of how many people use it.

–tomasz

We’re adding an off site archive for Commons and the XML snapshots

Thanks are due to eBart consulting and User:Milosh for providing a backup server and storage array at their colocation facility in Europe. This server will store archives of our publicly available data from Wikimedia Commons and the XML snapshots.

Everyone knows that this has been a long time coming, as having an off site location for our data is extremely important for disaster recovery. With this archive in place we’ll have another external archive space for Commons image data to complement the one living at MIT.

Given the 10 TB donated, we’re likely to also store yearly archives of the XML snapshots.

This won’t stop us from continuing to be rigorous about our internal backups for the same data, along with keeping all of our users’ private data within our own data centers. It will simply be another physical space for us to archive our publicly available content.

While this offline mirror will only be used internally, we have some leads on other sponsors who might be able to offer a publicly available mirror. Over the next weeks we’ll be streamlining the offline archiving process and seeding the initial Commons upload, which currently comes in at just under 4 TB! Once we make some sense of how best to manage the archiving process, we’ll see who else is able to host our data.