Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Posts Tagged ‘datasets’

Do It Yourself Analytics with Wikipedia

As you probably know, we regularly publish backups of the different Wikimedia projects, containing their complete editing history. As time progresses, these backups grow larger and become increasingly hard to analyze. To help the community, researchers and other interested people, we have developed a number of analytic tools to assist you in analyzing these large datasets. Today, we want to update you on these new tools, what they do and where you can find them. Please remember that they are all still in development:

  • Wikihadoop
  • DiffDB
  • WikiPride

Wikihadoop

Wikihadoop makes it possible to run Hadoop MapReduce jobs on the compressed XML dump files. This means that processing our XML files becomes embarrassingly easy to parallelize, so we no longer have to wait days or weeks for a job to finish.

We used Wikihadoop to create the diffs for all edits from the English XML dump that was generated in April of this year.
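
To give a feel for the MapReduce pattern, here is a minimal sketch in plain Python that counts revisions per contributor. It runs locally and does not use Hadoop or Wikihadoop; the sample pages are invented and much simpler than a real dump (no XML namespace, most elements omitted), and the function names are purely illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented, heavily simplified <page> elements (assumption: real dumps
# carry an XML namespace and many more fields per revision).
SAMPLE_PAGES = [
    "<page><title>A</title>"
    "<revision><contributor><username>Alice</username></contributor></revision>"
    "<revision><contributor><username>Bob</username></contributor></revision>"
    "</page>",
    "<page><title>B</title>"
    "<revision><contributor><username>Alice</username></contributor></revision>"
    "</page>",
]


def map_page(page_xml):
    """Map step: emit (username, 1) for every revision in one <page> element."""
    page = ET.fromstring(page_xml)
    for revision in page.iter("revision"):
        username = revision.findtext("contributor/username")
        if username is not None:
            yield username, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(pages):
    """Run map over every page, then reduce all emitted pairs."""
    pairs = (pair for page in pages for pair in map_page(page))
    return reduce_counts(pairs)


if __name__ == "__main__":
    print(run_job(SAMPLE_PAGES))
```

Because each page is mapped independently of every other page, the map step can be spread over as many machines as are available, which is what makes this kind of job so easy to parallelize.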

DiffDB

DiffIndexer and DiffSearcher are the two components of the DiffDB. The DiffIndexer takes as raw input the diffs generated by Wikihadoop and creates a Lucene-based index. The DiffSearcher allows you to query the index so you can answer questions such as:

  • Who has added template X in the last month?
  • Who added more than 2000 characters to user talk pages in 2008?
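
As an illustration of the second question, here is a minimal sketch that filters a handful of diff records in plain Python. It does not use Lucene or the DiffSearcher; the record fields (`user`, `namespace`, `timestamp`, `chars_added`) and the sample data are assumptions made for this example, and the question is read as "more than 2000 characters in total over the year". Namespace 3 is the MediaWiki namespace for user talk pages.

```python
from collections import defaultdict
from datetime import datetime

USER_TALK_NAMESPACE = 3  # MediaWiki namespace number for "User talk"

# Invented diff records; the field names are hypothetical.
DIFFS = [
    {"user": "Alice", "namespace": 3, "timestamp": "2008-03-01T12:00:00Z", "chars_added": 1500},
    {"user": "Alice", "namespace": 3, "timestamp": "2008-07-15T08:30:00Z", "chars_added": 900},
    {"user": "Bob", "namespace": 3, "timestamp": "2009-01-02T10:00:00Z", "chars_added": 5000},
    {"user": "Carol", "namespace": 0, "timestamp": "2008-05-05T09:00:00Z", "chars_added": 4000},
]


def heavy_user_talk_contributors(diffs, year, min_chars):
    """Return users whose total additions to user talk pages in `year` exceed `min_chars`."""
    totals = defaultdict(int)
    for diff in diffs:
        edited_at = datetime.strptime(diff["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        if diff["namespace"] == USER_TALK_NAMESPACE and edited_at.year == year:
            totals[diff["user"]] += diff["chars_added"]
    return sorted(user for user, total in totals.items() if total > min_chars)


if __name__ == "__main__":
    print(heavy_user_talk_contributors(DIFFS, year=2008, min_chars=2000))
```

A linear scan like this is fine for a few records; an index is what makes the same question practical over every edit in the dump.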

WikiPride

Volume of contributions by registered users on the English Wikipedia until December 2010, colored by account age

Finally, WikiPride allows you to visualize the breakdown of a Wikipedia community by age of account and by the volume of contributed content. You need a Toolserver account to run this, but you will be able to generate cool charts.
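
The aggregation behind a chart like this can be sketched in a few lines of Python. This is not WikiPride's code: the tuple layout (user, registration year, year of activity, bytes added) and the sample values are assumptions chosen only to show how contributions can be grouped by account age.

```python
from collections import defaultdict

# Invented records: (user, registration year, year of activity, bytes added).
CONTRIBUTIONS = [
    ("Alice", 2006, 2010, 1200),
    ("Bob", 2010, 2010, 300),
    ("Alice", 2006, 2009, 800),
    ("Carol", 2009, 2010, 500),
]


def volume_by_account_age(contributions):
    """Sum bytes added per year of activity, split by account age in years."""
    volume = defaultdict(lambda: defaultdict(int))
    for _user, registered, active, bytes_added in contributions:
        account_age = active - registered
        volume[active][account_age] += bytes_added
    return {year: dict(ages) for year, ages in volume.items()}


if __name__ == "__main__":
    for year, ages in sorted(volume_by_account_age(CONTRIBUTIONS).items()):
        print(year, ages)
```

Each inner dictionary corresponds to one time slice of the chart, with one colored band per account age.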

If you are having trouble getting Wikihadoop to run, please contact me at dvanliere at wikimedia dot org and I will be happy to point you in the right direction! Let the data crunching begin!

Diederik van Liere, Analytics Team

Wikimedia XML data sets released on Amazon Public Data Sets

For our community members who do analysis on Wikimedia project data, I’m happy to announce the release of our XML snapshots within Amazon Public Data Sets.

http://aws.typepad.com/aws/2009/09/new-public-data-set-wikipedia-xml-data.html

For those curious about why this happened: earlier this year I was approached multiple times by researchers and community members who wanted to parse our data but were frustrated by the cost and time of doing it on their own infrastructure.

Many of them were already familiar with academic computational clusters and were wondering if there were any similar solutions available for them to do large-scale processing. The Toolserver was one option, but sometimes it didn’t provide the level of flexibility that they needed and/or their projects were too computationally intensive to run alongside other tasks.

Instead, I mentioned that I had already been thinking of pushing our data sets into the Amazon cloud, as Amazon had a wealth of infrastructure in place, and it was infrastructure that our users were already familiar with.

Fast forward to now: we have our first release ready to be worked on, and I’m sure that we will hear back about new and exciting discoveries that our communities make.

This isn’t the first Wikipedia data set to exist in Public Data Sets but it will be the first that Wikimedia is committing to supporting on a regular release cycle. Amazon will be picking it up every month and retaining copies for at least three months.

I’m excited to see the stats of how many people use it.

–tomasz