For our community members who analyze Wikimedia project data, I’m happy to announce the release of our XML snapshots within Amazon Public Data Sets.
To those curious about why this happened: earlier this year I was approached multiple times by researchers and community members who wanted to parse our data but were frustrated by the cost and time of doing it on their own infrastructure.
Many of them were already familiar with academic computational clusters and were wondering whether any similar solutions were available for large-scale processing. The Toolserver was one option, but sometimes it didn’t provide the level of flexibility they needed, or their projects were too computationally intensive to run alongside other tasks.
Instead, I mentioned that I had already been thinking of pushing our data sets into the Amazon cloud, since Amazon had a wealth of infrastructure in place that our users were already familiar with.
Fast forward to now: we have our first release ready to be worked on, and I’m sure we will hear back about new and exciting discoveries that our communities make.
This isn’t the first Wikipedia data set to exist in Public Data Sets, but it will be the first that Wikimedia is committing to support on a regular release cycle. Amazon will pick it up every month and retain copies for at least three months.
I’m excited to see the stats of how many people use it.