This Sankey diagram shows how readers reach the English Wikipedia article about London and where they go from there, based on the Wikipedia Clickstream data set. Graph by Ellery Wulczyn and Dario Taraborelli, CC0.

Wikipedia and Wikimedia projects are among the most visited repositories of human knowledge. They are also a unique source of data for understanding how we collaborate to create that knowledge, access it and share it with others.

The Wikimedia Foundation’s Research and Data Team recently published several open data sets about Wikimedia projects, making them freely available to everyone – researchers, developers and community members – under a CC0 license. These aggregate data sets were collected to show general trends in how people use Wikimedia projects. As required by Wikimedia’s privacy policy, they do not include any personal information about users.

We invite you to turn this data into useful insights, applications and visualizations, and help our communities and projects thrive. If you have any questions about these releases, feel free to reach out to the Research and Data team via the Analytics mailing list or our #wikimedia-research channel on IRC.

Dario Taraborelli
Senior Research Scientist, Research and Data Team Lead
Wikimedia Foundation

Open Data Sets

Scholarly citations in Wikipedia
A data set of citations to scholarly articles in the English Wikipedia. Includes all citations with DOIs and PubMed identifiers added to Wikipedia articles as of the most recent content dump.
Halfaker, A., Taraborelli, D. (2015). Scholarly article citations in Wikipedia. figshare.
doi:10.6084/m9.figshare.1299540

Wikipedia clickstream
This data set shows how people get to a Wikipedia article and what links they click on next. The most recent release captures 22 million pairs (referer, resource), extracted from a total of 3.2 billion requests to the English Wikipedia. We wrote a step-by-step tutorial and IPython notebook to get you started with this data.
Wulczyn, E., Taraborelli, D. (2015). Wikipedia Clickstream. figshare.
doi:10.6084/m9.figshare.1305770
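
As a rough illustration of what the (referer, resource) pairs make possible, here is a minimal Python sketch that tallies where readers of one article come from and where they go next. It assumes a simplified tab-separated layout with three columns named referer, resource and count; those column names, the sample rows and the helper flows_for are hypothetical, and the actual release may be laid out differently, so check the tutorial before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in an assumed (referer, resource, count) layout.
# The numbers are made up; the real release's columns may differ.
SAMPLE_TSV = (
    "referer\tresource\tcount\n"
    "other-google\tLondon\t500\n"
    "England\tLondon\t120\n"
    "United_Kingdom\tLondon\t80\n"
    "London\tRiver_Thames\t60\n"
    "London\tLondon_Underground\t90\n"
    "Paris\tFrance\t40\n"
)


def flows_for(article, tsv_text):
    """Return (inbound, outbound) Counters of click counts for one article."""
    inbound, outbound = Counter(), Counter()
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        n = int(row["count"])
        if row["resource"] == article:
            # Someone arrived at the article from this referer.
            inbound[row["referer"]] += n
        if row["referer"] == article:
            # Someone left the article for this resource.
            outbound[row["resource"]] += n
    return inbound, outbound


inbound, outbound = flows_for("London", SAMPLE_TSV)
print("Top sources:", inbound.most_common(2))
print("Top destinations:", outbound.most_common(2))
```

The two counters are the inbound and outbound halves of a diagram like the one at the top of this post; for the full data set, the same loop can read from an open file instead of an in-memory string.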

Browser choices of Wikimedia users
This data set provides statistics on the top browsers and platforms used by readers and editors on Wikimedia projects, obtained from the Wikimedia HTTP request logs during a 90-day window. You can also explore this data online via this application.
Keyes, O. (2015). Browser Choices of Wikimedia Readers and Editors. figshare.
doi:10.6084/m9.figshare.1326739

Where in the world is Wikipedia?
This data set includes the proportion of traffic to Wikimedia projects originating from a specific country, computed from all HTTP requests collected over the course of 2014. You can also explore this data online via this application.
Keyes, O. (2015). Geographic Distribution of Wikimedia Traffic. figshare.
doi:10.6084/m9.figshare.1317408

Wikipedia Article Feedback corpus
The Article Feedback experiment invited readers to participate on Wikipedia by leaving comments on articles, to help editors improve them. This data set includes over 1.5 million messages posted to the English, French and German Wikipedia during the pilot.
Florin, F., Mullie, M., Taraborelli, D. (2014). Wikipedia Article Feedback corpus. figshare.
doi:10.6084/m9.figshare.1277784