
Data analytics

What are readers looking for? Wikipedia search data now available

(Update 9/20 17:40 PDT) We discovered that a small percentage of queries contained information unintentionally inserted by users. For example, some users may have pasted unintended information from their clipboards into the search box, causing that information to appear in the datasets. This prompted us to withdraw the files.

We are looking into the feasibility of publishing search logs at an aggregated level, but, until further notice, we do not plan on publishing this data in the near future.

Diederik van Liere, Product Manager Analytics

I am very happy to announce the availability of anonymous search log files for Wikipedia and its sister projects, as of today. Collecting data about search queries is important for at least three reasons:

  1. It provides valuable feedback to our editor community, who can use it to detect topics of interest that are currently insufficiently covered.
  2. It allows us to improve our search index by benchmarking improvements against real queries.
  3. It gives outside researchers the opportunity to discover gems in the data.

Peter Youngmeister (Ops team) and Andrew Otto (Analytics team) have worked diligently over the past few weeks to start collecting search queries. Every day from today, we will publish the search queries for the previous day at http://dumps.wikimedia.org/other/search/ (we expect to keep a 3-month rolling window of search data available).

Each line in the log files is tab-separated and contains the following fields:

  1. Server hostname
  2. Timestamp (UTC)
  3. Wikimedia project
  4. URL encoded search query
  5. Total number of results
  6. Lucene score of best match
  7. Interwiki result
  8. Namespace (coded as integer)
  9. Namespace (human-readable)
  10. Title of best matching article

The log files contain queries for all Wikimedia projects and all languages, and they are unsampled and anonymous. You can download a sample file. We collect data both from the search box on a wiki page, after the visitor submits the query, and from queries submitted on Special:Search pages. The search log data does not contain queries from the autocomplete search functionality, as that would generate too much data.
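To make the format concrete, here is a minimal parsing sketch in Python. It assumes the ten tab-separated fields appear in the order listed above; the field names and the file name are our own illustrations, not an official schema.

    import csv
    from urllib.parse import unquote_plus

    # Field names chosen to mirror the ten documented columns, in order.
    FIELDS = [
        "host", "timestamp", "project", "query", "num_results",
        "lucene_score", "interwiki", "ns_id", "ns_name", "best_title",
    ]

    def read_search_log(path):
        """Yield one dict per well-formed line of a tab-separated search log."""
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if len(row) != len(FIELDS):
                    continue  # skip malformed lines
                record = dict(zip(FIELDS, row))
                record["query"] = unquote_plus(record["query"])  # queries are URL encoded
                yield record

    # Example usage (the file name is hypothetical):
    # for rec in read_search_log("searchlog-sample.tsv"):
    #     print(rec["project"], rec["query"], rec["num_results"])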

Anonymous means that there is nothing in the data that allows you to map a query to an individual user: there are no IP addresses, no editor names, and not even anonymous tokens in the dataset. We also discard queries that contain email addresses, credit card numbers and social security numbers.
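As a rough illustration of that kind of filtering (this is not the Foundation's actual implementation), a sketch like the following would drop any query that looks like it contains an email address, a credit card number, or a US social security number:

    import re

    # Illustrative patterns only; the production filters may differ.
    SENSITIVE_PATTERNS = [
        re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),   # email address
        re.compile(r"\b(?:\d[ -]?){13,16}\b"),     # credit-card-like digit run
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # US social security number
    ]

    def is_publishable(query):
        """Return False if the query appears to contain sensitive data."""
        return not any(p.search(query) for p in SENSITIVE_PATTERNS)

    print(is_publishable("barack obama"))           # True
    print(is_publishable("my ssn is 123-45-6789"))  # False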

It’s our hope that people will use this data to build innovative applications that highlight topics Wikipedia does not currently cover, improve our Lucene parser, or uncover other hidden gems within the data. We know that most people use external search engines to search Wikipedia because our own search functionality does not always match their accuracy, and the new data could help give it a bit of much-needed TLC. If you’ve got search chops, have a look at our Lucene external contractor position.

We are making this data available under a CC0 license: this means that you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. But we do appreciate it if you cite us when you use this data source for your research, experimentation or product development.

Finally, please consider joining the Analytics mailing list or #wikimedia-analytics on Freenode (IRC). And of course you’re also very welcome to send me email directly.

Diederik van Liere, Product Manager Analytics

(Update 9/19 20:20 PDT) We’ve temporarily taken down this data to make additional improvements to the anonymization protocol related to the search queries.

Improving the accuracy of the active editors metric

We are making a change to our active editor metric to increase accuracy, by eliminating double-counting and including Wikimedia Commons in the total number of active editors. The active editors metric is a core metric for both the Wikimedia Foundation and the Wikimedia communities and is used to measure the overall health of the different communities. The total number of active editors is defined as:

the number of editors with the same registered username across different Wikimedia projects who made at least 5 edits in countable namespaces in a given month and are not registered as a bot user.

This is a conservative definition, but helps us to assess the size of the core community of contributors who update, add to and maintain Wikimedia’s projects.

The de-duplication consists of two changes:

  1. The total active editor count now includes Wikimedia Commons (increasing the count).
  2. Editors with the same username on different projects are counted as a single editor (decreasing the count).

The net result of these two changes is a decrease in the total number of active editors, averaging 4.4% over the last 3 years.
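To illustrate the two changes (under one reading of the definition above, where the five-edit threshold is applied per project, and with made-up numbers rather than real data), the de-duplicated count works roughly like this:

    from collections import defaultdict

    # Hypothetical per-project edit counts for one month:
    # (username, project, edits_in_countable_namespaces, is_bot)
    rows = [
        ("Alice", "enwiki",  7, False),
        ("Alice", "commons", 9, False),        # active on two projects, including Commons
        ("Bob",   "dewiki",  6, False),
        ("CleanupBot", "enwiki", 500, True),   # bot accounts are excluded
    ]

    per_project = defaultdict(set)   # project -> active usernames
    total_active = set()             # de-duplicated across projects

    for username, project, edits, is_bot in rows:
        if is_bot or edits < 5:
            continue
        per_project[project].add(username)
        total_active.add(username)

    print({p: len(users) for p, users in per_project.items()})
    # {'enwiki': 1, 'commons': 1, 'dewiki': 1} -- per-project counts are unchanged
    print(len(total_active))  # 2 -- Alice is counted once in the cross-project total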

De-duplication of the active editor count only affects our total number of active editors across the different Wikimedia projects; the counts within a single project are unaffected. We’ve also begun work on a data glossary as a canonical reference point for all key metrics used by the Wikimedia Foundation.


Meet the Analytics Team

Over the past few months, the Wikimedia Foundation has been gearing up a variety of new initiatives, and measuring success has been on our minds. It should come as no surprise that we’ve been building an Analytics Team at the same time. We are excited to finally introduce ourselves and talk about our plans.

The team is currently a pair of awesome engineers, David Schoonover and Andrew Otto, a veteran data analyst, Erik Zachte, and one humble product manager, Diederik van Liere. (We happen to be looking for a JavaScript engineer; if beautiful, data-driven client apps are your thing, or you know someone who fits the bill, drop us a line!)

We’ve got quite a few projects under way (and many more ideas), and we’d like to briefly go over them — expect more posts in the future with deeper details on each.

First up: a revamp of the Wikimedia Report Card. This dashboard gives an overview of key metrics representing the health and success of the movement: pageviews, unique visitors, number of active editors, and the like.

Illustration of the revamped Reportcard

The new report card is powered by Limn, a pure JavaScript GUI visualization toolkit we wrote. We wanted non-technical community members to be able to interact with the data directly, visualizing and exploring it themselves, rather than relying on us or analysts to give them a porthole into the deep. We hope that, as a drop-in component, it will contribute to democratizing data analysis (though we plan to use it extensively across projects ourselves). So play around with the report card data, or fork the project on GitHub!

Kraken: A Data Services Platform

But we have bigger plans. Epic plans. Mythical plans. A generic computational cluster for data analytics, which we affectionately call Kraken: a unified platform to aggregate, store, analyze, and query all incoming data of interest to the community, built so as to keep pace with our movement’s ample motivation and energy.

How many Android users are there in India that visit more than ten times per month? Is there a significant difference in the popularity of mobile OS’s between large cities and rural areas of India? Do Portuguese and Brazilian readers favour different content categories? How often are GLAM pictures displayed off-site, outside of Wikipedia (and where)?

As it stands, answering any of these questions is, at best, tedious and hard. Usually, it’s impossible. The sheer scale of the Wikimedia projects’ success is a double-edged sword: it makes even modest data analysis a significant task. This is something we aim to fix with Kraken.

More urgently, however, we don’t presently have infrastructure to do A/B testing, measure the impact of outreach projects, or give editors insight into the readers they reach with their contributions. From this view, the platform is a robust, unified toolkit for exploring these data streams, as well as a means of providing everyone with better information for evaluating the success of features large and small.

This points toward our overarching vision. Long-term, we aim to give the Wikimedia movement a true data services platform: a cluster capable of providing realtime insight into community activity and a new view of humanity’s knowledge to power applications, mash up into websites, and stream to devices.

Dream big!

Privacy: Counting not Tracking

The Kraken is a mythical Nordic monster with many tentacles, much like any analytics system: analytics touches everything — from instrumenting mobile apps to new user conversion analysis to counting parser cache lookups — and it needs a big gaping maw to keep up with all the data coming in. Unfortunately, history teaches us that mythical cephalopods aren’t terribly good at privacy. We aim to change that.

We’ve always had a strong commitment to privacy. Everything we store is covered by the Foundation’s privacy policy, and nothing we’re talking about here changes those promises. Kraken will be used to count stuff, not to track user behaviour. But in order to count, we need to store data, and we want you all to have a good idea of what we’re collecting and why; we will be specific and transparent about that.

We aim to be able to answer a multitude of questions using different data sources. Counts of visitors, page and image views, search queries, edits and new user registrations are just a few of the data streams currently planned; each will be annotated with metadata to make it easier to query. To take a few examples: page views will be tagged to indicate which come from bots, and traffic from mobile phones will be tagged as mobile. By counting these different types of events and adding these kinds of metadata tags, we will be able to better measure our progress towards the Strategic Plan.
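To make "counting, not tracking" concrete, here is a toy sketch of the kind of aggregation we have in mind. The field names are illustrative; the point is that only tagged event counts are kept, not per-user histories.

    from collections import Counter

    # Hypothetical stream of request events, already stripped of identifying
    # data and annotated with metadata tags.
    events = [
        {"kind": "pageview", "is_bot": False, "is_mobile": True},
        {"kind": "pageview", "is_bot": False, "is_mobile": False},
        {"kind": "pageview", "is_bot": True,  "is_mobile": False},
        {"kind": "search",   "is_bot": False, "is_mobile": True},
    ]

    # Aggregate by (kind, bot, mobile); no user identifiers are stored.
    counts = Counter((e["kind"], e["is_bot"], e["is_mobile"]) for e in events)

    for (kind, is_bot, is_mobile), n in sorted(counts.items()):
        print(f"{kind:8s} bot={is_bot} mobile={is_mobile} -> {n}")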

We’ll be talking a lot more about the technical details of the system we’re building, so check back if you’re interested, or reach out to us if you want to give feedback on how best to use the data to answer lots of interesting questions while still preserving users’ privacy. This post only scratches the surface, but we’ve got lots more to discuss.

Talk to Us!

Sound exciting? Have questions, ideas, or suggestions? Well then! Consider joining the Analytics mailing list or #wikimedia-analytics on Freenode (IRC). And of course you’re also very welcome to send me email directly.

Excited, and have engineering chops? Well then! We’re looking for a stellar engineer to help build a fast, intuitive, and beautiful toolkit for visualizing and understanding all this data. Check out the Javascript/UI Engineer job posting to learn more.

We’re definitely excited about where things are going, and we are looking forward to keeping you all up to speed on all our new developments.

Finally, we are hosting our first Analytics IRC office hours! Join us on July 30th, at 12pm PDT (3pm EDT / 9pm CEST) in #wikimedia-analytics to ask all your analytics and statistics related questions about Wikipedia and the other Wikimedia projects.

Best regards,

David Schoonover, Analytics Engineer
Andrew Otto, Analytics Engineer
Erik Zachte, Data Analyst
Diederik van Liere, Product Manager

US Education Program participants add three times as much quality content as regular new users

Wikipedia Education Program participants from the United States added more than three times as much quality content as regular new users, a quantitative analysis shows.

In the Wikipedia Education Program, professors assign their students to edit Wikipedia articles as a grade for class, assisted by volunteer Wikipedia Ambassadors. In fall 2011, 55 courses participated in the program in the United States, with students editing articles on the English Wikipedia. On average, these students added 1,855 bytes of content that stayed on Wikipedia, compared to only 491 bytes for a randomly chosen sample of new users who joined the English Wikipedia in September 2011. These numbers establish that students who participate in the Wikipedia Education Program contribute significantly more quality content that stays on Wikipedia than other new users.

Examining the distribution of content that survived on Wikipedia for both of these groups, we found that almost half of the Wikipedia Education Program participants added 1,000 or more bytes that stayed on Wikipedia in the first six months. In contrast, more than half of the random sample of new editors added no content that stayed on Wikipedia in the first six months. The targeted recruitment of students, combined with the support provided by the Ambassador Program and instructors, results in a much larger percentage of new editors who contribute quality content to Wikipedia.

To understand the collective impact of the Wikipedia Education Program in fall 2011, we compared the amount of content students added to Wikipedia to the content added by the random sample of new editors. The numbers show that the 920 student editors who participated in the program in fall 2011 added the same amount of content as 2,250 typical new editors (editors are defined as users who made at least one edit to an article). In terms of new content, students had more than twice the impact of typical new editors.

An important consideration for any outreach project is editor retention. Data showed that students who are introduced to editing Wikipedia through the U.S. Education Program are just as likely to continue editing as any other newcomer.

The Wikipedia Education Program has now grown to Egypt, Brazil and other regions beyond North America. With an increased global presence, measuring and understanding the contributions of new student editors (and how they differ from other new users that join Wikipedia) has gained importance. Establishing a common metric for measuring the impact of the Wikipedia Education Program on various Wikipedias is another key motivation for a quantitative study.

There’s a lot more work to be done on measuring the program’s impact. So, stay tuned for more information about these metrics.

Methodology for this research can be found at: http://meta.wikimedia.org/wiki/Research:Wikipedia_Education_Program_evaluation#Methods

Ayush Khanna, Data Analyst, Global Development

(with input from Mani Pande, Head of Global Development Research)

Techies learn, make, win at Foundation’s first San Francisco hackathon

Participants at the San Francisco hackathon in January 2012

In January, 92 participants gathered in San Francisco to learn about Wikimedia technology and to build things in our first Bay Area hackathon.

After a kickoff speech by Foundation VP of Engineering Erik Möller (video), we led tutorials on the MediaWiki web API, customizing wikis with JavaScript user scripts and Gadgets, and building the Wikipedia Android app.  (We recorded each training; click those links for how-to guides and videos.)  We asked the participants to self-organize into teams and work on projects.  After their demonstration showcase, judges awarded a few prizes to the best demos.
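To give a flavour of what the web API tutorial covered (this snippet is ours, not part of the training materials), here is a small example that runs a search query against the English Wikipedia's api.php endpoint:

    import json
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    # Search the English Wikipedia via the MediaWiki web API.
    params = urlencode({
        "action": "query",
        "list": "search",
        "srsearch": "hackathon",
        "format": "json",
    })
    req = Request(
        "https://en.wikipedia.org/w/api.php?" + params,
        headers={"User-Agent": "example-script/0.1 (demo)"},
    )
    with urlopen(req) as resp:
        data = json.load(resp)

    for hit in data["query"]["search"]:
        print(hit["title"])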


Do It Yourself Analytics with Wikipedia

As you probably know, we regularly publish backups of the different Wikimedia projects, containing their complete editing history. As time progresses, these backups grow larger and larger and become increasingly hard to analyze. To help the community, researchers and other interested people, we have developed a number of analytic tools to assist you in analyzing these large datasets. Today, we want to update you on these new tools, what they do and where you can find them. And please remember they are all still in development:

  • Wikihadoop
  • Diffdb
  • WikiPride

Wikihadoop

Wikihadoop makes it possible to run MapReduce jobs with Hadoop on the compressed XML dump files. This means that processing our XML files becomes embarrassingly parallel, so we no longer have to wait days or weeks for a job to finish.

We used Wikihadoop to create the diffs for all edits from the English XML dump that was generated in April of this year.
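WikiHadoop's job is to split the dump so that this work can run in parallel; the per-edit step itself is just a diff between consecutive revision texts. A minimal stand-in for that step (using Python's difflib rather than the pipeline's actual diff routine) looks like this:

    import difflib

    def revision_diff(old_text, new_text):
        """Return a unified diff between two consecutive revision texts."""
        return "\n".join(difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="previous revision",
            tofile="current revision",
            lineterm="",
        ))

    old = "Wikipedia is an encyclopedia.\nIt has many articles."
    new = "Wikipedia is a free encyclopedia.\nIt has many articles.\nAnyone can edit it."
    print(revision_diff(old, new))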

DiffDB

DiffIndexer and DiffSearcher are the two components of the DiffDB. The DiffIndexer takes as raw input the diffs generated by Wikihadoop and creates a Lucene-based index. The DiffSearcher allows you to query the index so you can answer questions such as:

  • Who has added template X in the last month?
  • Who added more than 2000 characters to user talk pages in 2008?

WikiPride

Volume of contributions by registered users on the English Wikipedia until December 2010, colored by account age

Finally, WikiPride allows you to visualize the breakdown of a Wikipedia community by age of account and by the volume of contributed content. You need a Toolserver account to run this, but you will be able to generate cool charts.

If you are having trouble getting Wikihadoop to run, then please contact me at dvanliere at wikimedia dot org and I am happy to point you in the right direction! Let the data crunching begin!

Diederik van Liere, Analytics Team

Data analytics at Wikimedia Foundation

This post is a follow-on to my previous post, “What is Platform Engineering?”. In this post, I’ll describe the history of our analytics work, talk about how we derive and distribute our statistics, and ask you to join us in building our platform. Summary: we’re hiring, and we want to tell you what a great opportunity this is.

Our Data Analytics team is responsible for building out our logging and data-mining infrastructure, and for making Wikimedia-related statistics useful to other parts of the Foundation and the movement. Until fairly recently, Erik Zachte has been the main analytics person for Wikimedia (with support from many generalists here), working first as a volunteer building stats.wikimedia.org, then on behalf of the Wikimedia Foundation starting in 2008. The site started off as a large collection of detailed page view and editor statistics about all Wikimedia wikis, large and small, and has since been augmented with various summary formats and visualizations. As the movement has grown, it has played an increasingly important role in helping guide our investments.

Announcing the WikiChallenge Winners

Wikipedia Participation Challenge

Over the past couple of months, the Wikimedia Foundation, Kaggle and ICDM organized a data competition. We asked data scientists around the world to use Wikipedia editor data and develop an algorithm that predicts the number of future edits, and in particular predicts correctly who will stop editing and who will continue to edit.

The response has been great! We had 96 teams compete, comprising in total 193 people who jointly submitted 1029 entries. You can have a look for yourself at the leaderboard.

We are very happy to announce that the brothers Ben and Fridolin Roth (team prognoZit) developed the winning algorithm. It is elegant, fast and accurate. Using Python and Octave, they developed a linear regression algorithm that uses 13 features (2 based on reverts and 11 based on past editing behavior) to predict future editing activity. Both the source code and the wiki description of their algorithm are available. Congratulations to Ben and Fridolin!
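To give a flavour of the approach (this is our own bare-bones sketch, not prognoZit's code, and the feature values are invented), "predict future edits from past activity and reverts with linear regression" can be boiled down to a few lines:

    import numpy as np

    # Hypothetical training data: one row per editor.
    # Features: [edits in the last 5 months, edits in the last month, times reverted]
    X = np.array([
        [120, 30, 2],
        [ 15,  0, 5],
        [ 60, 10, 1],
        [  3,  0, 0],
        [200, 45, 4],
    ], dtype=float)
    # Target: number of edits in the following 5 months.
    y = np.array([95.0, 0.0, 40.0, 1.0, 150.0])

    # Ordinary least squares with an intercept term.
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

    new_editor = np.array([1.0, 80.0, 12.0, 3.0])  # intercept + features
    print("predicted future edits:", new_editor @ coef)

The real entry's 13 features were variations on these same two themes: revert history and past editing activity.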

Second place goes to Keith Herring. Submitting only 3 entries, he developed a highly accurate random forest model that uses a total of 206 features. His model shows that a randomly selected Wikipedia editor who has been active in the past year has approximately an 85 percent probability of being inactive (no new edits) in the next 5 months. The most informative features captured both the timing and the volume of an editor’s edits. Asked about his reasons for entering the challenge, Keith cited his fascination with datasets and added:

“I have a lot of respect for what Wikipedia has done for the accessibility of information. Any small contribution I can make to that cause is in my opinion time well spent.”

We also have two Honourable Mentions for participants who used only open source software. The first Honourable Mention goes to Dell Zang (team zeditor), who used a machine learning technique called gradient boosting. His model mainly uses recent past editor activity.

The second Honourable Mention goes to Roopesh Ranjan and Kalpit Desai (team Aardvarks). Using Python and R, they also developed a random forest model. Their model used 113 features, mainly based on the number of reverts and past editor activity; see the wikipage describing their model.

All the documentation and source code have been made available; the main entry page is WikiChallenge on Meta.

What the four winning models have in common is that past activity and how often an editor is reverted are the strongest predictors of future editing behavior. This confirms our intuitions, but the fact that the four models are quite similar in terms of the data they used is a testament to the importance of these factors.

We want to congratulate all the winners, as they have shown us, in a quantitative way, which factors matter in predicting editor retention. We also hope that people will continue to investigate the training dataset and keep refining their models so we get an even better understanding of the long-term dynamics of the Wikipedia community.

We are looking forward to using the algorithms of Ben &amp; Fridolin and Keith in a production environment, and particularly to seeing whether we can forecast the cumulative number of edits.

Finally, we want to thank the Kaggle team for their help in organizing this competition, and our anonymous donor, who generously donated the prizes.

Diederik van Liere
External Consultant, Wikimedia Foundation

Howie Fung
Senior Product Manager, Wikimedia Foundation

2011-10-26: Edited to correct description of the winning algorithm

Three weeks left in the Wikipedia Participation Challenge

There are still three weeks left in the Wikipedia Participation Challenge (see our prior blog post)! So far, the competition has exceeded our expectations. As of this morning, 78 teams (167 individuals in total) from across the world have participated in the competition, with a total of 735 entries submitted. Half of these teams have beaten the benchmark we set at the beginning of the competition, which is a testament to the quality of the teams and their submissions. We can’t wait to see what great algorithms the participants are developing.

There’s still time to jump in before the competition closes on September 20, 2011, so if you haven’t done so, download the data and start crunching.  And those who want to cheer from the sidelines may follow the competition on Kaggle’s leaderboard.

Howie Fung
Senior Product Manager, Wikimedia Foundation

Diederik van Liere
Research Consultant, Wikimedia Foundation

“Rate this Page” is Coming to the English Wikipedia

Since May, the Article Feedback Tool has been available on 100,000 English Wikipedia articles (see blog post). We have now kicked off full deployment to the English Wikipedia at a rate of about 370,000 articles per day and will continue at this rate until deployment is complete.

Ratings interface for the Article Feedback Tool

We wanted to take a moment to briefly recap what we’ve learned so far, what lies ahead, and how we can work with the community to improve this feature. Features like Article Feedback can always be improved, so we will continue to experiment, measure, and iterate based on user and community feedback, testing, and analysis of how the feature is being used.

Rating data from the tool is available for your analysis — please dig in and let us know what you find. Toolserver developers can also access the rating data (minus personal information) in real-time to develop new dashboards and views of the data.

What We’ve Learned So Far

Readers like to provide feedback. The survey we’re currently running shows that over 90% of users find the ratings useful. Many of these raters see the tool as a way to participate in article development; when asked why they rated an article, over half reported wanting to “positively affect the development of the page.”

Users of the feedback tool also left some enthusiastic comments (as well as some critical ones) about the tool. For example:

The option to rate a page should be available on every page, all the time, once per page per user per day.

As a high school librarian, I want my students to assess the sources of information they use.  This feature forces them to consider the reliability of Wiki articles.  Glad you have it.

Ratings seem like an interesting idea, I feel like the metrics used to determine the overall value of the page are viable, and I’ll be interested to see how the feature fares when it’s rolled out and has some miles under its belt.

The vast majority of raters were previously only readers of Wikipedia.  Of the registered users that rated an article, 66% had no prior editing activity.  For these registered users, rating an article represents their first participatory activity on Wikipedia.  These initial results show that we are starting to engage these users beyond just passive reading, and they seem to like it.

The feature brings in editors. One of the main Strategic Goals for the upcoming year is to increase the number of active editors contributing to WMF projects.  The initial data from the Article Feedback tool suggests that reader feedback could become a meaningful point of entry for future editors.

Once users have successfully submitted a rating, a randomly selected subset of them are shown an invitation to edit the page. Of the users who were invited to edit, 17% attempted to edit the page, and 15% of those ended up successfully completing an edit (roughly 2.5% of all invited users). These results strongly suggest that a feedback tool could successfully convert passive readers into active contributors to Wikipedia. A rich text editor could make this path to editing even more promising.

While these initial results are certainly encouraging, we need to assess whether these editors are, in fact, improving Wikipedia.  We need to measure their level of activity, the quality of their contributions, their longevity, and other characteristics.

Ratings are a useful measure of some dimensions of quality. In its current form, the Article Feedback Tool appears to provide useful feedback on some dimensions of quality, while the usefulness of feedback on other dimensions is still an open research question. Completeness and Trustworthy (formerly “Well-Sourced”) appear to be dimensions where readers can provide a reasonable level of assessment; research shows that ratings along these dimensions are correlated with article length and the number of citations, respectively. We still need to determine whether the ratings for “Objective” and “Well-Written” meaningfully predict quality in those categories. We have released public dumps of AFT data and would love to hear about new approaches to measuring how well ratings reflect article quality.
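If you want to probe these correlations yourself with the public AFT dumps, the core calculation is just a correlation between the mean rating for a dimension and an article property; the numbers below are placeholders, not real dump data.

    import math

    # Placeholder data: (mean Completeness rating, article length in bytes) per article.
    articles = [
        (2.1,  1200),
        (3.4,  5800),
        (4.2, 14300),
        (3.0,  4100),
        (4.6, 22100),
    ]

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    ratings = [r for r, _ in articles]
    lengths = [l for _, l in articles]
    print("rating vs. length correlation:", round(pearson(ratings, lengths), 3))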

We received feedback from community members on how to improve the feature. We’ve received a fair amount of feedback from the community on the usefulness of AFT, mainly through IRC Office Hours and on the AFT discussion page.  There have been many suggestions on how to make the feedback tool more valuable for the community.  For example, the idea of having a “Suggestions for Improvement”-type comment box has been raised several times.  Such a box would enable readers to provide concrete feedback directly to the editing community on how to improve an article.  We plan to develop some kind of commenting system in the near future.

Illustration of a potential "Suggest Improvements" feature

AFT could help surface problematic articles in real time, as well as articles that may qualify for increased visibility. We’ve started experimenting with a dashboard for surfacing both highly rated and poorly rated articles. Ultimately, the dashboard could help identify articles that need attention (e.g., articles that have been recently vandalized) as well as articles that might be considered for increased visibility (e.g., candidates for Featured Articles). We will continue to experiment with algorithms that help surface trends in articles that may be useful for the editing community.

Next Steps

Over the coming weeks, we will continue to roll out the Article Feedback Tool on the English Wikipedia.  Once this rollout is complete, we will start planning the next version of the tool.  For those interested in following the discussion, we will be documenting progress on the Article Feedback Project Page.  We would love to get your feedback (pun intended!) on how the feature is being used, what’s working, and what might be changed.  We also encourage folks to dig into the data.  Once the feature is fully deployed, there will be mountains of data to sift through and analyze, which will be a boon to researchers and developers alike.

We’d especially like to encourage members of the community to get involved in the further development of the feature.  If you’re interested in getting involved (e.g., design input, data analysis/interpretation, bug-squashing, etc.), please drop a note on the project talk page.

Howie Fung, Senior Product Manager

Dario Taraborelli, Senior Research Analyst

Erik Moeller, VP Engineering and Product Development