Announcing the WikiChallenge Winners

Wikipedia Participation Challenge

Over the past couple of months, the Wikimedia Foundation, Kaggle and ICDM organized a data competition. We asked data scientists around the world to use Wikipedia editor data to develop an algorithm that predicts the number of future edits and, in particular, correctly identifies who will stop editing and who will continue to edit.

The response has been great! We had 96 teams compete, comprising 193 people in total, who jointly submitted 1,029 entries. You can have a look for yourself at the leaderboard.

We are very happy to announce that the brothers Ben and Fridolin Roth (team prognoZit) developed the winning algorithm. It is elegant, fast and accurate. Using Python and Octave, they developed a linear regression algorithm that uses 13 features (2 based on reverts and 11 based on past editing behavior) to predict future editing activity. Both the source code and the wiki description of their algorithm are available. Congratulations to Ben and Fridolin!
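Purely as an illustration of the general shape of such a model (with synthetic data and made-up features, not the prognoZit code), a least-squares regression over per-editor features can be sketched in a few lines of Python:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-in for the training data: one row per editor and
    # 13 hand-crafted features (e.g. revert counts, edits per past period).
    n_editors, n_features = 1000, 13
    X = rng.poisson(lam=3.0, size=(n_editors, n_features)).astype(float)

    # Target: edits made in the prediction window (synthetic here; in the
    # challenge it came from the held-out part of each editor's history).
    y = 0.5 * X[:, 2:].sum(axis=1) - 0.8 * X[:, :2].sum(axis=1)
    y = np.clip(y + rng.normal(scale=2.0, size=n_editors), 0, None)

    # Ordinary least squares with an intercept column.
    X_design = np.hstack([np.ones((n_editors, 1)), X])
    coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

    # Predict future edit counts for unseen editors.
    X_new = rng.poisson(lam=3.0, size=(5, n_features)).astype(float)
    print(np.hstack([np.ones((5, 1)), X_new]) @ coef)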

Second place goes to Keith Herring. Submitting only 3 entries, he developed a highly accurate random forest model that uses a total of 206 features. His model shows that a randomly selected Wikipedia editor who has been active in the past year has approximately an 85 percent probability of being inactive (no new edits) in the next 5 months. The most informative features captured both the timing and the volume of an editor's edits. Asked about his reasons for entering the challenge, Keith cited his fascination with datasets, adding:

“I have a lot of respect for what Wikipedia has done for the accessibility of information. Any small contribution I can make to that cause is in my opinion time well spent.”

We also have two Honourable Mentions for participants who only used open source software. The first Honourable Mention goes to Dell Zang (team zeditor), who used a machine learning technique called gradient boosting. His model mainly uses editors' recent past activity.
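Gradient boosting fits a sequence of small decision trees, each one correcting the residual errors of the ensemble built so far. As a rough, self-contained sketch of the technique (synthetic data and hypothetical features, not Dell Zang's actual model):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)

    # Synthetic stand-in for recent-activity features, e.g. edits made in
    # each of the last six months before the prediction cut-off.
    X = rng.poisson(lam=2.0, size=(1000, 6)).astype(float)
    y = 0.8 * X[:, -1] + 0.4 * X[:, -2] + rng.normal(scale=0.5, size=1000)

    model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                      learning_rate=0.05)
    model.fit(X, y)
    print(model.predict(X[:5]))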

The second Honourable Mention goes to Roopesh Ranjan and Kalpit Desai (team Aardvarks). Using Python and R, they too developed a random forest model. It uses 113 features, mainly based on the number of reverts and past editor activity; see the wiki page describing their model.
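Like Keith Herring's entry, this model is a random forest: an ensemble of decision trees trained on bootstrap samples, whose per-feature importances also hint at which inputs carry the signal. A minimal sketch of that kind of model, again with synthetic data and hypothetical features rather than either team's code:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    # Synthetic stand-in for revert- and activity-based editor features.
    X = rng.poisson(lam=2.0, size=(1000, 20)).astype(float)
    y = 0.3 * X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=1000)

    forest = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)
    forest.fit(X, y)

    # Importances show how strongly each feature drives the prediction,
    # e.g. revert counts relative to raw edit volume.
    print(forest.feature_importances_.round(3))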

All the documentation and source code have been made available; the main entry page is WikiChallenge on Meta.

What the four winning models have in common is that past activity and how often an editor is reverted are the strongest predictors of future editing behavior. This confirms our intuitions, but the fact that the winning models are quite similar in terms of the data they used is a testament to the importance of these factors.

We want to congratulate all the winners: they have shown us, in a quantitative way, which factors matter most in predicting editor retention. We also hope that people will continue to investigate the training dataset and keep refining their models, so that we get an even better understanding of the long-term dynamics of the Wikipedia community.

We are looking forward to using the algorithms of Ben & Fridolin and Keith in a production environment, and in particular to seeing whether we can forecast the cumulative number of edits.

Finally, we want to thank the Kaggle team for their help in organizing this competition, and our anonymous donor, who generously donated the prizes.

Diederik van Liere
External Consultant, Wikimedia Foundation

Howie Fung
Senior Product Manager, Wikimedia Foundation

2011-10-26: Edited to correct the description of the winning algorithm

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.


9 Comments

Congrats to everyone involved, a fun competition, and I hope of great benefit to Wikipedia. I want to point out a typo, though: Benjamin and Fridolin Roth did not use random forests, but rather vanilla/standard linear regression. Unfortunately, there turned out to be a mistake in the data set construction; it was not properly randomized. As such, Benjamin and Fridolin, through no fault of their own, stumbled upon this error and used it as the primary input to their model. This error happens to be equivalent to knowing a large fraction of the answer, future edits, before predicting it. Unfortunately this means… Read more »

The winning algorithm’s page directs to the source code on dumps.wikimedia.org, but that URL yields a 404. 🙁


“I have a lot of respect for what Wikipedia has done for the accessibility of information. Any small contribution I can make to that cause is in my opinion time well spent.” Good!

So wait. What did we learn about the editors? This post is really detailing all the fancy shmancy models and such. But what were the key insights about “types of editors likely to leave/stay”? Imagine I am a CEO and want the “take away”? Is there something that you can express in terms of easy to understand descriptions of behavior? And perhaps it gives us even deeper insights about why X leads to Y?

Teh IP: That’s a thoughtful question. It’s generally the case that more complex models (which many of these winning entries use) tend to be less explainable/understandable. More important for the WikiChallenge, though, is that prediction is almost entirely dominated by features based on the times/rates/etc. at which these editors have historically made edits, essentially capturing (obvious) intuitions like, “editors who have been making a steady rate of edits in recent months are more likely to stay.” (I’m a competitor who finished in 8th place and read the winners’ write-ups.)

Thanks. That was at least an insight. 😉 I wonder what input factors you had to consider? If it is just a giant number-churning exercise based on the “easy to get out of the database” aspects, then you may not be considering the factors that would give you the strongest insights. Demographics, for instance. Also perhaps interaction (are there certain people who are “attractors/motivators” and others who are “driving people away”)? I would not give up on finding this info out either. Perhaps fancy shmancy neural nets are not the way to get key insights, but instead doing some… Read more »

Could the admin of the website recheck the URL for the source code? For the purpose of learning, I would like to see the whole project.
In other words, some of the URLs for the source code are broken (http://dumps.wikimedia.org/other/wikichallenge/). Could you re-upload them?