Announcing the WikiChallenge Winners

Wikipedia Participation Challenge

Over the past couple of months, the Wikimedia Foundation, Kaggle and ICDM organized a data competition. We asked data scientists around the world to use Wikipedia editor data to develop an algorithm that predicts the number of future edits, and in particular correctly predicts who will stop editing and who will continue to edit.

The response has been great! We had 96 teams compete, comprising a total of 193 people who jointly submitted 1029 entries. You can have a look for yourself at the leaderboard.

We are very happy to announce that the brothers Ben and Fridolin Roth (team prognoZit) developed the winning algorithm. It is elegant, fast and accurate. Using Python and Octave, they developed a linear regression algorithm. They used 13 features (2 based on reverts and 11 based on past editing behavior) to predict future editing activity. Both the source code and the wiki description of their algorithm are available. Congratulations to Ben and Fridolin!
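To give a feel for the approach, here is a minimal sketch of how a linear regression of this kind can be set up in Python with NumPy. The features and numbers below are invented placeholders for illustration only, not the actual prognoZit features, which are documented on the wiki page mentioned above.

```python
import numpy as np

# Hypothetical feature matrix: one row per editor, one column per feature
# (e.g. edits in the last month, edits in the last year, days since last
# edit, times reverted). These are stand-ins for the 13 prognoZit features.
X = np.array([
    [120.0, 900.0,   2.0, 4.0],
    [  3.0,  50.0, 200.0, 1.0],
    [  0.0, 400.0, 340.0, 0.0],
    [ 45.0, 130.0,   7.0, 9.0],
    [ 80.0, 600.0,  10.0, 3.0],
    [  1.0,  20.0, 300.0, 0.0],
])

# Target: number of edits each editor made in the following five months.
y = np.array([150.0, 2.0, 0.0, 60.0, 95.0, 1.0])

# Ordinary least squares, with an intercept column appended to X.
X_b = np.hstack([X, np.ones((X.shape[0], 1))])
weights, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Predict future edit activity for a new editor (trailing 1.0 is the intercept).
new_editor = np.array([10.0, 200.0, 30.0, 2.0, 1.0])
print(f"Predicted future edits: {new_editor @ weights:.1f}")
```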

Second place goes to Keith Herring. Submitting only 3 entries, he developed a highly accurate model using random forests and a total of 206 features. His model shows that a randomly selected Wikipedia editor who has been active in the past year has approximately an 85 percent probability of being inactive (no new edits) in the next 5 months. The most informative features captured both the edit timing and volume of an editor. Asked why he entered the challenge, Keith cited his fascination with datasets and said that

“I have a lot of respect for what Wikipedia has done for the accessibility of information. Any small contribution I can make to that cause is in my opinion time well spent.”

We also have two Honourable Mentions for participants who only used open source software. The first Honourable Mention is for Dell Zang (team zeditor), who used a machine learning technique called gradient boosting. His model mainly uses recent past editor activity.
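For readers unfamiliar with the technique: gradient boosting builds an ensemble of shallow decision trees, each one fitted to the residual errors of the trees before it. The snippet below is only an illustrative sketch using scikit-learn and made-up activity features; it is not the zeditor model itself, which is documented on the challenge pages.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical recent-activity features per editor: edits in the last
# 1, 3 and 12 months, and days since the most recent edit.
X = rng.poisson(lam=[5, 15, 60, 40], size=(500, 4)).astype(float)

# Synthetic target: future edits, loosely driven by recent activity.
y = 0.8 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 2, size=500)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  learning_rate=0.05)
model.fit(X, y)

# Predicted future edits for one editor, and which features mattered most.
print(model.predict([[4.0, 12.0, 55.0, 10.0]]))
print(model.feature_importances_)
```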

The second Honourable Mention is for Roopesh Ranjan and Kalpit Desai (team Aardvarks). Using Python and R, they developed a random forest model as well. Their model used 113 features, mainly based on the number of reverts and past editor activity; see the wiki page describing their model.
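A random forest, by contrast, averages many decision trees, each grown on a bootstrap sample of the editors and a random subset of the features. Here is a minimal, self-contained sketch with scikit-learn; the handful of invented features stand in for the 113 the Aardvarks team (and the 206 Keith Herring) actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Hypothetical features per editor: revert counts plus past-activity measures.
X = rng.integers(0, 100, size=(1000, 6)).astype(float)

# Synthetic label: 1 if the editor keeps editing, 0 if they go inactive.
# Activity here is loosely tied to the first two (recent-activity) columns.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 10, size=1000) > 80).astype(int)

forest = RandomForestClassifier(n_estimators=300, random_state=1)
forest.fit(X, y)

# Estimated probability that one editor remains active in the next 5 months,
# followed by the relative importance of each feature.
print(forest.predict_proba([[12.0, 30.0, 5.0, 0.0, 2.0, 7.0]])[0, 1])
print(forest.feature_importances_)
```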

All the documentation and source code has been made available; the main entry page is WikiChallenge on Meta.

What the four winning models have in common is that past activity and how often an editor is reverted are the strongest predictors of future editing behavior. This confirms our intuitions, but the fact that the winning models are quite similar in terms of what data they used is a testament to the importance of these factors.

We want to congratulate all winners, as they have shown us, in a quantitative way, which factors are important in predicting editor retention. We also hope that people will continue to investigate the training dataset and keep refining their models so we get an even better understanding of the long-term dynamics of the Wikipedia community.

We are looking forward to using the algorithms of Ben & Fridolin and Keith in a production environment, and particularly to seeing if we can forecast the cumulative number of edits.

Finally, we want to thank the Kaggle team for helping organize this competition and our anonymous donor, who generously donated the prizes.

Diederik van Liere
External Consultant, Wikimedia Foundation

Howie Fung
Senior Product Manager, Wikimedia Foundation

2011-10-26: Edited to correct description of the winning algorithm

6 Comments on Announcing the WikiChallenge Winners

Teh IP 3 years

Thanks. That was at least an insight. ;-)

I wonder what input factors you had to consider? If it is just a giant number-churning exercise based on the “easy to get out of the database” aspects, then you may not be considering factors that would give you the strongest insights. Demographics, for instance. Also perhaps interaction (are there certain people that are “attractors/motivators” and others that are “driving people away”)?

I would not give up on finding this info out either. Perhaps fancy shmancy neural nets are not the way to get key insights, but instead doing some semi-quantitative work based on cases, interviews, etc. would lead to better insights. I also find that doing this sort of thing can often generate new hypotheses and ideas, because you just learn things and generate new “factors” from getting a little dirty looking up close and personal at some cases.

It’s comparable to doing a manufacturing optimization problem. We could just use what data we have (now, easy, and in big numbers) from the shop floor computer system. That will give us some things…it will probably be robust (if you do out-of-sample confirmations). And you MIGHT learn something counter-intuitive, just because tribal wisdom doesn’t match the numbers. But there is a pretty decent chance you might not be tracking a key feature. Also a good chance that there may be some problems somewhere in data input/tracking etc. that get missed…if it is all an arm’s-length math game.

The next step up from that might be to get on the whiteboard and brainstorm, list ideas, diagram the process, make those fishbone causation charts, etc. and just list some possible factors. Then, after doing that, consider which ones have easy data to check on and which don’t. And if there are some that you think may be very critical…but lack data on…then you figure out some workarounds (e.g. cases) to at least improve your Bayesian hunches and try to validate/invalidate the hypotheses.

Perhaps even more of a step up is to do a walkthrough of the assembly line (or even take a shift or two and build the product personally!) This will probably give a few new ideas…and it may give insights on how to prioritize which factors to investigate, spark ideas for how to do workarounds where the factors are not tracked in the database, etc.

Of course, all the processes are important. And really it is a feedback loop as well.

That said…I’m still blown away by how remote from actionable insights the work here was. I went to the winner page and the discussion was all matrices and rows and set theory and such…nothing like…F=MA or V=IR or the like. ;-)

Sorry if this is all motherhood and apple pie, but I felt it needed to be said. And I DO appreciate one person (you) boiling down some of the learning.

Yang 3 years

Teh IP: That’s a thoughtful question. It’s generally the case that more complex models (which many of these winning entries use) tend to be less explainable/understandable. More important for the WikiChallenge, though, is that prediction is almost entirely dominated by features based on the times/rates/etc. at which these editors have historically made edits, essentially capturing (obvious) intuitions like, “editors who have been making a steady rate of edits in recent months are more likely to stay.” (I’m a competitor who finished in 8th place and read the winners’ writeups.)

Teh IP 3 years

So wait. What did we learn about the editors? This post really details all the fancy shmancy models and such. But what were the key insights about “types of editors likely to leave/stay”? Imagine I am a CEO and want the “takeaway”. Is there something that you can express in terms of easy-to-understand descriptions of behavior? And perhaps it gives us even deeper insights about why X leads to Y?

neitway 3 years

“I have a lot of respect for what Wikipedia has done for the accessibility of information. Any small contribution I can make to that cause is in my opinion time well spent.” Good!

Asaf 3 years

The winning algorithm’s page directs to the source code on dumps.wikimedia.org, but that URL yields a 404. :(

KTH 3 years

Congrats to everyone involved, a fun competition, and I hope of great benefit to Wikipedia.

I want to point out a typo though: Benjamin and Fridolin Roth did not use random forests, but rather vanilla/standard linear regression. Unfortunately there turned out to be a mistake in the data set construction; it was not properly randomized.

As such, Benjamin and Fridolin, through no fault of their own, stumbled upon this error and used it as the primary input to their model. This error happens to be equivalent to knowing a large fraction of the answer, future edits, before predicting it. Unfortunately this means that their model is invalid and can’t be used by Wikipedia to understand participation. See the following link for more detail:

http://www.kaggle.com/forums/t/980/wikipedia-participation-challenge-an-unfortunate-ending
