Data Competition: Announcing the Wikipedia Participation Challenge

We are pleased to announce the launch of the Wikipedia Participation Challenge, a data modeling competition to develop an algorithm that predicts future editing activity on Wikipedia. The competition is hosted by Kaggle, a platform for data modeling and prediction competitions.  The Participation Challenge is open to community members and anyone else who is interested in analyzing Wikipedia data.  This is the first of two data competitions the Wikimedia Foundation will sponsor this year.

The goal of this competition is to gain a better understanding of the factors that encourage or discourage people from editing Wikipedia. Increasing the number of active editors is one of our strategic priorities. Both the Wikipedia communities and the Wikimedia Foundation stand to benefit from models that quantify the factors that determine whether a Wikipedia editor is likely to continue contributing. The competition asks contestants to develop a model to predict the number of edits a given editor will make in six months’ time.
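To make the task concrete, here is a minimal sketch of the kind of naive baseline a contestant might start from: it predicts that each editor will make as many edits in the coming period as in the most recent one. The function name `baseline_predict`, the (editor, timestamp) input format, the 182-day window, and the toy data are all assumptions for illustration; they are not the competition's official data format or scoring method.

```python
from collections import Counter
from datetime import datetime, timedelta


def baseline_predict(edit_times, cutoff, window_days=182):
    """Predict future edits per editor as their edit count in the
    window_days immediately before the cutoff (a persistence baseline)."""
    start = cutoff - timedelta(days=window_days)
    counts = Counter()
    for editor, timestamp in edit_times:
        # Only edits inside the most recent window contribute.
        if start <= timestamp < cutoff:
            counts[editor] += 1
    return counts


# Hypothetical toy edit log: (editor, time of edit).
edits = [
    ("alice", datetime(2011, 3, 1)),
    ("alice", datetime(2011, 5, 15)),
    ("bob", datetime(2010, 1, 10)),   # too old, falls outside the window
    ("bob", datetime(2011, 6, 1)),
]
cutoff = datetime(2011, 6, 28)
predictions = baseline_predict(edits, cutoff)
```

Editors with no recent edits get a prediction of zero, since a `Counter` returns 0 for missing keys. A real entry would replace this with a model that also weighs other signals, but a baseline like this gives a reference point to beat.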

The data used in this competition comes from the publicly available English Wikipedia XML data dump.  An anonymous donor has generously contributed $10,000 as prize money. There will be a Grand Prize for the best prediction, as well as special prizes awarded for the use of open source software. The Grand Prize winner will also be given the opportunity to present their prediction model at the 2011 IEEE International Conference on Data Mining.  The competition starts today and will continue until September 20, 2011.

Head over to our competition portal, download the data, and start crunching! And don’t forget to follow us on Twitter: #wikichallenge and @dvanliere.

Howie Fung
Senior Product Manager, Wikimedia Foundation

Diederik van Liere
Research Consultant, Wikimedia Foundation


6 Comments on Data Competition: Announcing the Wikipedia Participation Challenge

Tom Cloyd MS MA 3 years

Jason, what you offer is speculation, not data modeling, which is a whole different game. Your cynicism and preemptive, gratuitous over-generalizations are unwarranted. Data modeling is about trying to produce an abstract representation of patterns that demonstrably ARE in the data. The validity of a model is measurable, in several ways. The interpretation is dependent upon what additional assumptions one brings to the model, and like all argumentation is open to analysis and critique.

Why do you appear to assume bad faith relative to this whole initiative? The WMF people I know are earnest, eager, serious, hard working, and want real answers. That’s way better than free-floating gratuitous over-generalizations.

This project is a terrific one. Someone is willing to pay anyone in the community who can solve a real problem, which when solved, can be used to promote the welfare of the community. And the problem with this is? There isn’t a problem. None whatsoever.

As for barriers to diversity which need to be better understood and resolved, here’s one which has been well-documented: We don’t have a good gender balance, and we lose female editors at a rate higher than for male editors. I don’t find this acceptable, and think we need to figure out what’s happening and try to fix it. Is this a bad idea? Hardly. Might some rigorous data modeling help us with it? You bet. So…get to work.

Jason 3 years

Here you go:

Are you nerdy? If yes, Wikipedia editing += 3
Are you in a relationship? If yes, Wikipedia editing -= 2
Do you have kids? If yes, Wikipedia editing -= (number of kids)
Are you currently out of work? If yes, Wikipedia editing += 2
Are you upper or middle class? If yes, Wikipedia editing += 1
Are you a young male between the ages of 15 and 35? If yes, Wikipedia editing += 2

Listen, WMF, it’s obvious: people edit Wikipedia if they care, if they have the time, and if they have the ability. A fundamental truth is this: the people who want to edit Wikipedia will edit Wikipedia, but not all demographics equally want to do it. The Foundation cannot FORCE the editorship to be uniform across all demographics. Unfortunately, in the name of “promoting diversity”, the Foundation is going to interpret the results of this contest in a way that implies there are barriers to participation that must be fixed, and then make policy that tries to force a more uniform demographic. This is utterly, utterly misguided.

gwern 3 years

Reading more, I’m pretty troubled by the selection of data: http://www.kaggle.com/c/wikichallenge/forums/t/674/sampling-approach

What’s the point of predicting only about recent editors, whose ranks have already been thoroughly harrowed by the endless tightening of policy and rise of deletionists? Wikipedia already has a horrendous reputation for screwing over contributors*, so anyone who does much editing (and whose departure would be noticed by the criterion) is self-selecting now.

* just the other day cryonics researcher Mike Darwin told me he had no interest in contributing because he was sure all his contributions would be reverted under an extremely narrow reading of WP:RS, and wondered whether his BLP article could just be deleted since he certainly wasn’t going to edit it into an article worth a damn

gwern 3 years

JS, massive vandalism by anons is easy to deal with. The bots have massively cut down on the load, as has widespread rollbacker and undo. After a few months, you don’t even notice it.

What you notice are other editors fact-bombing your articles, putting them up for deletion, chopping them up in futile reformattings, and deleting random lines. That’s the incredibly dispiriting part of working on Wikipedia. Look at the parting messages in WP:MISSING. How many of them are complaining about the petty anonymous vandalism? Now how many sound like my little summary just now…

I will be *very* interested to see what the models put weight on. I suspect that the key parameter will be ‘how many of their contributions get deleted or massively changed’, and the longer the time their contributions live before getting deleted, the more predictive of eventual quitting… (This doesn’t necessarily require deleted edits to be available to the modellers, although that would certainly help.)

WereSpielChequers 3 years

Re JS: Pending changes is available; it is just that the community on the English-language wiki rejected it. I would like to see it implemented, or alternatively some sort of flagged revisions, as has been deployed in many language versions of Wikipedia such as German. But we can’t blame the Foundation if one of its projects opts out of an anti-vandalism tool, nor should we exaggerate and say we can’t get any initiative to deal with vandalism when so much has been done in recent years. The improved edit filters are preventing much if not most vandalism from happening, and most that does happen is now reverted almost immediately by bots without human intervention. So we are a long way from the days when all vandalism had to be reverted manually.

As for the model, good luck, this sounds useful if rather complex. Would I be right in thinking that deleted edits are not available to these modellers?

JS 3 years

We can’t get pending changes (or any initiative to prevent the massive vandalism that occurs daily), but things like this and http://www.wired.co.uk/news/archive/2011-01/10/making-wikipedia-more-welcoming are the focus.
