Comments on: Announcing the WikiChallenge Winners

By: Neil

Neil — Fri, 09 Oct 2015 19:00:02 +0000

Could the admin of the website recheck the URL for the source code? For learning purposes, I would like to see the whole project.
Some of the source code URLs are broken (http://dumps.wikimedia.org/other/wikichallenge/). Could you re-upload the files?

By: Teh IP

Teh IP — Fri, 18 Nov 2011 15:53:10 +0000

Thanks. That was at least an insight. 😉
I wonder what input factors you had to consider? If it is just a giant number-crunching exercise based on the “easy to get out of the database” aspects, then you may not be considering the factors that would give you the strongest insights. Demographics, for instance. Also perhaps interaction (are certain people “attractors/motivators” while others are “driving people away”)?
I would not give up on finding this out, either. Perhaps fancy shmancy neural nets are not the way to get key insights; doing some semi-quantitative work based on cases, interviews, etc. might lead to better ones. I also find that this sort of work often generates new hypotheses and ideas, because you learn things and come up with new “factors” by getting a little dirty and looking at some cases up close.
It’s comparable to a manufacturing optimization problem. We could just use the data we have (available now, easily, and in big numbers) from the shop floor computer system. That will give us some things…it will probably be robust (if you do out-of-sample confirmations). And you MIGHT learn something counter-intuitive, just because tribal wisdom doesn’t match the numbers. But there is a pretty decent chance you are not tracking a key feature. There is also a good chance that problems somewhere in data input/tracking etc. get missed…if it is all an arm’s-length math game.
The next step up from that might be to get on the whiteboard and brainstorm, list ideas, diagram the process, make those fishbone causation charts, etc. and just list some possible factors. Then consider which ones have easy data to check and which don’t. If there are some that you think may be very critical…but lack data on…then you figure out some workarounds (e.g. cases) to at least improve your Bayesian hunches and try to validate/invalidate the hypotheses.
Perhaps even more of a step up is to do a walkthrough of the assembly line (or even take a shift or two and build the product personally!) This will probably give a few new ideas…and it may give insights on how to prioritize which factors to investigate, spark ideas for how to do workarounds where the factors are not tracked in the database, etc.
Of course, all the processes are important. And really it is a feedback loop as well.
That said…I’m still blown away by how remote from actionable insights the work here was. I went to the winner page and the discussion was all matrices and rows and set theory and such…nothing like…F=MA or V=IR or the like. 😉
Sorry…if this is all motherhood and apple pie, but I felt it needed to be said. And I DO appreciate one person (you) boiling down some of the learning.

By: Yang

Yang — Fri, 18 Nov 2011 10:53:27 +0000

Teh IP: That’s a thoughtful question. It’s generally the case that more complex models (which many of these winning entries use) tend to be less explainable/understandable. More important for the WikiChallenge, though, is that prediction is almost entirely dominated by features based on the times/rates/etc. at which these editors have historically made edits, essentially capturing (obvious) intuitions like, “editors who have been making edits at a steady rate in recent months are more likely to stay.” (I’m a competitor who finished in 8th place and read the winners’ writeups.)
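
As a minimal illustrative sketch of what "features based on the times/rates of edits" can look like (this is not taken from any entrant's code; the function name, the input shape of `(editor_id, timestamp)` pairs, and the 30/150-day windows are all assumptions made for the example):

```python
from collections import defaultdict
from datetime import datetime


def recency_features(edits, cutoff, window_days=(30, 150)):
    """Build simple per-editor activity features from edit timestamps.

    edits  : iterable of (editor_id, datetime) pairs (hypothetical input shape)
    cutoff : datetime; only edits strictly before it are used, so the
             features never see the period being predicted
    """
    by_editor = defaultdict(list)
    for editor, ts in edits:
        if ts < cutoff:
            by_editor[editor].append(ts)

    features = {}
    for editor, stamps in by_editor.items():
        # Age of each edit in whole days, measured back from the cutoff.
        ages = [(cutoff - ts).days for ts in stamps]
        feats = {
            "days_since_last_edit": min(ages),
            "total_edits": len(ages),
        }
        # Edit counts in trailing windows: a crude stand-in for "recent rate".
        for w in window_days:
            feats[f"edits_last_{w}d"] = sum(1 for a in ages if a < w)
        features[editor] = feats
    return features
```

An editor with a small `days_since_last_edit` and non-zero counts in the short window is the "steady recent activity" case described above; any model on top of these features is a separate choice.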

By: Teh IP

Teh IP — Wed, 16 Nov 2011 04:19:58 +0000

So wait. What did we learn about the editors? This post really just details all the fancy shmancy models and such. But what were the key insights about “types of editors likely to leave/stay”? Imagine I am a CEO and want the “takeaway.” Is there something that you can express as easy-to-understand descriptions of behavior? And does it perhaps give us even deeper insights into why X leads to Y?

By: Wikimedia blog » Blog Archive » Wikimedia Foundation Report, October 2011

Thu, 10 Nov 2011 15:41:31 +0000

[…] [1] http://diff.wikimedia.org/2011/10/26/announcing-the-wikichallenge-winners/ […]

By: neitway

neitway — Sat, 29 Oct 2011 08:08:34 +0000

“I have a lot of respect for what Wikipedia has done for the accessibility of information. Any small contribution I can make to that cause is in my opinion time well spent.” Good!


By: Asaf

Asaf — Thu, 27 Oct 2011 23:28:29 +0000

The winning algorithm’s page directs to the source code on dumps.wikimedia.org, but that URL yields a 404. 🙁

By: KTH

KTH — Wed, 26 Oct 2011 17:53:57 +0000

Congrats to everyone involved. It was a fun competition, and I hope of great benefit to Wikipedia.
I want to point out a typo, though: Benjamin and Fridolin Roth did not use random forests, but rather vanilla/standard linear regression. Unfortunately, there turned out to be a mistake in the data set construction: it was not properly randomized.
Through no fault of their own, Benjamin and Fridolin stumbled upon this error and used it as the primary input to their model. The error happens to be equivalent to knowing a large fraction of the answer (future edits) before predicting it. Unfortunately, this means that their model is invalid and can’t be used by Wikipedia to understand participation. See the following link for more detail:
http://www.kaggle.com/forums/t/980/wikipedia-participation-challenge-an-unfortunate-ending
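
The linked thread has the details of the actual problem. Purely as a generic illustration of how this kind of leak can be screened for (not a reconstruction of what happened in this data set), one rough check is to see whether any single input column explains the target almost perfectly on its own. The function names and the 0.9 threshold below are assumptions chosen for the example:

```python
from statistics import mean


def r_squared(xs, ys):
    """Squared Pearson correlation between one feature column and the target."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return (sxy * sxy) / (sxx * syy)


def suspicious_features(columns, target, threshold=0.9):
    """Return names of columns whose single-feature R^2 meets the threshold.

    columns : dict mapping feature name -> list of values (hypothetical shape)
    target  : list of target values, same length as each column
    """
    return sorted(
        name for name, xs in columns.items()
        if r_squared(xs, target) >= threshold
    )
```

A flagged column is not proof of leakage, only a prompt to ask how that column could know so much about the future.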