Wikimedia blog

News from inside the Wikimedia Foundation.org

Data analytics

Techies learn, make, win at Foundation’s first San Francisco hackathon

Participants at the San Francisco hackathon in January 2012

In January, 92 participants gathered in San Francisco to learn about Wikimedia technology and to build things in our first Bay Area hackathon.

After a kickoff speech by Foundation VP of Engineering Erik Möller (video), we led tutorials on the MediaWiki web API, customizing wikis with JavaScript user scripts and Gadgets, and building the Wikipedia Android app.  (We recorded each training; click those links for how-to guides and videos.)  We asked the participants to self-organize into teams and work on projects.  After their demonstration showcase, judges awarded a few prizes to the best demos.

Do It Yourself Analytics with Wikipedia

As you probably know, we regularly publish backups of the different Wikimedia projects, containing their complete editing history. As time progresses, these backups grow larger and become increasingly hard to analyze. To help the community, researchers and other interested people, we have developed a number of analytic tools for working with these large datasets. Today, we want to update you on these new tools: what they do and where you can find them. Please remember that they are all still in development:

  • Wikihadoop
  • DiffDB
  • WikiPride

Wikihadoop

Wikihadoop makes it possible to run Hadoop MapReduce jobs on the compressed XML dump files. This means that we can parallelize the processing of our XML files with embarrassing ease, so we no longer have to wait days or weeks for a job to finish.

We used Wikihadoop to create the diffs for all edits from the English XML dump that was generated in April of this year.
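To make the idea concrete, here is a minimal Python sketch of what a map step for diff creation could look like. It is not Wikihadoop's actual interface: the `Revision` class and the `map_revision_pair` function are hypothetical names, and the sketch assumes each mapper call receives two consecutive revisions of a page. It uses the standard-library difflib module to list the lines added and removed.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Revision:
    """Hypothetical, simplified view of one revision from an XML dump."""
    rev_id: int
    page_title: str
    text: str


def map_revision_pair(old: Revision, new: Revision):
    """Map step: return (rev_id, added_lines, removed_lines) for one pair.

    Each pair is independent of every other pair, which is what makes
    the job easy to spread across many machines.
    """
    added, removed = [], []
    for line in difflib.ndiff(old.text.splitlines(), new.text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are skipped.
    return new.rev_id, added, removed


old = Revision(1, "Example", "Hello world\nSecond line")
new = Revision(2, "Example", "Hello world\nSecond line, edited\nThird line")
print(map_revision_pair(old, new))
```

In a real job, the framework would hand millions of such pairs to mappers running in parallel and collect the results.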

DiffDB

DiffIndexer and DiffSearcher are the two components of the DiffDB. The DiffIndexer takes as raw input the diffs generated by Wikihadoop and creates a Lucene-based index. The DiffSearcher allows you to query the index so you can answer questions such as:

  • Who has added template X in the last month?
  • Who added more than 2000 characters to user talk pages in 2008?
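The second question can be sketched in plain Python. This is not the DiffSearcher interface (which queries a Lucene index): the `DiffRecord` fields and the sample data are assumptions made for illustration, and the sketch reads "more than 2000 characters" as a per-user total over the year.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DiffRecord:
    """Hypothetical record for one edit; the field names are assumptions."""
    user: str
    namespace: str      # e.g. "User talk"
    year: int
    chars_added: int


def users_over_threshold(diffs, namespace, year, threshold):
    """Return users whose total added characters exceed the threshold."""
    totals = defaultdict(int)
    for d in diffs:
        if d.namespace == namespace and d.year == year:
            totals[d.user] += d.chars_added
    return sorted(u for u, total in totals.items() if total > threshold)


# Made-up sample data.
diffs = [
    DiffRecord("Alice", "User talk", 2008, 1500),
    DiffRecord("Alice", "User talk", 2008, 900),
    DiffRecord("Bob", "User talk", 2008, 300),
    DiffRecord("Carol", "Article", 2008, 5000),
]
print(users_over_threshold(diffs, "User talk", 2008, 2000))
```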

WikiPride

Volume of contributions by registered users on the English Wikipedia until December 2010, colored by account age

Finally, WikiPride allows you to visualize the breakdown of a Wikipedia community by age of account and by the volume of contributed content. You need a Toolserver account to run this, but you will be able to generate cool charts.
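The kind of breakdown behind such a chart can be sketched as follows. This is an illustration only, not WikiPride's code: the tuple layout and the sample numbers are assumptions. Each edit is assigned to a cohort by the year the account was registered, and contributed bytes are summed per month and cohort.

```python
from collections import defaultdict


def volume_by_cohort(edits):
    """Sum contributed bytes per (month, registration-year cohort).

    `edits` is an iterable of (month, registration_year, bytes_added)
    tuples; this layout is an assumption for illustration.
    """
    totals = defaultdict(int)
    for month, registration_year, bytes_added in edits:
        totals[(month, registration_year)] += bytes_added
    return dict(totals)


# Made-up sample data.
edits = [
    ("2010-11", 2005, 1200),
    ("2010-11", 2009, 400),
    ("2010-12", 2005, 800),
    ("2010-12", 2005, 200),
]
for key, total in sorted(volume_by_cohort(edits).items()):
    print(key, total)
```

Plotting one colored band per cohort over time would then give a chart of contribution volume by account age.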

If you are having trouble getting Wikihadoop to run, then please contact me at dvanliere at wikimedia dot org and I am happy to point you in the right direction! Let the data crunching begin!

Diederik van Liere, Analytics Team

Data analytics at Wikimedia Foundation

This post is a follow-on to my previous post “What is Platform Engineering?”.  In this post, I’ll describe the history of our analytics work, talk about how we derive and distribute our statistics, and ask you to join us in building our platform.  Summary:  we’re hiring, and we want to tell you what a great opportunity this is.

Our Data Analytics team is responsible for building out our logging and data mining infrastructure, and for making Wikimedia-related statistics useful to other parts of the Foundation and the movement.  Up until fairly recently, Erik Zachte has been the main analytics person for Wikimedia (with support from many generalists here), working first as a volunteer building stats.wikimedia.org, then on behalf of Wikimedia Foundation starting in 2008.  The site started off as a large number of detailed page view and editor statistics about all Wikimedia wikis, large and small, and has since been augmented to include various summary formats and visualizations.  As the movement has grown, it has played an increasingly important role in helping guide our investments.

Announcing the WikiChallenge Winners

Wikipedia Participation Challenge

Over the past couple of months, the Wikimedia Foundation, Kaggle and ICDM organized a data competition. We asked data scientists around the world to use Wikipedia editor data and develop an algorithm that predicts the number of future edits, and in particular predicts correctly who will stop editing and who will continue to edit.

The response has been great! We had 96 teams compete, comprising in total 193 people who jointly submitted 1029 entries. You can have a look for yourself at the leaderboard.

We are very happy to announce that the brothers Ben and Fridolin Roth (team prognoZit) developed the winning algorithm. It is elegant, fast and accurate. Using Python and Octave they developed a linear regression algorithm. They used 13 features (2 are based on reverts and 11 are based on past editing behavior) to predict future editing activity. Both the source code and the wiki description of their algorithm are available. Congratulations to Ben and Fridolin!
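For readers new to the technique, here is a minimal sketch of linear regression by ordinary least squares. It is not the winners' code: it uses a single made-up feature (past edits) rather than their 13 features, and the numbers are invented purely to show the mechanics.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: edits in a past period vs. edits in the next one.
past_edits = [0, 10, 20, 30]
future_edits = [1, 6, 11, 16]

slope, intercept = fit_line(past_edits, future_edits)
predicted = slope * 40 + intercept   # prediction for an editor with 40 past edits
print(slope, intercept, predicted)
```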

Second place goes to Keith Herring. Submitting only 3 entries, he developed a highly accurate model, using random forests, and utilizing a total of 206 features. His model shows that a randomly selected Wikipedia editor who has been active in the past year has approximately an 85 percent probability of being inactive (no new edits) in the next 5 months. The most informative features captured both the edit timing and volume of an editor. Asked for his reasons to enter the challenge, Keith named his fascination for datasets and that

“I have a lot of respect for what Wikipedia has done for the accessibility of information. Any small contribution I can make to that cause is in my opinion time well spent.”

We also have two Honourable Mentions for participants who only used open source software. The first Honourable Mention is for Dell Zang (team zeditor) who used a machine learning technique called gradient boosting. His model mainly uses recent past editor activity.

The second Honourable Mention is for Roopesh Ranjan and Kalpit Desai (team Aardvarks). Using Python and R, they developed a random forest model as well. Their model used 113 features, mainly based on the number of reverts and past editor activity; see the wikipage describing their model.

All the documentation and source code have been made available; the main entry page is WikiChallenge on Meta.

What the four winning models have in common is that past activity and how often an editor is reverted are the strongest predictors of future editing behavior. This confirms our intuitions, but the fact that the four winning models are quite similar in terms of what data they used is a testament to the importance of these factors.

We want to congratulate all winners, as they have shown us in a quantitative way important factors in predicting editor retention. We also hope that people will continue to investigate the training dataset and keep refining their models so we get an even better understanding of the long-term dynamics of the Wikipedia community.

We are looking forward to using the algorithms of Ben & Fridolin and Keith in a production environment, and particularly to seeing if we can forecast the cumulative number of edits.

Finally, we want to thank the Kaggle people for helping in organizing this competition and our anonymous donor who has generously donated the prizes.

Diederik van Liere
External Consultant, Wikimedia Foundation

Howie Fung
Senior Product Manager, Wikimedia Foundation

2011-10-26: Edited to correct description of the winning algorithm

Three weeks left in the Wikipedia Participation Challenge

There are still three weeks left in the Wikipedia Participation Challenge (see prior blog post)!  So far, the competition has exceeded our expectations.  As of this morning, 78 teams (167 total individuals) from across the world have participated in the competition, with a total of 735 entries submitted. Half of these teams have beaten the benchmark we set at the beginning of the competition, which is a testament to the quality of the teams and their submissions.  We can’t wait to see what great algorithms the participants are developing.

There’s still time to jump in before the competition closes on September 20, 2011, so if you haven’t done so, download the data and start crunching.  And those who want to cheer from the sidelines may follow the competition on Kaggle’s leaderboard.

Howie Fung
Senior Product Manager, Wikimedia Foundation

Diederik van Liere
Research Consultant, Wikimedia Foundation

“Rate this Page” is Coming to the English Wikipedia

Since May, the Article Feedback Tool has been available on 100,000 English Wikipedia articles (see blog post). We have now kicked off full deployment to the English Wikipedia at a rate of about 370,000 articles per day and will continue at this rate until deployment is complete.

Ratings interface for the Article Feedback Tool

We wanted to take a moment to briefly recap what we’ve learned so far, what lies ahead, and how we can work with the community to improve this feature.  Features like Article Feedback can always be improved, so we will continue to experiment, measure, and iterate based on user and community feedback, testing, and analysis of how the feature is being used.

Rating data from the tool is available for your analysis — please dig in and let us know what you find. Toolserver developers can also access the rating data (minus personal information) in real-time to develop new dashboards and views of the data.

What We’ve Learned So Far

Readers like to provide feedback. The survey we’re currently running shows that over 90% of users find the ratings useful.  Many of these raters see the tool as a way to participate in article development: when asked why they rated an article, over half reported wanting to “positively affect the development of the page.”

Users of the feedback tool also left some enthusiastic comments (as well as some critical ones) about the tool. For example:

The option to rate a page should be available on every page, all the time, once per page per user per day.

As a high school librarian, I want my students to assess the sources of information they use.  This feature forces them to consider the reliability of Wiki articles.  Glad you have it.

Ratings seem like an interesting idea, I feel like the metrics used to determine the overall value of the page are viable, and I’ll be interested to see how the feature fares when it’s rolled out and has some miles under its belt.

The vast majority of raters were previously only readers of Wikipedia.  Of the registered users that rated an article, 66% had no prior editing activity.  For these registered users, rating an article represents their first participatory activity on Wikipedia.  These initial results show that we are starting to engage these users beyond just passive reading, and they seem to like it.

The feature brings in editors. One of the main Strategic Goals for the upcoming year is to increase the number of active editors contributing to WMF projects.  The initial data from the Article Feedback tool suggests that reader feedback could become a meaningful point of entry for future editors.

Once users have successfully submitted a rating, a randomly selected subset of them are shown an invitation to edit the page. Of the users that were invited to edit, 17% attempted to edit the page.  15% of those ended up successfully completing an edit.  These results strongly suggest that a feedback tool could successfully convert passive readers into active contributors to Wikipedia.  A rich text editor could make this path to editing even more promising.
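Multiplying the two rates gives the share of invited users who ended up completing an edit: about 2.55%, or roughly 25 in every 1,000. The short calculation below spells this out; the variable and function names are ours.

```python
invited_attempt_rate = 0.17   # share of invited users who attempted an edit
attempt_success_rate = 0.15   # share of those attempts that were completed


def completed_edit_rate(attempt_rate, success_rate):
    """Share of invited users who successfully completed an edit."""
    return attempt_rate * success_rate


overall = completed_edit_rate(invited_attempt_rate, attempt_success_rate)
print(f"{overall:.2%} of invited users completed an edit")
```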

While these initial results are certainly encouraging, we need to assess whether these editors are, in fact, improving Wikipedia.  We need to measure their level of activity, the quality of their contributions, their longevity, and other characteristics.

Ratings are a useful measure of some dimensions of quality.  In its current form, the Article Feedback Tool appears to provide useful feedback on some dimensions of quality, while the usefulness of the feedback on other dimensions of quality is still an open research question. Completeness and Trustworthy (formerly “Well-Sourced”) appear to be dimensions where readers can provide a reasonable level of assessment.  Research shows that ratings along these dimensions are correlated with article length and number of citations, respectively.  We need to determine whether the ratings in “Objective” and “Well-Written” meaningfully predict quality in those categories. We released public dumps of AFT data and would love to hear about new approaches to measuring how well ratings reflect article quality.
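If you want to check such a relationship yourself in the public dumps, a Pearson correlation is a simple starting point. The sketch below is an illustration under assumptions: the article lengths and mean ratings are made-up numbers, not AFT data, and we do not know which statistic the research mentioned above used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up example: article length in bytes and mean completeness rating.
article_lengths = [1200, 5400, 9800, 20500, 43000]
completeness = [1.8, 2.9, 3.1, 3.9, 4.4]

r = pearson(article_lengths, completeness)
print(round(r, 3))
```

A value close to 1 would indicate that longer articles tend to receive higher completeness ratings; a value near 0 would indicate no linear relationship.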

We received feedback from community members on how to improve the feature. We’ve received a fair amount of feedback from the community on the usefulness of AFT, mainly through IRC Office Hours and on the AFT discussion page.  There have been many suggestions on how to make the feedback tool more valuable for the community.  For example, the idea of having a “Suggestions for Improvement”-type comment box has been raised several times.  Such a box would enable readers to provide concrete feedback directly to the editing community on how to improve an article.  We plan to develop some kind of commenting system in the near future.

Illustration of a potential "Suggest Improvements" feature

AFT could help surface problematic articles in real time, as well as articles that may qualify for increased visibility. We’ve started experimenting with a dashboard for surfacing both highly rated and poorly rated articles.  Ultimately, the dashboard could help identify articles that need attention (e.g., articles that have been recently vandalized) as well as articles that might be considered for increased visibility (e.g., candidates for Featured Articles).  We will continue to experiment with algorithms that help surface trends in articles that may be useful for the editing community.
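One simple way to build such a list is to rank articles by their mean rating while ignoring articles with too few ratings to be meaningful. The sketch below is our illustration, not the dashboard's actual algorithm; the `min_ratings` cutoff and the sample data are assumptions.

```python
from statistics import mean


def rank_articles(ratings, min_ratings=3):
    """Return (title, mean rating) pairs, highest rated first.

    Articles with fewer than `min_ratings` ratings are left out, since
    a single rating says little about an article.
    """
    scored = [
        (title, mean(scores))
        for title, scores in ratings.items()
        if len(scores) >= min_ratings
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up ratings on a 1-5 scale.
ratings = {
    "Article A": [5, 4, 5, 5],
    "Article B": [1, 2, 1],
    "Article C": [5],            # too few ratings to rank
    "Article D": [3, 3, 4],
}
ranked = rank_articles(ratings)
print(ranked)
```

The top of the list suggests candidates for increased visibility, and the bottom suggests articles that may need attention.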

Next Steps

Over the coming weeks, we will continue to roll out the Article Feedback Tool on the English Wikipedia.  Once this rollout is complete, we will start planning the next version of the tool.  For those interested in following the discussion, we will be documenting progress on the Article Feedback Project Page.  We would love to get your feedback (pun intended!) on how the feature is being used, what’s working, and what might be changed.  We also encourage folks to dig into the data.  Once the feature is fully deployed, there will be mountains of data to sift through and analyze, which will be a boon to researchers and developers alike.

We’d especially like to encourage members of the community to get involved in the further development of the feature.  If you’re interested in getting involved (e.g., design input, data analysis/interpretation, bug-squashing, etc.), please drop a note on the project talk page.

Howie Fung, Senior Product Manager

Dario Taraborelli, Senior Research Analyst

Erik Moeller, VP Engineering and Product Development

Data Competition: Announcing the Wikipedia Participation Challenge

We are pleased to announce the launch of the Wikipedia Participation Challenge, a data modeling competition to develop an algorithm that predicts future editing activity on Wikipedia. The competition is hosted by Kaggle, a platform for data modeling and prediction competitions.  The Participation Challenge is open to community members and anyone else who is interested in analyzing Wikipedia data.  This is the first of two data competitions the Wikimedia Foundation will sponsor this year.

The goal of this competition is to gain a better understanding of the factors that encourage or discourage people from editing Wikipedia. Increasing the number of active editors is one of our strategic priorities. Both the Wikipedia communities and the Wikimedia Foundation stand to benefit from models that quantify the factors that determine whether a Wikipedia editor is likely to continue contributing. The competition asks contestants to develop a model to predict the number of edits a given editor will make in six months’ time.

The data used in this competition comes from the publicly available English Wikipedia XML data dump.  An anonymous donor has generously contributed $10,000 as prize money. There will be a Grand Prize for the best prediction, as well as special prizes awarded for the use of open source software. The Grand Prize winner will also be given the opportunity to present their prediction model at the 2011 IEEE International Conference on Data Mining.  The competition starts today and will continue until September 20, 2011.

Head over to our competition portal, download the data, and start crunching! And don’t forget to follow us on Twitter: #wikichallenge and @dvanliere.

Howie Fung
Senior Product Manager, Wikimedia Foundation

Diederik van Liere
Research Consultant, Wikimedia Foundation

News about the Bookshelf Project and new direction for Fellowship

My time as a Fellow of the Wikimedia Foundation has been divided between the Bookshelf Project and the Account Creation Improvement Project. But now my Fellowship has taken a new and exciting turn.

Since I started, the Bookshelf Project has grown steadily. We have more and more people helping out, and the number of translations is increasing every week. The brochure “Welcome to Wikipedia,” for instance, has been translated into five languages, most recently French, and it has been a popular handout for the newcomers in the Public Policy Initiative.

The Bookshelf pages have also become easier to navigate. I have filled the pages with many of the videos, books and handouts that were created previously, alongside the new materials. In total, around 100 different works have been collected and organized on those pages. Hopefully these pages will become the go-to library for anyone who wants to find out more about Wikipedia and its sister projects. The address is easy to remember: http://bookshelf.wikimedia.org. If you have more material that you think fits in the Bookshelves, feel free to add to them.

But most exciting is this third piece of news:

Starting today, we will help you spread the Bookshelf materials. Go to the Bookshelf pages, select any material that you want to give out at a conference or event – and apply for printing money from the Wikimedia Foundation! We have a simplified grants process. See more details here: http://meta.wikimedia.org/Bookshelf Grants. With this grant we hope to help you reach out to many new individuals and inspire them to edit.

Lastly, in the upcoming weeks, the new Wikipedia Cheat Sheet will be finished. It is right now being designed and printed. If you want to translate it, we will make it very easy for you. Then you can get a grant and have it printed in just a few weeks.

This marks the end of my full-time engagement in the Bookshelf project, but I will still check in on it now and then, and I would love to see what happens with it in the future.

What happens next?

One of the proposed designs for the new account creation processes.

With all of these milestones reached, my Fellowship changes. I will concentrate fully on the Account Creation Improvement Project. After having performed surveys and tests of the account creation process, we have discovered that this project can have a real impact on the number of people with new accounts that actually start to edit. So with a few volunteers and support from the tech staff, I have started working on an approach to the next tests. First we have set up a new tracking system. Now we are working on creating two high-quality account creation processes that are significantly different from the existing process. We will start testing these very soon and see how many more new users we can get to the point of starting to edit.

After about a month of testing, this will lead to an increased understanding of what we can do to get the new users to stay. Of course, we would love to have your input and ideas.

Lennart Guldbrandsson
Community Fellow

New interactive visualization shows global distribution of Wikipedia edits

Wikimedia Data Analyst Erik Zachte recently unveiled a new interactive visualization showing the global distribution of edits for various language editions of Wikipedia.

The animation shows a global map of edits made on May 10, 2011.

This first version allows users to see where edits are coming from for a given day. Right now, the day is fixed but fairly recent.

You can control the parameters of this interactive visualization by using keyboard shortcuts available in a “Help” menu (press ‘H’). For example, ‘E’ switches between different event markers.

Hit 'M' to switch to a black background, and 'E' to switch between different styles of event markers. Here, language codes are shown instead of bubbles.

The data behind these graphics comes from our Squid logs, which usually record about 400,000 edits a day. See Erik’s post to read more about how the visualizations were made.

By zooming on a particular area (‘+’ or mouse scroll), or filtering the edits by language (‘N’ or space bar), interesting things can surface. For example, bubble maps and heat maps reflect densely populated areas with easy Internet access.

Hit 'N' or the Space bar to display a specific language. Here, edits to the English Wikipedia are shown on a bubble map ('2').

Three types of displays are available, all showing the spatial distribution of edits over time in a different way: an accelerated animation of edits over a day (‘1’), a bubble map of the same edits over a day (‘2’), and a heat map of edits over a day (‘3’).

The animation over the course of the day also shows the levels of activity depending on the time it is in various timezones. Compare for example the activity of the Spanish Wikipedia in Spain and Latin America over the course of the day.

Hit '3' to switch to the heat map of edits combined on a single day. This heat map shows edits to the Spanish Wikipedia, mostly distributed in Spain and Latin America.

Similarly, the map shows that most edits to the Chinese Wikipedia are made from outside of mainland China (Hong Kong and Taiwan):

Zoom in using the + key, or the mouse scroll.

Open the visualization and play with it yourself!

In the tradition of free software that Wikimedia is attached to, this visualization was entirely created using HTML5 (canvas) and JavaScript, and no proprietary tool is necessary to view the animation.

The visualization works in the most recent browsers. If for some reason it doesn’t work for you, below is a short video to give you an overview of what it looks like when animated.

The video is also available on Wikimedia Commons, along with more screenshots.


Guillaume Paumier

Account Creation Improvement Project Update

As you may know from Sue’s March 2011 update, the Wikimedia Foundation has made it one of our highest priorities to improve the experience of new editors, and we thought we’d start right at the beginning: when a potential new editor creates an account.

The Wikimedia Foundation’s Community Department has been studying how we can more effectively invite users who create new accounts to actually start editing. Since February, the Account Creation Improvement Project (ACIP) has been experimenting with different user interface messages and landing pages in the account creation flow (see their results and testing content to-date).

We didn’t have an A/B testing infrastructure that supported this work, so while ACIP has performed the first tests sequentially, we’ve now deployed a modification to our ClickTracking extension to English Wikipedia which will allow us to run multiple tests in parallel and record the results.

You’ll notice the “Log in/create account” link on the English Wikipedia will send you to several possible randomized log in screens, recognizable by the “ACP” identifier in the address.  This is from the newly created CustomUserSignup extension. Over the next few months, we’ll be varying the look and messaging of these screens to see what kind of impact that has on new editors, and sharing our findings. Our testing framework will allow us to bucket-test small tweaks to the interface and measure the number of accounts created and edits made by users (in aggregate or on a per-session basis) who have gone through different flows.

What data we are storing

We set a new cookie, with a lifetime of three months, when someone visits the “Log in/create account” page.  This cookie will be used to track the following information:

  • Which account creation messaging group the user was placed in (identified as ACP1, ACP2 or ACP3 for now)
  • What version of the account creation campaign they received
  • Whether the particular user made it to the end of the account creation process, or whether they dropped off after reaching the login screen or the account creation screen
  • If (and only if) the user creates a new account, the number of edits or previews during the course of the trial

The information is associated with browser sessions (each of which has an individual unique identifier), not with an individual user or user account.
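As an illustration of what such session-level data allows, the sketch below models one record per browser session and computes an aggregate completion rate per messaging group. The class, its field names and the stage labels are hypothetical; they mirror the list above rather than the actual ClickTracking schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SessionRecord:
    """Hypothetical per-session record; not tied to any user account."""
    session_id: str
    bucket: str              # "ACP1", "ACP2" or "ACP3"
    campaign_version: int
    reached_stage: str       # "login", "create_account" or "completed"
    edits_or_previews: int = 0


def completion_rate_by_bucket(records):
    """Share of sessions in each bucket that completed account creation."""
    seen = Counter()
    completed = Counter()
    for r in records:
        seen[r.bucket] += 1
        if r.reached_stage == "completed":
            completed[r.bucket] += 1
    return {bucket: completed[bucket] / seen[bucket] for bucket in seen}


# Made-up sample sessions.
records = [
    SessionRecord("s1", "ACP1", 1, "completed", edits_or_previews=2),
    SessionRecord("s2", "ACP1", 1, "login"),
    SessionRecord("s3", "ACP2", 1, "completed"),
]
print(completion_rate_by_bucket(records))
```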

Anyone visiting the login page or the account creation page for English Wikipedia will have this cookie set.  This is to make sure that we always provide the same wording to a particular visitor, so as not to invalidate our test.  We will stop setting this cookie at the conclusion of this work, though we will likely perform other similar tests in the future.
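The logic of keeping a visitor in the same group can be sketched in a few lines. This is a simplified Python illustration, not the extension's code: the cookie name `acp_bucket` is hypothetical, and a plain dictionary stands in for the browser's cookies.

```python
import random

BUCKETS = ("ACP1", "ACP2", "ACP3")
COOKIE_NAME = "acp_bucket"      # hypothetical cookie name
COOKIE_LIFETIME_DAYS = 90       # roughly three months


def assign_bucket(cookies, rng=random):
    """Return the visitor's bucket, choosing one at random on first visit.

    `cookies` is a dict standing in for the browser's cookie jar. A real
    implementation would also set the cookie's expiry to the lifetime above.
    """
    bucket = cookies.get(COOKIE_NAME)
    if bucket not in BUCKETS:
        bucket = rng.choice(BUCKETS)
        cookies[COOKIE_NAME] = bucket
    return bucket


cookies = {}
first = assign_bucket(cookies)
again = assign_bucket(cookies)   # same visitor, later page view
print(first, again)
```

Because the choice is stored on first visit and reused afterwards, a returning visitor always sees the same wording for as long as the cookie lasts.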

Because of the privacy-sensitive nature of the system, we have a limit on the level of granularity of our findings. For example, we won’t be able to create a plot of users vs edits, because we don’t have user-level data.

We look forward to the findings of the Account Creation Improvement Project, which will ultimately help us create a better sign-up experience for all users. Independent of this project, the CustomUserSignup extension may also prove useful to other outreach projects, by making it possible to create customized sign-up forms (e.g. for student workshops or e-mail invitations).

Nimish Gautam