Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Data analytics

Data Competition: Announcing the Wikipedia Participation Challenge

We are pleased to announce the launch of the Wikipedia Participation Challenge, a data modeling competition to develop an algorithm that predicts future editing activity on Wikipedia. The competition is hosted by Kaggle, a platform for data modeling and prediction competitions.  The Participation Challenge is open to community members and anyone else who is interested in analyzing Wikipedia data.  This is the first of two data competitions the Wikimedia Foundation will sponsor this year.

The goal of this competition is to gain a better understanding of the factors that encourage or discourage people from editing Wikipedia. Increasing the number of active editors is one of our strategic priorities. Both the Wikipedia communities and the Wikimedia Foundation stand to benefit from models that quantify the factors that determine whether a Wikipedia editor is likely to continue contributing. The competition asks contestants to develop a model to predict the number of edits a given editor will make in six months’ time.
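To give a feel for the task, here is a minimal sketch of a naive "persistence" baseline: it predicts that each editor will make as many edits in the coming period as they made in the trailing window. This is only an illustration, not the competition's data format or scoring method; the function name, the (editor, timestamp) input layout and the 180-day window are all assumptions made for this example.

```python
from collections import Counter
from datetime import datetime, timedelta


def persistence_baseline(edits, cutoff, window_days=180):
    """Predict future edits per editor as their edit count in the trailing window.

    `edits` is a hypothetical list of (editor_id, timestamp) pairs.
    Editors with no edits in the window are simply absent (predicted 0).
    """
    start = cutoff - timedelta(days=window_days)
    counts = Counter(editor for editor, ts in edits if start <= ts < cutoff)
    return dict(counts)


# Made-up sample data, for illustration only.
edits = [
    ("alice", datetime(2011, 1, 5)),
    ("alice", datetime(2011, 5, 20)),
    ("bob", datetime(2010, 3, 1)),   # falls outside the trailing window
    ("bob", datetime(2011, 6, 1)),
]
predictions = persistence_baseline(edits, cutoff=datetime(2011, 6, 28))
print(predictions)
```

A real entry would presumably do better by adding features such as account age or how recently the editor was last active, but a baseline like this is a useful yardstick to beat.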

The data used in this competition comes from the publicly available English Wikipedia XML data dump.  An anonymous donor has generously contributed $10,000 as prize money. There will be a Grand Prize for the best prediction, as well as special prizes awarded for the use of open source software. The Grand Prize winner will also be given the opportunity to present their prediction model at the 2011 IEEE International Conference on Data Mining.  The competition starts today and will continue until September 20, 2011.

Head over to our competition portal, download the data, and start crunching! And don’t forget to follow us on Twitter: #wikichallenge and @dvanliere.

Howie Fung
Senior Product Manager, Wikimedia Foundation

Diederik van Liere
Research Consultant, Wikimedia Foundation

News about the Bookshelf Project and new direction for Fellowship

My time as a Fellow of the Wikimedia Foundation has been divided between the Bookshelf Project and the Account Creation Improvement Project. But now my Fellowship has taken a new and exciting turn.

Since I started, the Bookshelf Project has grown steadily. We have more and more people helping out, and the number of translations is increasing every week. The brochure “Welcome to Wikipedia,” for instance, has been translated into five languages, most recently French, and it has been a popular handout for the newcomers in the Public Policy Initiative.

The Bookshelf pages have also become easier to navigate. I have filled the pages with many of the videos, books and handouts that were created previously, alongside the new materials. In total, around 100 different works have been collected and organized on those pages. Hopefully these pages will become the go-to library for anyone who wants to find out more about Wikipedia and its sister projects. The address is easy to remember: http://bookshelf.wikimedia.org. If you have more material that you think fits in the Bookshelves, feel free to add to them.

But most exciting is this third piece of news:

Starting today, we will help you spread the Bookshelf materials. Go to the Bookshelf pages, select any material that you want to give out at a conference or event – and apply for printing money from the Wikimedia Foundation! We have a simplified grants process. See more details here: http://meta.wikimedia.org/Bookshelf Grants. With this grant we hope to help you reach out to many new individuals and inspire them to edit.

Lastly, in the upcoming weeks, the new Wikipedia Cheat Sheet will be finished. It is currently being designed and printed. If you want to translate it, we will make it very easy for you. Then you can get a grant and have it printed in just a few weeks.

This marks the end of my full-time engagement in the Bookshelf project, but I will still check in on it now and then, and I would love to see what happens with it in the future.

What happens next?

One of the proposed designs for the new account creation processes.

With all of these milestones reached, my Fellowship changes. I will concentrate fully on the Account Creation Improvement Project. After having performed surveys and tests of the account creation process, we have discovered that this project can have a real impact on the number of people with new accounts who actually start to edit. So with a few volunteers and support from the tech staff, I have started working on an approach to the next tests. First, we set up a new tracking system. Now we are working on creating two high-quality account creation processes that are significantly different from the existing process. We will start testing these very soon and see how many more new users we can get to the point of starting to edit.

After about a month of testing, this will lead to an increased understanding of what we can do to get the new users to stay. Of course, we would love to have your input and ideas.

Lennart Guldbrandsson
Community Fellow

New interactive visualization shows global distribution of Wikipedia edits

Wikimedia Data Analyst Erik Zachte recently unveiled a new interactive visualization showing the global distribution of edits for various language editions of Wikipedia.

The animation shows a global map of edits made on May 10, 2011.

This first version allows users to see where edits are coming from for a given day. Right now, the day is fixed but fairly recent.

You can control the parameters of this interactive visualization by using keyboard shortcuts available in a “Help” menu (press ‘H’). For example, ‘E’ switches between different event markers.

Hit 'M' to switch to a black background, and 'E' to switch between different styles of event markers. Here, language codes are shown instead of bubbles.

The data behind these graphics comes from our Squid logs, which usually record about 400,000 edits a day. See Erik’s post to read more about how the visualizations were made.

By zooming in on a particular area (‘+’ or mouse scroll), or filtering the edits by language (‘N’ or space bar), interesting things can surface. For example, bubble maps and heat maps reflect densely populated areas with easy Internet access.

Hit 'N' or the Space bar to display a specific language. Here, edits to the English Wikipedia are shown on a bubble map ('2').

Three types of displays are available, each showing the spatial distribution of edits over time in a different way: an accelerated animation of edits over a day ('1'), a bubble map of the same edits over a day ('2'), and a heat map of edits over a day ('3').
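For readers curious about what sits behind a heat map like this, the basic idea can be sketched as counting edits per grid cell. The snippet below is an assumption-laden illustration in Python, not the code behind Erik's visualization (which is written in JavaScript); the function name, the (latitude, longitude) input and the 5-degree cell size are all hypothetical.

```python
import math
from collections import Counter


def bin_edits(points, cell_deg=5.0):
    """Count edits per latitude/longitude grid cell.

    `points` is a hypothetical list of (lat, lon) pairs in degrees.
    Returns a Counter keyed by (row, col), where row 0 starts at the
    South Pole and col 0 starts at longitude -180.
    """
    grid = Counter()
    for lat, lon in points:
        row = math.floor((lat + 90.0) / cell_deg)
        col = math.floor((lon + 180.0) / cell_deg)
        grid[(row, col)] += 1
    return grid


# Made-up coordinates: two edits in central Spain, one near Buenos Aires.
points = [(40.4, -3.7), (40.9, -4.1), (-34.6, -58.4)]
grid = bin_edits(points)
print(grid.most_common())
```

Cells with higher counts would be drawn in "hotter" colors; a bubble map could use the same counts to size a circle per cell instead.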

The animation over the course of the day also shows the levels of activity depending on the time it is in various timezones. Compare for example the activity of the Spanish Wikipedia in Spain and Latin America over the course of the day.

Hit '3' to switch to the heat map of edits combined on a single day. This heat map shows edits to the Spanish Wikipedia, mostly distributed in Spain and Latin America.

Similarly, the map shows that most edits to the Chinese Wikipedia are made from outside of mainland China (Hong Kong and Taiwan):

Zoom in using the + key, or the mouse scroll.

Open the visualization and play with it yourself!

In the tradition of free software that Wikimedia is attached to, this visualization was entirely created using HTML5 (canvas) and JavaScript, and no proprietary tool is necessary to view the animation.

The visualization works in the most recent browsers. If for some reason it doesn’t work for you, below is a short video to give you an overview of what it looks like when animated.

The video is also available on Wikimedia Commons, along with more screenshots.


Guillaume Paumier

Account Creation Improvement Project Update

As you may know from Sue’s March 2011 update, the Wikimedia Foundation has made it one of our highest priorities to improve the experience of new editors, and we thought we’d start right at the beginning: the moment a potential new editor creates an account.

The Wikimedia Foundation’s Community Department has been studying how we can more effectively invite users who create new accounts to actually start editing. Since February, the Account Creation Improvement Project (ACIP) has been experimenting with different user interface messages and landing pages in the account creation flow (see their results and testing content to-date).

We didn’t have an A/B testing infrastructure that supported this work, so while ACIP has performed the first tests sequentially, we’ve now deployed a modification to our ClickTracking extension to English Wikipedia which will allow us to run multiple tests in parallel and record the results.

You’ll notice the “Log in/create account” link on the English Wikipedia will send you to one of several randomized login screens, recognizable by the “ACP” identifier in the address. This is from the newly created CustomUserSignup extension. Over the next few months, we’ll be varying the look and messaging of these screens to see what kind of impact that has on new editors, and sharing our findings. Our testing framework will allow us to bucket-test small tweaks to the interface and measure the number of accounts created and edits made by users (in aggregate or on a per-session basis) who have gone through different flows.

What data we are storing

We are storing a new cookie upon visiting the “Log in/create account” page, with a lifetime of three months.  This cookie will be used to track the following information:

  • Which account creation messaging group the user was placed in (identified as ACP1, ACP2 or ACP3 for now)
  • What version of the account creation campaign they received
  • Whether the particular user made it to the end of the account creation process, or whether they dropped off after reaching the login screen or the account creation screen
  • If (and only if) the user creates a new account, the number of edits or previews during the course of the trial

The information is associated with browser sessions (each of which has an individual unique identifier), not with an individual user or user account.

Anyone visiting the login page or the account creation page for English Wikipedia will have this cookie set. This is to make sure that we always provide the same wording to a particular visitor, so as not to invalidate our test. We will stop setting this cookie at the conclusion of this work, though we will likely perform other similar tests in the future.
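The "sticky" group assignment described here can be sketched in a few lines. This is a simplified illustration in Python, not the actual ClickTracking or CustomUserSignup code (those are MediaWiki extensions); the cookie name `acp_bucket` and the function are hypothetical, and a plain dictionary stands in for the browser's cookie jar.

```python
import random

# Group names as given in the post.
GROUPS = ("ACP1", "ACP2", "ACP3")


def get_bucket(cookies, rng=random):
    """Return the visitor's test group, assigning one at random on first visit.

    `cookies` is a dict standing in for the browser's cookies; the key
    "acp_bucket" is a hypothetical name chosen for this example.
    """
    if "acp_bucket" not in cookies:
        cookies["acp_bucket"] = rng.choice(GROUPS)
    return cookies["acp_bucket"]


cookies = {}
first = get_bucket(cookies)
# Later visits in the same browser session see the same group.
repeat_visits = [get_bucket(cookies) for _ in range(10)]
print(first, repeat_visits)
```

Because the group is read back from the cookie on every later visit, a visitor keeps seeing the same wording for as long as the cookie lives, which is what keeps the test groups clean.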

Because of the privacy-sensitive nature of the system, we have a limit on the level of granularity of our findings. For example, we won’t be able to create a plot of users vs edits, because we don’t have user-level data.

We look forward to the findings of the Account Creation Improvement Project, which will ultimately help us create a better sign-up experience for all users. Independent of this project, the CustomUserSignup extension may also prove useful to other outreach projects, by making it possible to create customized sign-up forms (e.g. for student workshops or e-mail invitations).

Nimish Gautam

Open Web Analytics 1.4

Open Web Analytics 1.4.0rc3 is out!  You probably don’t care, do you?  You should!  At least we do!

Anyway, let’s start at the beginning:

As we strategized about future development of Wikimedia properties, it became abundantly clear that the measurement tools that we have are insufficient to make the decisions we need to make.  This was a key recommendation from the Strategy task force. We evaluated several possible analytics frameworks as a supplement or even replacement for our homegrown system(s).  After evaluating a couple of open source solutions (while keeping an open mind about the possible need to go with a proprietary solution), we decided to try out Open Web Analytics (OWA) for this year’s fundraiser, with the goal of evaluating it for broader use.

OWA is a PHP-based analytics tool which provides very sophisticated capabilities for real-time data analysis, providing many tools offered by proprietary counterparts. For us, OWA seems to hit the right balance of flexibility and scalability, with the added benefit that there was already an integration plugin for MediaWiki.  Over the past few months, we’ve been working with Peter Adams, the designer of OWA, to adapt OWA for our needs and to make sure that it would work at the scale that we operate at.

Many of the features in the 1.4 release were made initially for our use, but are general-purpose features that many OWA users should be able to benefit from.  We wanted to track how successful we were at getting people from banners, to letter, to donation, so Peter added a couple of features called “conversion goal tracking” and “goal funnels” which will help us figure out where people might be dropping off, but can also be used for general conversion analysis on any OWA-enabled site.  We also needed to keep track of all of this on a per-banner basis, as well as knowing whether the user clicked on the banner or on the “Donate” link in the sidebar, so the “campaign tracking” feature was added.
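To illustrate what a goal funnel measures, here is a small Python sketch that computes the share of visitors who continue from each step to the next. It does not use OWA's API (OWA is a PHP tool); the function name is hypothetical and the visitor counts are made up purely for illustration.

```python
def funnel_report(step_counts):
    """Return (step, next_step, continuation_rate) for each adjacent pair.

    `step_counts` is an ordered list of (step_name, visitor_count) pairs.
    """
    report = []
    for (name, count), (next_name, next_count) in zip(step_counts, step_counts[1:]):
        rate = next_count / count if count else 0.0
        report.append((name, next_name, rate))
    return report


# Made-up numbers following the banner -> letter -> donation path.
steps = [("banner", 1000), ("letter", 200), ("donation", 50)]
report = funnel_report(steps)
for name, next_name, rate in report:
    print(f"{name} -> {next_name}: {rate:.0%} continue")
```

The step with the lowest continuation rate is where most people drop off, and so the first place to look for improvements.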

Finally, we needed to deploy many instances of OWA, so clustered deployment was added in this release.  Peter worked with Nimish Gautam here at WMF to make OWA more scalable, with Nimish becoming a committer on OWA. Peter focused on the architecture, while Nimish focused on making sure that all of the work integrated seamlessly into Wikimedia’s environment.

We’ve just deployed OWA for purposes of observing traffic patterns for the fundraiser, and we’ll be reporting on how well it works for us. We’re not using all of the features; for example, we’ve disabled features such as mouse movement recording/playback. We’re being very careful to respect everyone’s privacy and stay true to the WMF donor privacy policy and the Wikimedia privacy policy.

We believe the work we’ve done is generally applicable to anyone who wants MediaWiki analytics, and we’re eager to see how it works for others.  We are also at a point where we would love help with testing this.

Wikipedia’s Volunteer Story

What’s happening to Wikipedia’s volunteer community? Earlier this week, the Wall Street Journal reported that “Volunteers Log Off as Wikipedia Ages”. The article is a comprehensive description of the challenges and opportunities facing the Wikipedia community. Among other things, it describes recent research findings regarding the number of Wikipedia editors. A quote from the article: “In the first three months of 2009, the English-language Wikipedia suffered a net loss of more than 49,000 editors, compared to a net loss of 4,900 during the same period a year earlier, according to Spanish researcher Felipe Ortega.”

Other news stories have further focused on this particular number, some going so far as to predict Wikipedia’s imminent demise, others highlighting its strengths and resilience. It’s understandable that media will look for a compelling narrative. Our job is to arrive at a nuanced understanding of what’s going on. This blog post is therefore an attempt to dig deeper into the numbers and into what’s happening with Wikipedia’s volunteer community, and to describe our big picture strategy.

In a nutshell, here’s what we know:

  • The number of people reading Wikipedia continues to grow.  In October, we had 344 million unique visitors from around the world, according to comScore Media Metrix, up 6% from September.  Wikipedia is the fifth most popular web property in the world.
  • The number of articles in Wikipedia keeps growing.  There are about 14.4 million articles in Wikipedia, with thousands of new ones added every day.
  • The number of people writing Wikipedia peaked about two and a half years ago, declined slightly for a brief period, and has remained stable since then.  Every month, some people stop writing, and every month, they are replaced by new people.

The numbers quoted in the Wall Street Journal are the result of analysis by Spanish researcher Dr. Felipe Ortega. Dr. Ortega has conducted valuable research on a wide range of aspects of the projects hosted by the Wikimedia Foundation.  It is, however, important to understand the meaning of the cited numbers.  Dr. Ortega’s findings are described in his doctoral thesis “Wikipedia: A quantitative analysis.”

First, it’s important to note that Dr. Ortega’s study of editing patterns defines as an editor anyone who has made a single edit, however experimental. This results in a total count of three million editors across all languages.  In our own analytics, we choose to define editors as people who have made at least 5 edits. By our narrower definition, just under a million people can be counted as editors across all languages combined.  Both numbers include both active and inactive editors.  It’s not yet clear how the patterns observed in Dr. Ortega’s analysis could change if focused only on editors who have moved past initial experimentation.
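The effect of the two definitions is easy to see on a toy example. The Python sketch below counts "editors" under a minimum-edit threshold; the function name and the edit counts are invented for illustration and are not drawn from Dr. Ortega's data or from WikiStats.

```python
def count_editors(edit_counts, min_edits=1):
    """Count accounts whose lifetime edit total meets the threshold.

    `edit_counts` is a hypothetical mapping of account name -> total edits.
    """
    return sum(1 for n in edit_counts.values() if n >= min_edits)


# Made-up lifetime edit totals for four accounts.
edit_counts = {"a": 1, "b": 3, "c": 5, "d": 120}

any_edit = count_editors(edit_counts, min_edits=1)    # single-edit definition
five_plus = count_editors(edit_counts, min_edits=5)   # at-least-5-edits definition
print(any_edit, five_plus)
```

The same population yields very different headline numbers depending on the threshold, which is why totals computed under the two definitions cannot be compared directly.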

Even more importantly, the findings reported by the Wall Street Journal are not a measure of the number of people participating in a given month. Rather, they come from the part of Dr. Ortega’s research that attempts to measure when individual Wikipedia volunteers start editing, and when they stop. Because it’s impossible to make a determination that a person has left and will never edit again, there are methodological challenges with determining the long term trend of joining and leaving: Dr. Ortega qualifies as the editor’s “log-off date” the last time they contributed. This is a snapshot in time and doesn’t predict whether the same person will make an edit in the future, nor does it reflect the actual number of active editors in that month.

Dr. Ortega supplements this research with data about the actual participation (number of changes, number of editors) in the different language editions of our projects. His findings regarding actual participation are generally consistent with our own, as well as those of other researchers such as Xerox PARC’s Augmented Social Cognition research group.

What do those numbers show? Studying the number of actual participants in a given month shows that Wikipedia participation as a whole has declined slightly from its peak 2.5 years ago, and has remained stable since then. (See WikiStats data for all Wikipedia languages combined.) On the English Wikipedia, the peak number of active editors (5 edits per month) was 54,510 in March 2007. After a more significant decline of about 25%, it has been stable over the last year at a level of approximately 40,000. (See WikiStats data for the English Wikipedia.) Many other Wikipedia language editions saw a rise in the number of editors in the same time period. As a result, the overall number of editors on all projects combined has been stable at a high level over recent years. We’re continuing to work with Dr. Ortega to better understand the long-term trend in editor retention, and whether this trend may result in a decrease in the number of editors in the future.
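As a quick sanity check on the English Wikipedia figures, the relative decline can be worked out directly. The short Python sketch below uses only the two numbers quoted above; note that the recent level of 40,000 is itself an approximation, so the result is a rough figure.

```python
peak_active_editors = 54510      # March 2007 peak, as quoted above
recent_active_editors = 40000    # approximate recent level, as quoted above

# Relative decline from the peak to the recent level.
decline = (peak_active_editors - recent_active_editors) / peak_active_editors
print(f"Decline from peak: {decline:.1%}")
```

The result is roughly a quarter, in line with the decline described above.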

Let’s move on to the bigger picture.

The mission of the Wikimedia Foundation, a non-profit organization, is to ensure that every single human being can share in the sum of all knowledge. Both the health and growth of our volunteer community are key to succeeding in that endeavor. This is why the Wikimedia Foundation works with researchers from around the world to understand what is happening in its projects, supports comprehensive analytics work, and is pursuing long term initiatives to recruit new editors and support the development of its communities:

  • Our usability initiative is making it easier to contribute to Wikipedia and its sister projects by improving the underlying open source technology. Removing barriers is key to recruiting new editors.
  • Our outreach initiative is developing a comprehensive set of training and outreach materials that will help us to recruit new volunteer editors.
  • Our strategic planning initiative is a unique community-driven process to identify how we can maximize our impact. One of its task forces is specifically studying community health.

Wikimedia chapter organizations around the world are supporting our technology work, our outreach initiatives, and strategic partnerships; their activities are documented in the archive of chapter reports.

The Wikimedia volunteer community is also engaged in important discussions and experiments. A community-initiated project in the English Wikipedia, for example, tried to assess the typical experience of new Wikipedia editors when trying to contribute useful content. This newbie treatment study is directly informing community discussions about community processes. Similar experiments and large strategic discussions are happening in other languages.

These discussions and projects are important. Wikimedia is a unique global volunteer movement to share what we know, to make and keep it available. We need your help and your participation in these initiatives – please follow the above links and get involved.

We want more people to join us, to edit Wikipedia to make it richer and better and more comprehensive. We don’t know what the “perfect” number of Wikipedia volunteers is, but we do know that we want to significantly increase it from where it is today.

In addition to direct volunteer participation, Wikimedia depends on public support. If you share our goal of bringing free knowledge to every person on the planet, please make a donation today.

Erik Moeller, Deputy Director
Erik Zachte, Data Analyst
Wikimedia Foundation

Techie ecosystem: contractors

Wikimedia’s in-house tech staff has always been assisted by a fantastic volunteer infrastructure, from which most of us have been hired over the last few years. That relationship with our community also involves maintaining some important projects via contract positions…

Erik Zachte will be maintaining and improving our site statistics — integrating new page view counts and other valuable data in with the traditional edit stats he’s maintained for some time.

Aaron Schulz is working on Flagged Revisions, improvements to the CheckUser system, and many other tasks on editing and administrative workflow.

David McCabe is coming back to polish up his LiquidThreads project for us, a more flexible way to manage discussion pages which could be a big help especially for those large, ongoing forum-style pages like the Village Pumps.

We also get some great help from David Strauss who’s been getting our fundraising data integrated more solidly into a CiviCRM system, which is replacing the multiple different versions of custom-rolled fundraising databases we’ve gone through in the past.

Brion Vibber

Chief Technical Officer