
Digging for Data: How to Research Beyond Wikimetrics

The next virtual meet-up will showcase research tools that go beyond Wikimetrics. Join us!

For Learning & Evaluation, Wikimetrics is a powerful tool for pulling data about cohorts of wiki project users, such as edit counts, pages created, and bytes added or removed. However, you may still have a variety of other questions, for instance:

How many members of WikiProject Medicine have edited a medicine-related article in the past three months?
How many new editors have played The Wikipedia Adventure?
What are the most-viewed and most-edited articles about Women Scientists?

Questions like these, and many others about the content of Wikimedia projects and the activities of editors and readers, can be answered using tools developed by Wikimedians all over the world. These tools draw on publicly available data through databases and Application Programming Interfaces (APIs), and they are maintained by volunteers and staff within our movement.
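As a small taste of the kind of direct API access the series will cover, here is a minimal sketch that lists the members of a category through the public MediaWiki API. It is only an illustration: the category name is just an example, and the workshops go well beyond a single query.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

# List the pages in a category via the public MediaWiki API.
# "Category:Women scientists" is only an example; substitute your own.
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Women scientists",
    "cmlimit": 500,
    "format": "json",
}

resp = requests.get(API, params=params, timeout=30)
resp.raise_for_status()
for member in resp.json()["query"]["categorymembers"]:
    print(member["title"])
```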

On July 16, Jonathan Morgan, research strategist for the Learning and Evaluation team and a wiki-research veteran, will begin a three-part series exploring some of the different routes to accessing Wikimedia data. Building on several recent workshops, including the Wiki Research Hackathon and a series of Community Data Science Workshops developed at the University of Washington, the Beyond Wikimetrics series will guide participants in expanding their wiki-research capabilities by accessing data directly through these tools.


Making Wikimedia Sites faster

Running the fifth largest website in the world brings its own set of challenges. One particularly important issue is the time it takes to render a page in your browser. Nobody likes slow websites, and we know from research that even small delays lead visitors to leave the site. An ongoing focus of both the Operations and Platform teams is improving the reader experience by making Wikipedia and its sister projects as fast as possible. We ask ourselves questions like: Can we make Wikipedia 20% faster on half the planet?

As you can imagine, the end-user experience differs greatly across our uniquely diverse and global readership. Hence, we need to conduct real user monitoring to truly understand how fast our projects are in real-life situations.

But how do we measure how fast a webpage loads? Last year, we started building instrumentation to collect anonymous timing data from real users, through a MediaWiki extension called NavigationTiming.[1]

There are many factors that determine how fast a page loads, but here we will focus on the effects of network latency on page speed. Latency is the time it takes for a packet to travel from the originating server to the client that made the request.

ULSFO

Earlier this year, our new data center (ULSFO) became fully operational, serving content to Oceania, South-East Asia, and the west coast of North America.[2] The main benefit of this work is shaving up to 70-80 ms of round-trip time for some regions of Oceania, East Asia, the US and Canada, an area with 360 million Internet users and a total population of approximately one billion people.

We recently explained how we chose which areas to serve from the new data center; knowing that the sites became faster for those users was not enough for us: we wanted to know how much faster.

Results

Before we talk about specific results, it is important to understand that faster network round-trip times do not necessarily translate directly into a faster experience for users. When network times are faster, resources are retrieved faster, but many other factors influence page latency. An example may make this clearer: if a page requires four network round trips, and round trips 2, 3 and 4 happen while the browser is still parsing the huge main document fetched in round trip 1, then only the first request's improvement will be visible; the subsequent requests run in parallel and are completely hidden behind the fetching and parsing of the first one. In that scenario, the bottleneck for performance is the parsing of the first resource, not the network time.

With that in mind, we wanted to answer two questions when we analyzed the data from the NavigationTiming extension: How much did our network times improve? And can users feel the effect of faster network times, that is, are pages perceived to be faster, and if so, by how much?

The data we harvest from the NavigationTiming extension is broken down by country. We therefore concentrated our analysis on countries in Asia for which we had sufficient data points; we also included the United States and Canada, although we were not able to extract data just for the western states. Data for the United States and Canada was analyzed at the country level, so the improvements in latency appear “muffled”.

How much did our network times improve?

The short summary is: network times improved quite a bit. For half of all requests, the time to retrieve the main document decreased by up to 70 ms.

ULSFO Improvement of Network times on Wikimedia Sites

In the adjacent graph, the data center rollout is marked with a dashed line. The rollout was gradual, so the gains do not appear immediately, but they become very significant after a few days. The graph includes data for Japan, Korea and the whole South-East Asia region.[3]

We graphed the responseStart − connectStart time, which represents the time spent on the network until the first byte arrives, minus the time spent in DNS lookups. For a more visual explanation, take a look at the Navigation Timing diagram. If a TCP connection is dropped, the time will include the setup of the new connection. All the data we use to measure network improvements is provided by the Navigation Timing API, and is thus not available on IE8 and below.
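For readers who want to poke at similar data themselves, here is a minimal sketch of the kind of aggregation we ran: it computes the median responseStart − connectStart time per country. The event records below are made up, and the real NavigationTiming schema may differ.

```python
import statistics
from collections import defaultdict

# Made-up NavigationTiming-style records; the real event schema may differ.
events = [
    {"country": "JP", "connectStart": 12, "responseStart": 158},
    {"country": "JP", "connectStart": 10, "responseStart": 96},
    {"country": "ID", "connectStart": 15, "responseStart": 210},
    {"country": "ID", "connectStart": 11, "responseStart": 180},
]

# Group responseStart - connectStart (milliseconds) by country and take the
# median, i.e. the 50th percentile discussed in this post.
by_country = defaultdict(list)
for e in events:
    by_country[e["country"]].append(e["responseStart"] - e["connectStart"])

for country, times in sorted(by_country.items()):
    print(country, statistics.median(times), "ms")
```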

User perceived latency

Did the improvement of network times have an impact that our users could see? Well, yes it did. More so for some users than others.

The gains in Japan and Indonesia were remarkable: page load times dropped by up to 300 ms at the weekly 50th percentile. We saw smaller (but measurable) improvements of 40 ms in the US too. However, we were not able to measure the impact in Canada.

The dataset we used to measure these improvements is larger than the one we had for network times. As we mentioned before, the Navigation Timing API is not present in old browsers, so we cannot measure, say, network improvement in IE7. In this case, however, we used a measure of our own creation, called mediaWikiLoadComplete, that tells us when a page is done loading. This measure is taken in all browsers when the page is ready to interact with the user; faster times do mean that the user experience was also faster. Now, how users perceive the improvement has a lot to do with how fast pages were to start with. If a page now takes 700 ms to render instead of one second, any user will be able to see the difference. However, a difference of 300 ms in a 4-second page render will go unnoticed by most.

Reduction in latency

Want to know more?

Want to know all the details? A (very) detailed report of the performance impact of the ULSFO rollout is available.

Next steps

Improving speed is an ongoing concern, particularly as we roll out new features, and we want to make sure that page rendering remains fast. We are keeping our eyes open for new ways of reducing latency, for example by evaluating TCP Fast Open. TCP Fast Open can skip an entire round trip by allowing data to flow from the server to the client before the final acknowledgment of the three-way TCP handshake has completed.
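For the curious, here is a rough, Linux-only sketch of what enabling TCP Fast Open on a listening socket can look like in Python. This is not how our production caches are configured; it only illustrates the socket option involved.

```python
import socket

# Linux-only sketch: socket.TCP_FASTOPEN is defined only where the platform
# supports it, and the kernel must also allow server-side TFO
# (net.ipv4.tcp_fastopen set to 2 or 3).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 8080))
# The option value is the maximum queue of pending Fast Open requests.
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_FASTOPEN, 16)
srv.listen(128)
print("listening with TCP Fast Open enabled")
```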

We are also getting closer to deploying HipHop. HipHop is a virtual machine that compiles PHP bytecode to native instructions at runtime, the same strategy used by Java and C# to achieve their speed advantages. We’re quite confident that this will result in big performance improvements on our sites as well.

We wish you speedy times!

Faidon Liambotis
Ori Livneh
Nuria Ruiz
Diederik van Liere

Notes

  1. The NavigationTiming extension is built on top of the HTML5 API of the same name, which exposes fine-grained measurements from the moment a user submits a request to load a page until the page has been fully loaded.
  2. Countries and provinces served by ULSFO include: Bangladesh, Bhutan, Hong Kong, Indonesia, Japan, Cambodia, Democratic People’s Republic of Korea, Republic of Korea, Myanmar, Mongolia, Macao, Malaysia, Philippines, Singapore, Thailand, Taiwan, Vietnam, US Pacific/West Coast states (Alaska, Arizona, California, Colorado, Hawaii, Idaho, Montana, New Mexico, Nevada, Oregon, Utah, Washington, Wyoming) and Canada’s western provinces and territories (Alberta, British Columbia, Northwest Territories, Yukon Territory).
  3. Countries include: Bangladesh, Bhutan, Hong Kong, Indonesia, Japan, Cambodia, Democratic People’s Republic of Korea, Republic of Korea, Myanmar, Mongolia, Macao, Malaysia, Philippines, Singapore, Thailand, Taiwan, Vietnam.

A Collaborative Definition of Impact: Building Metrics Together

Voting wall at metrics brainstorming session, Berlin 2014.

What do metrics not tell us?

As part of the Wikimedia Conference in Berlin, on Thursday, April 10, members of the WMF Grantmaking department’s Learning and Evaluation team held a brainstorming session on metrics with chapter representatives from around the world. The aim of the session was to start a conversation about what the evaluation metrics piloted in the (beta) Evaluation Reports tell us about our current programs, and what they do not tell us, in terms of program impact.

Sharing evaluation information across the movement helps program leaders all over the world benefit from each other’s know-how and strategies for program design. Evaluation metrics are important tools for making decisions such as how much time and how many resources to invest. Every program has at least one purpose or goal behind it, and having a systematic way to measure results against those goals helps program leaders better tell the story of their programs: what worked, what didn’t, and why or why not.

During the brainstorming session, we worked in two groups, one focused on image-upload-based programs and the other on text-centered programs, to begin answering three big questions:

  • What outcomes and story do the pilot metrics bring forward?
  • Where are there gaps in the story, or what outcomes do the pilot metrics not measure?
  • How else might we measure the outcomes that are not yet included in the story?


What are readers looking for? Wikipedia search data now available

(Update 9/20 17:40 PDT)  It appeared that a small percentage of queries contained information unintentionally inserted by users. For example, some users may have pasted unintended information from their clipboards into the search box, causing the information to be displayed in the datasets. This prompted us to withdraw the files.

We are looking into the feasibility of publishing search logs at an aggregated level, but, until further notice, we do not plan on publishing this data in the near future.

Diederik van Liere, Product Manager Analytics

I am very happy to announce the availability of anonymous search log files for Wikipedia and its sister projects, as of today. Collecting data about search queries is important for at least three reasons:

  1. It provides valuable feedback to our editor community, who can use it to detect topics of interest that are currently insufficiently covered.
  2. We can improve our search index by benchmarking improvements against real queries.
  3. We give outside researchers the opportunity to discover gems in the data.

Peter Youngmeister (Ops team) and Andrew Otto (Analytics team) have worked diligently over the past few weeks to start collecting search queries. Every day from today, we will publish the search queries for the previous day at: http://dumps.wikimedia.org/other/search/ (we expect to have a 3-month rolling window of search data available).

Each line in the log files is tab-separated and contains the following fields:

  1. Server hostname
  2. Timestamp (UTC)
  3. Wikimedia project
  4. URL encoded search query
  5. Total number of results
  6. Lucene score of best match
  7. Interwiki result
  8. Namespace (coded as integer)
  9. Namespace (human-readable)
  10. Title of best matching article

The log files contain queries for all Wikimedia projects and all languages and are unsampled and anonymous. You can download a sample file. We collect data both from the search box on a wiki page, after the visitor submits the query, and from queries submitted on Special:Search pages. The search log data does not contain queries from the autocomplete search functionality, as this generates too much data.
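If you would like to dig into the files, here is a minimal sketch of how a log line could be parsed, assuming the ten tab-separated fields listed above; the field names and the example filename are our own, not an official schema.

```python
import csv
from collections import Counter
from urllib.parse import unquote_plus

FIELDS = [
    "hostname", "timestamp", "project", "query", "num_results",
    "lucene_score", "interwiki_result", "namespace_id",
    "namespace_name", "best_match_title",
]

def read_search_log(path):
    """Yield one dict per line of a tab-separated search log file."""
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            record = dict(zip(FIELDS, row))
            record["query"] = unquote_plus(record["query"])  # URL-encoded in the raw log
            yield record

# Example: the ten most frequent queries that returned zero results.
zero_hits = Counter(
    r["query"] for r in read_search_log("searchlog-sample.tsv")
    if r.get("num_results") == "0"
)
print(zero_hits.most_common(10))
```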

Anonymous means that there is nothing in the data that allows you to map a query to an individual user: there are no IP addresses, no editor names, and not even anonymous tokens in the dataset. We also discard queries that contain email addresses, credit card numbers and social security numbers.
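The exact anonymization rules are not published here; purely as an illustration of the kind of filtering described, a simplified filter could look like the sketch below (the patterns are assumptions, not the production rules).

```python
import re

# Simplified, illustrative patterns; the production filtering rules may differ.
SENSITIVE_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),    # email address
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),      # credit-card-like digit run
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),       # US social security number format
]

def is_publishable(query: str) -> bool:
    """Return False if the query matches any sensitive pattern."""
    return not any(p.search(query) for p in SENSITIVE_PATTERNS)

assert is_publishable("history of the printing press")
assert not is_publishable("my ssn is 123-45-6789")
```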

It’s our hope that people will use this data to build innovative applications that highlight topics Wikipedia is currently not covering, improve our Lucene parser, or uncover other hidden gems within the data. We know that most people use external search engines to search Wikipedia because our own search functionality does not always offer the same accuracy, and the new data could help give it some much-needed TLC. If you’ve got search chops, then have a look at our Lucene external contractor position.

We are making this data available under a CC0 license: this means that you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. But we do appreciate it if you cite us when you use this data source for your research, experimentation or product development.

Finally, please consider joining the Analytics mailing list or #wikimedia-analytics on Freenode (IRC). And of course you’re also very welcome to send me email directly.

Diederik van Liere, Product Manager Analytics

(Update 9/19 20:20 PDT) We’ve temporarily taken down this data to make additional improvements to the anonymization protocol related to the search queries.

Improving the accuracy of the active editors metric

We are making a change to our active editor metric to increase accuracy, by eliminating double-counting and including Wikimedia Commons in the total number of active editors. The active editors metric is a core metric for both the Wikimedia Foundation and the Wikimedia communities and is used to measure the overall health of the different communities. The total number of active editors is defined as:

the number of editors with the same registered username across different Wikimedia projects who made at least 5 edits in countable namespaces in a given month and are not registered as a bot user.

This is a conservative definition, but it helps us assess the size of the core community of contributors who update, add to and maintain Wikimedia’s projects.

The updated metric involves two changes:

  1. The total active editor count now includes Wikimedia Commons (increasing the count).
  2. Editors with the same username on different projects are counted as a single editor (decreasing the count).

The net result of these two changes is a decrease in the total number of active editors averaging 4.4% over the last 3 years.
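To make the de-duplication concrete, here is a minimal sketch of the counting logic, using made-up data. It assumes the five-edit threshold applies per project, as in the per-project active editor counts; the real implementation works against the project databases.

```python
# Made-up per-project monthly counts: (project, username) -> (edits, is_bot).
monthly_edits = {
    ("enwiki", "Alice"): (12, False),
    ("commonswiki", "Alice"): (7, False),   # same username on a second project
    ("dewiki", "Bob"): (5, False),
    ("enwiki", "ExampleBot"): (400, True),  # bots are excluded
}

def total_active_editors(edits, threshold=5):
    """Count distinct non-bot usernames with at least `threshold` edits on
    at least one project; each username is counted once across projects."""
    active = set()
    for (project, user), (count, is_bot) in edits.items():
        if not is_bot and count >= threshold:
            active.add(user)
    return len(active)

print(total_active_editors(monthly_edits))  # 2: Alice counted once, plus Bob
```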

De-duplication of the active editor count only affects our total number of active editors across the different Wikimedia projects; the counts within a single project are unaffected. We’ve also begun work on a data glossary as a canonical reference point for all key metrics used by the Wikimedia Foundation.

Background


Meet the Analytics Team

Over the past few months, the Wikimedia Foundation has been gearing up a variety of new initiatives, and measuring success has been on our minds. It should come as no surprise that we’ve been building an Analytics Team at the same time. We are excited to finally introduce ourselves and talk about our plans.

The team is currently a pair of awesome engineers, David Schoonover and Andrew Otto, a veteran data analyst, Erik Zachte, and one humble product manager, Diederik van Liere. (We happen to be looking for a JavaScript engineer; if beautiful, data-driven client apps are your thing, or you know someone who fits the bill, drop us a line!)

We’ve got quite a few projects under way (and many more ideas), and we’d like to briefly go over them — expect more posts in the future with deeper details on each.

First up: a revamp of the Wikimedia Report Card. This dashboard gives an overview of key metrics representing the health and success of the movement: pageviews, unique visitors, number of active editors, and the like.

Illustration of the revamped Reportcard

The new report card is powered by Limn, a pure JavaScript GUI visualization toolkit we wrote. We wanted non-technical community members to be able to interact with the data directly, visualizing and exploring it themselves, rather than relying on us or analysts to give them a porthole into the deep. We hope that, as a drop-in component, Limn will contribute to democratizing data analysis (though we plan to use it extensively across projects ourselves). So play around with the report card data, or fork the project on GitHub!

Kraken: A Data Services Platform

But we have bigger plans. Epic plans. Mythical plans. A generic computational cluster for data analytics, which we affectionately call Kraken: a unified platform to aggregate, store, analyze, and query all incoming data of interest to the community, built so as to keep pace with our movement’s ample motivation and energy.

How many Android users are there in India that visit more than ten times per month? Is there a significant difference in the popularity of mobile OS’s between large cities and rural areas of India? Do Portuguese and Brazilian readers favour different content categories? How often are GLAM pictures displayed off-site, outside of Wikipedia (and where)?

As it stands, answering any of these questions is, at best, tedious and hard. Usually, it’s impossible. The sheer scale of the Wikimedia projects’ success is a double-edged sword: it makes even modest data analysis a significant task. This is something we aim to fix with Kraken.

More urgently, however, we don’t presently have infrastructure to do A/B testing, measure the impact of outreach projects, or give editors insight into the readers they reach with their contributions. From this view, the platform is a robust, unified toolkit for exploring these data streams, as well as a means of providing everyone with better information for evaluating the success of features large and small.

This points toward our overarching vision. Long-term, we aim to give the Wikimedia movement a true data services platform: a cluster capable of providing realtime insight into community activity and a new view of humanity’s knowledge to power applications, mash up into websites, and stream to devices.

Dream big!

Privacy: Counting not Tracking

The Kraken is a mythical Nordic monster with many tentacles, much like any analytics system: analytics touches everything — from instrumenting mobile apps to new user conversion analysis to counting parser cache lookups — and it needs a big gaping maw to keep up with all the data coming in. Unfortunately, history teaches us that mythical cephalopods aren’t terribly good at privacy. We aim to change that.

We’ve always had a strong commitment to privacy. Everything we store is covered by the Foundation’s privacy policy. Nothing we’re talking about here changes those promises. Kraken will be used to count stuff, not to track user behaviour. But in order to count, we need to store data, and we want you all to have a good idea of what we’re collecting and why; we will be specific and transparent about that. We aim to be able to answer a multitude of questions using different data sources. Counts of visitors, page and image views, search queries, edits and new user registrations are just a few of the data streams currently planned; each will be annotated with metadata to make it easier to query. To take a few more examples: page views will be tagged to indicate which come from bots, and traffic from mobile phones will be tagged as mobile. By counting these different types of events and adding these kinds of meta tags, we will be able to better measure our progress towards the Strategic Plan.
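As a toy illustration of “counting, not tracking”, the sketch below tallies already-anonymized request records by event type and by the kind of metadata tags described above; there are no IPs, usernames or tokens anywhere in the data.

```python
from collections import Counter

# Made-up, already-anonymized request records carrying only metadata tags.
requests = [
    {"type": "pageview", "is_bot": False, "is_mobile": True},
    {"type": "pageview", "is_bot": True,  "is_mobile": False},
    {"type": "search",   "is_bot": False, "is_mobile": False},
    {"type": "pageview", "is_bot": False, "is_mobile": False},
]

counts = Counter(
    (r["type"],
     "bot" if r["is_bot"] else "human",
     "mobile" if r["is_mobile"] else "desktop")
    for r in requests
)

for (event_type, agent, platform), n in sorted(counts.items()):
    print(f"{event_type:9s} {agent:6s} {platform:8s} {n}")
```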

We’ll be talking a lot more about the technical details of the system we’re building, so check back in case you’re interested or reach out to us if you want to provide feedback about how to best use the data to answer lots of interesting questions while still preserving users’ privacy. This post only scratches the surface, but we’ve got lots more to discuss.

Talk to Us!

Sound exciting? Have questions, ideas, or suggestions? Well then! Consider joining the Analytics mailing list or #wikimedia-analytics on Freenode (IRC). And of course you’re also very welcome to send me email directly.

Excited, and have engineering chops? Well then! We’re looking for a stellar engineer to help build a fast, intuitive, and beautiful toolkit for visualizing and understanding all this data. Check out the Javascript/UI Engineer job posting to learn more.

We’re definitely excited about where things are going, and we are looking forward to keeping you all up to speed on all our new developments.

Finally, we are hosting our first Analytics IRC office hours! Join us on July 30th, at 12pm PDT (3pm EDT / 9pm CEST) in #wikimedia-analytics to ask all your analytics and statistics related questions about Wikipedia and the other Wikimedia projects.

Best regards,

David Schoonover, Analytics Engineer
Andrew Otto, Analytics Engineer
Erik Zachte, Data Analyst
Diederik van Liere, Product Manager

US Education Program participants add three times as much quality content as regular new users

Wikipedia Education Program participants from the United States added more than three times as much quality content as regular new users, a quantitative analysis shows.

In the Wikipedia Education Program, professors assign their students to edit Wikipedia articles as a grade for class, assisted by volunteer Wikipedia Ambassadors. In fall 2011, 55 courses participated in the program in the United States, with students editing articles on the English Wikipedia. On average, these students added 1,855 bytes of content that stayed on Wikipedia, compared to only 491 bytes for a randomly chosen sample of new users who joined English Wikipedia in September 2011. These numbers establish that students who participate in the Wikipedia Education Program contribute significantly more quality content that stays on Wikipedia than other new users.

Examining the distribution of content that survived on Wikipedia for both of these groups, we found that almost half of the Wikipedia Education Program participants added 1,000 or more bytes that stayed on Wikipedia in the first six months. In contrast, more than half of the random sample of new editors added no content that stayed on Wikipedia in the first six months. The targeted recruitment of students, combined with the support provided by the Ambassador Program and instructors, results in a much larger percentage of new editors who contribute quality content to Wikipedia.

To understand the collective impact of the Wikipedia Education Program in fall 2011, we compared the amount of content students added to Wikipedia with the content added by the random sample of new editors. The numbers show that the 920 student editors who participated in the program in fall 2011 added the same amount of content as about 2,250 typical new editors (editors are defined as users who made at least one edit to an article). In terms of new content, students have more than twice the impact of typical new editors.

An important consideration for any outreach project is editor retention. Data showed that students who are introduced to editing Wikipedia through the U.S. Education Program are just as likely to continue editing as any other newcomer.

The Wikipedia Education Program has now grown to Egypt, Brazil and other regions beyond North America. With an increased global presence, measuring and understanding the contributions of new student editors (and how they differ from other new users that join Wikipedia) has gained importance. Establishing a common metric for measuring the impact of the Wikipedia Education Program on various Wikipedias is another key motivation for a quantitative study.

There’s a lot more work to be done on measuring the program’s impact. So, stay tuned for more information about these metrics.

Methodology for this research can be found at: http://meta.wikimedia.org/wiki/Research:Wikipedia_Education_Program_evaluation#Methods

Ayush Khanna, Data Analyst, Global Development

(with input from Mani Pande, Head of Global Development Research)

Techies learn, make, win at Foundation’s first San Francisco hackathon


Participants at the San Francisco hackathon in January 2012

In January, 92 participants gathered in San Francisco to learn about Wikimedia technology and to build things in our first Bay Area hackathon.

After a kickoff speech by Foundation VP of Engineering Erik Möller (video), we led tutorials on the MediaWiki web API, customizing wikis with JavaScript user scripts and Gadgets, and building the Wikipedia Android app.  (We recorded each training; click those links for how-to guides and videos.)  We asked the participants to self-organize into teams and work on projects.  After their demonstration showcase, judges awarded a few prizes to the best demos.

(more…)

Do It Yourself Analytics with Wikipedia

As you probably know, we regularly publish backups of the different Wikimedia projects, containing their complete editing history. As time progresses, these backups grow larger and larger and become increasingly hard to analyze. To help the community, researchers and other interested people, we have developed a number of analytic tools to assist you in analyzing these large datasets. Today, we want to update you on these new tools, what they do and where you can find them. And please remember, they are all still in development:

  • Wikihadoop
  • Diffdb
  • WikiPride

Wikihadoop

Wikihadoop makes it possible to run MapReduce jobs with Hadoop on the compressed XML dump files. This means we can easily parallelize the processing of our XML files, so we no longer have to wait days or weeks for a job to finish.

We used Wikihadoop to create the diffs for all edits from the English XML dump that was generated in April of this year.
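To give a feel for the MapReduce style of processing this enables, here is a minimal Hadoop Streaming sketch in Python that counts revisions per contributor. The way the mapper receives XML below is a simplifying assumption; how Wikihadoop actually splits the dumps into map inputs may differ.

```python
#!/usr/bin/env python
# mapper.py: emit "username<TAB>1" for every revision contributor that
# appears in the XML handed to this mapper on stdin.
import re
import sys

USERNAME = re.compile(r"<username>([^<]+)</username>")

for line in sys.stdin:
    for name in USERNAME.findall(line):
        print(f"{name}\t1")
```

```python
#!/usr/bin/env python
# reducer.py: sum the per-username counts (Hadoop sorts the mapper output
# by key before it reaches the reducer).
import sys

current, total = None, 0
for line in sys.stdin:
    name, _, count = line.rstrip("\n").partition("\t")
    if name != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = name, 0
    total += int(count or 0)
if current is not None:
    print(f"{current}\t{total}")
```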

DiffDB

DiffIndexer and DiffSearcher are the two components of the DiffDB. The DiffIndexer takes the diffs generated by Wikihadoop as raw input and creates a Lucene-based index. The DiffSearcher lets you query that index to answer questions such as the following (a small illustrative sketch follows the list):

  • Who has added template X in the last month?
  • Who added more than 2000 characters to user talk pages in 2008?
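DiffSearcher queries the Lucene index directly; purely as an illustration of the kind of question above, the sketch below scans diff records from a JSON-lines file instead. The field names and the file are hypothetical, not the actual DiffDB format.

```python
import json
from datetime import datetime, timedelta

# Hypothetical diff records, one JSON object per line, with fields
# "username", "timestamp" and "added_lines". Real DiffDB records differ.
def editors_who_added(text, path, since_days=30):
    """Return the usernames whose diffs added `text` in the last N days."""
    cutoff = datetime.utcnow() - timedelta(days=since_days)
    editors = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            diff = json.loads(line)
            added = " ".join(diff.get("added_lines", []))
            when = datetime.strptime(diff["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
            if text in added and when >= cutoff:
                editors.add(diff["username"])
    return editors

print(editors_who_added("{{Example template}}", "diffs.jsonl"))
```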

WikiPride

Volume of contributions by registered users on the English Wikipedia until December 2010, colored by account age

Finally, WikiPride allows you to visualize the breakdown of a Wikipedia community by age of account and by the volume of contributed content. You need a Toolserver account to run this, but you will be able to generate cool charts.
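WikiPride itself pulls the real numbers from the Toolserver databases; as a rough idea of the kind of chart it produces, here is a sketch that stacks contribution volume by account-age cohort using made-up data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up monthly contribution volumes (MB) for three account-age cohorts;
# WikiPride computes the real breakdown from the project databases.
rng = np.random.default_rng(0)
months = np.arange(24)
cohorts = {
    "< 1 year":  5 + 2 * rng.random(24),
    "1-3 years": 8 + 2 * rng.random(24),
    "> 3 years": 12 + 2 * rng.random(24),
}

plt.stackplot(months, list(cohorts.values()), labels=list(cohorts.keys()))
plt.xlabel("Month")
plt.ylabel("Contributed content (MB)")
plt.title("Contribution volume by account age (illustrative data)")
plt.legend(loc="upper left")
plt.show()
```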

If you are having trouble getting Wikihadoop to run, please contact me at dvanliere at wikimedia dot org and I will be happy to point you in the right direction! Let the data crunching begin!

Diederik van Liere, Analytics Team

Data analytics at Wikimedia Foundation

This post is a follow-on to my previous post, “What is Platform Engineering?”. In this post, I’ll describe the history of our analytics work, talk about how we derive and distribute our statistics, and ask you to join us in building our platform. Summary: we’re hiring, and we want to tell you what a great opportunity this is.

Our Data Analytics team is responsible for building out our logging and data mining infrastructure, and for making Wikimedia-related statistics useful to other parts of the Foundation and the movement. Until fairly recently, Erik Zachte has been the main analytics person for Wikimedia (with support from many generalists here), working first as a volunteer building stats.wikimedia.org, then on behalf of the Wikimedia Foundation starting in 2008. stats.wikimedia.org started off as a large collection of detailed page view and editor statistics about all Wikimedia wikis, large and small, and has since been augmented with various summary formats and visualizations. As the movement has grown, it has played an increasingly important role in helping guide our investments.