Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

Posts Tagged ‘data’

Digging for Data: How to Research Beyond Wikimetrics

The next virtual meet-up will introduce research tools. Join us!

For Learning & Evaluation, Wikimetrics is a powerful tool for pulling data for wiki project user cohorts, such as edit counts, pages created and bytes added or removed. However, you may still have a variety of other questions, for instance:

How many members of WikiProject Medicine have edited a medicine-related article in the past three months?
How many new editors have played The Wikipedia Adventure?
What are the most-viewed and most-edited articles about Women Scientists?

Questions like these and many others regarding the content of Wikimedia projects and the activities of editors and readers can be answered using tools developed by Wikimedians all over the world. These gadgets, based on publicly available data, rely on databases and Application Programming Interfaces (APIs). They are maintained by volunteers and staff within our movement.

On July 16, Jonathan Morgan, research strategist for the Learning and Evaluation team and wiki-research veteran, will begin a three-part series exploring some of the different routes to accessing Wikimedia data. Building on several recent workshops, including the Wiki Research Hackathon and a series of Community Data Science Workshops developed at the University of Washington, Beyond Wikimetrics will show participants how to expand their wiki-research capabilities by accessing data directly through these tools.


What are readers looking for? Wikipedia search data now available

(Update 9/20 17:40 PDT)  It appeared that a small percentage of queries contained information unintentionally inserted by users. For example, some users may have pasted unintended information from their clipboards into the search box, causing the information to be displayed in the datasets. This prompted us to withdraw the files.

We are looking into the feasibility of publishing search logs at an aggregated level, but we do not plan to publish this data until further notice.

Diederik van Liere, Product Manager Analytics

I am very happy to announce the availability of anonymous search log files for Wikipedia and its sister projects, as of today. Collecting data about search queries is important for at least three reasons:

  1. it provides valuable feedback to our editor community, who can use it to detect topics of interest that are currently insufficiently covered.
  2. we can improve our search index by benchmarking improvements against real queries.
  3. we give outside researchers the opportunity to discover gems in the data.

Peter Youngmeister (Ops team) and Andrew Otto (Analytics team) have worked diligently over the past few weeks to start collecting search queries. Every day from today, we will publish the search queries for the previous day at: (we expect to have a 3 month rolling window of search data available).

Each line in the log files is tab-separated and contains the following fields:

  1. Server hostname
  2. Timestamp (UTC)
  3. Wikimedia project
  4. URL encoded search query
  5. Total number of results
  6. Lucene score of best match
  7. Interwiki result
  8. Namespace (coded as integer)
  9. Namespace (human-readable)
  10. Title of best matching article
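
As a rough illustration of the format above, here is a minimal Python sketch that splits one log line into the ten listed fields. The names `SearchLogEntry` and `parse_line` are invented for this example and are not part of any published tool, and the sample line is made up. All values are kept as strings because the exact formats are not specified here; treating `+` in the query as a space is likewise an assumption.

```python
from typing import NamedTuple
from urllib.parse import unquote_plus


class SearchLogEntry(NamedTuple):
    """One log line, with fields in the order listed above (all kept as text)."""
    hostname: str
    timestamp: str
    project: str
    query: str
    total_results: str
    best_score: str
    interwiki_result: str
    namespace_id: str
    namespace_name: str
    best_title: str


def parse_line(line: str) -> SearchLogEntry:
    """Split a tab-separated log line and URL-decode the search query."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    # Field 4 is the URL-encoded query; decoding "+" as a space is an assumption.
    fields[3] = unquote_plus(fields[3])
    return SearchLogEntry(*fields)


# Hypothetical sample line, invented purely for illustration.
SAMPLE_LINE = (
    "search1\t2012-09-19 00:00:00\tenwiki\tmarie%20curie\t"
    "12\t1.0\t\t0\tMain\tMarie Curie\n"
)
entry = parse_line(SAMPLE_LINE)
```

With the made-up line above, `entry.query` holds the decoded text `marie curie`. A line with the wrong number of fields raises a `ValueError` rather than being silently misread.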

The log files contain queries for all Wikimedia projects and all languages and are unsampled and anonymous. You can download a sample file. We collect data both from the search box on a wiki page, after the visitor submits the query, and from queries submitted from Special:Search pages. The search log data does not contain queries from the autocomplete search functionality, because this generates too much data.

Anonymous means that there is nothing in the data that allows you to map a query to an individual user: there are no IP addresses, no editor names, and not even anonymous tokens in the dataset. We also discard queries that contain email addresses, credit card numbers and social security numbers.
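
To give a sense of what discarding such queries can involve, here is a minimal Python sketch of a pattern-based filter. It is not the filter actually used for these logs, whose details are not described here. The regular expressions, the US-style social security number format, and the name `looks_sensitive` are all assumptions made for illustration.

```python
import re

# Illustrative patterns only; real-world filters need to be far more thorough.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")        # assumed US format
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")    # 13 to 16 digits

SENSITIVE_PATTERNS = (EMAIL_RE, SSN_RE, CARD_RE)


def looks_sensitive(query: str) -> bool:
    """Return True if the query matches any of the illustrative patterns."""
    return any(pattern.search(query) for pattern in SENSITIVE_PATTERNS)


def keep_safe_queries(queries):
    """Yield only the queries that match none of the patterns."""
    for query in queries:
        if not looks_sensitive(query):
            yield query
```

Pattern matching of this kind only catches well-formed identifiers. Free text pasted into the search box by accident would pass straight through it.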

It’s our hope that people will use this data to build innovative applications that highlight topics that Wikipedia is currently not covering, improve our Lucene parser or uncover other hidden gems within the data. We know that most people use external search engines to search Wikipedia because our own search functionality does not always give the same accuracy, and the new data could help to give it a little bit of much-needed TLC. If you’ve got search chops then have a look at our Lucene external contractor position.

We are making this data available under a CC0 license: this means that you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. But we do appreciate it if you cite us when you use this data source for your research, experimentation or product development.

Finally, please consider joining the Analytics mailing list or #wikimedia-analytics on Freenode (IRC). And of course you’re also very welcome to send me email directly.

Diederik van Liere, Product Manager Analytics

(Update 9/19 20:20 PDT) We’ve temporarily taken down this data to make additional improvements to the anonymization protocol related to the search queries.

You have new messages: improving communication on Wikipedia

Every month, hundreds of thousands of people press the edit button on Wikipedia for the very first time. And for many of these new users, the first (and sometimes only) message that appears on their user talk page is a template rather than a human response. This is especially true on our larger, older projects.

User talk page templates were developed by the community because of the tremendous volume of contributions that began pouring in as Wikipedia grew more and more popular. Today, with the focus of our movement shifting to openness and attracting new editors, it’s time to rethink the message we’re sending via templates.

That’s why Steven Walling and I have started a project to A/B test many of the template messages received by new users, such as warnings and deletion notices. In collaboration with over 20 members of the Wikimedia community, so far from the English and Portuguese Wikipedias, we’ve designed a number of experiments that will give us tangible data to improve communication on the projects.

How it works

With the help of tools developed by our summer researchers, the different messages we want to test are randomly delivered to different groups of users. By tracking the data from these groups, we can assess the efficacy of different kinds of messages, based on whether users continue to edit constructively after receiving them.
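
The general idea of random delivery and comparison can be sketched in a few lines of Python. This is not the tooling our researchers built; the function names, the variant labels, and the notion of a "retained" user set are hypothetical stand-ins for illustration.

```python
import random
from collections import Counter


def assign_variants(usernames, variants, seed=None):
    """Randomly pick one message variant for each user.

    Passing a seed makes the assignment reproducible.
    """
    rng = random.Random(seed)
    return {name: rng.choice(variants) for name in usernames}


def retention_by_variant(assignment, retained_users):
    """Share of users per variant who kept editing constructively.

    `retained_users` is a set of usernames judged to have continued
    editing; how that judgment is made is outside this sketch.
    """
    totals = Counter(assignment.values())
    kept = Counter(
        variant for user, variant in assignment.items() if user in retained_users
    )
    return {variant: kept[variant] / totals[variant] for variant in totals}
```

For instance, if two users received a hypothetical "standard" template and one of them kept editing, `retention_by_variant` reports 0.5 for that variant. Whether a difference between variants is meaningful still depends on sample size and a proper statistical test.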

Our working hypothesis, which we are continuing to test and refine, is that making templates more personal will help retain the good-faith editors who receive them, while continuing to deter vandals, spammers, and other bad-faith editors. For both groups, showing them that the encyclopedia is built through the hard work of other people like them is key.

What you can do

There are thousands of different user talk page templates on Wikimedia projects. We need your help to construct and carry out more tests, especially in non-English communities!

Please visit our task force page on English Wikipedia or our interlanguage hub on Meta and sign up. You can add your project to the list if you’re interested in starting new tests.

This is the first time that the Wikimedia Foundation has devoted resources to helping test and improve the template infrastructure the community uses every day to function. We hope that together, we can significantly improve the way Wikimedia projects communicate with editors.

Thank you,
Maryana Pinchuk and Steven Walling

How much do new editors actually improve Wikipedia?

Does a constant stream of new editors really make Wikipedia better? Increasing participation is one of the top five priorities in our strategic plan. But when we talk about retention of newly registered editors, some readers and experienced editors rightfully wonder exactly how many edits by newbies actually improve the free encyclopedia.

In the Community Department, we’re facilitating the WikiGuides pilot program on the English Wikipedia to reach out to new contributors and mentor them. To do that successfully, we must quickly identify which new editors are actually doing good work.

So one of our working questions is: How many contributions by new editors are made in good faith and are worth retaining or improving?

We took a randomly selected batch of 155 newly registered users on the English Wikipedia who made at least one edit in mid-April of this year. We looked at each user's first edit and rated it on a 1-5 scale, with 1 being pure vandalism and 5 being an excellent edit, meaning it adds a significant chunk of verified, encyclopedic content and would be indistinguishable from the work of a very experienced editor. Here’s what that composition looks like:

So you can see that even with a very high standard for quality — we only handed out a single “5” edit — most new editors made contributions worth retaining in some way, even if they weren’t perfect. More than half of these first edits needed no reworking to be acceptable based on current Wikipedia policy. Another 19% made good faith edits but needed additional help to meet standards defined in policy or guideline.

In order to investigate whether this has changed over time, we took a similar cohort from the same period in April 2004 and made the same qualitative assessment.

The key thing to note in comparing the two samples is that the percentage of acceptable edits made by newbies did not dramatically decrease from 2004 to 2011. That’s despite the fact that the bar for quality has been raised over time, and that there are arguably fewer obvious contributions to make now that Wikipedia has grown by millions of articles.

Another relevant fact to consider is that while both cohorts are of 155 new editors, it took several days for that many new editors to join Wikipedia in 2004. In 2011, our sample is a tiny slice of the new editors arriving every month. For example: on Monday of this week more than 1,800 editors joined English Wikipedia and made at least one edit. On the equivalent day in 2004 there were only about 60.

Our sample strongly suggests that thousands of new editors still join Wikipedia every month with valuable contributions to make. Ensuring that we welcome these newcomers and show them the ropes is a top priority for ensuring Wikipedia’s continued success in our second decade.

(This is the first in what will be a new series of blog posts coming out of the Community Department at the Wikimedia Foundation. Starting now and continuing through the summer, we will be sharing the questions, experiments, and fresh data that currently drive our work. While you’ll get an inside look at what we’re doing, our numbers and analysis are still evolving and should be taken with a grain of salt.)

Steven Walling
Wikimedia Foundation Fellow, on behalf of the Community Dept. – especially Philippe Beaudette, James Alexander, and Maryana Pinchuk.