# Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

# Vote for the most exciting paper from nine years of research about Wikipedia

(This is a guest post by Carol Ann O’Hare of Wikimedia France.)

The impact of collaborative writing on the quality of Wikipedia content, new methods for monitoring contributions in order to fight vandalism, how the nature and quality of content depends on contributors’ status and the area covered, etc. These topics concern the Wikimedians who write and use Wikipedia… but also more and more researchers!

By launching an international award for research on Wikimedia projects and free knowledge, Wikimédia France wants to highlight these research works, encourage them and, especially, make them understandable and accessible to the Wikimedia community.

Starting in July, the first step was to ask the community of researchers that study Wikimedia projects to nominate scientific papers that they consider the most influential and important from the years 2003 to 2011. We collected more than 30 proposals, each satisfying the selection criteria: available under open access and published in peer-reviewed publications. Thanks to a high-quality jury, composed of researchers working on these topics, we could select five finalist papers among these. You can find summaries and full texts linked below:

To decide the winner, Wikimédia France wishes to encourage all Wikimedians to give their opinion and vote for the paper that seems the most stimulating and relevant.

Voting will close on Monday, March 11. The announcement of the winning paper is scheduled for the end of March. The authors will receive a grant of €2,500. They can freely allocate this sum, provided it is dedicated to supporting open knowledge research.

Carol Ann O’Hare
Wikimedia France

# Wikimedia Research Newsletter, January 2013

Vol: 3 • Issue: 1 • January 2013

Lessons from the research literature on open collaboration; clicks on featured articles; credibility heuristics

With contributions by: Taha Yasseri, Piotrus, Aaron Shaw, Tbayer and Lui8E

### Lessons from the wiki research literature in “American Behavioral Scientist” special issue

A special issue of the American Behavioral Scientist is devoted to “open collaboration”.

• Consistent patterns found in Wikipedia and other open collaborations: In the introductory piece[1], researchers Andrea Forte and Cliff Lampe give an overview of this field, defined as the study of “distributed, collaborative efforts made possible because of changes in information and communication technology that facilitate cooperative activities” – with open source projects and Wikipedia among the most prominent examples. They point out that “[b]y now, thousands of scholars have written about open collaboration systems, many hundreds of thousands of people have participated in them, and millions of people use products of open collaboration every day.” Among their “lessons from the literature”, they name three “consistent patterns” found by researchers of open collaborations:
• “Participation Is Unequal” (meaning that some participants contribute vastly more than others: “In Wikipedia, for example, it has long been shown that a few editors provide the bulk of contributions to the site.”)
• “There Are Special Requirements for Socializing New Users”
• “Users Are Massively Heterogeneous in Both How and Why They Participate”
• “Ignore All Rules” as “tension release mechanism”: The abstract of a paper titled “Rules and Roles vs. Consensus: Self-Governed Deliberative Mass Collaboration Bureaucracies” [2] explains “Wikipedia’s unusual policy, ignore all rules (IAR)” as a “tension release mechanism” that is “reconciling the tension between individual agency and collective goals” by “[supporting] individual agency when positions taken by participants might conflict with those reflected in established rules. Hypotheses are tested with Wikipedia data regarding individual agency, bureaucratic processes, and IAR invocation during the content exclusion process. Findings indicate that in Wikipedia each utterance matters in deliberations, rules matter in deliberations, and IAR citation magnifies individual influence but also reinforces bureaucracy.”
• Collaboration on articles about breaking news matures more quickly: “Hot Off the Wiki: Structures and Dynamics of Wikipedia’s Coverage of Breaking News Events”[3] analyzes “Wikipedia articles about over 3,000 breaking news events, [investigating] the structure of interactions between editors and articles”, finding that “breaking articles emerge into well-connected collaborations more rapidly than nonbreaking articles, suggesting early contributors play a crucial role in supporting these high-tempo collaborations.” (See also our earlier review of a similarly themed paper by the same team: “High-tempo contributions: Who edits breaking news articles?”)

A fourth paper in this special issue, titled “The Rise and Decline of an Open Collaboration System: How Wikipedia’s Reaction to Popularity Is Causing Its Decline”, received considerable media attention this month, starting with an article in USA Today. It was already reviewed in the September issue of the research report.

# Wikimedia Research Newsletter, December 2012

Vol: 2 • Issue: 12 • December 2012

Wikipedia and Sandy Hook; SOPA blackout reexamined

With contributions by: Daniel Mietchen, Piotrus, Junkie.dolphin, Taha Yasseri, Benjamin Mako Hill, Aaron Shaw, Tbayer, DarTar and Ragesoss

### How Wikipedia deals with a mass shooting

Northeastern University researcher Brian Keegan analyzed the gathering of hundreds of Wikipedians to cover the Sandy Hook Elementary School shooting in the immediate aftermath of the tragedy. The findings are reported in a detailed blog post that was later republished by the Nieman Journalism Lab.[1] Keegan observes that the Sandy Hook shooting article reached a length of 50Kb within 24 hours of its creation, making it the fastest growing article by length in the first day among recent articles covering mass shootings on the English-language Wikipedia. The analysis compares the Sandy Hook page with six similar articles from a list of 43 articles on shooting sprees in the US since 2007. Among the analyses described in the study, of particular interest is the dynamics of dedicated vs occasional contributors as the article reaches maturity: while in the first few hours contributions are evenly distributed with a majority of single-edit editors, after hour 3 or 4 a number of dedicated editors show up and “begin to take a vested interest in the article, which is manifest in the rapid centralization of the article”. A plot of inter-edit time also shows the sustained frequency of revisions that these articles display days after their creation, with Sandy Hook averaging at about 1 edit/minute around 24 hours since its first revision. The notebook and social network data produced by the author for the analysis are available on his website. The Nieman Journalism Lab previously covered the role that Wikipedia is playing as a platform for collaborative journalism, and why its format outperforms Wikinews with an interview of Andrew Lih published in 2010.[2] The early revision history of the Sandy Hook shooting article was also covered in a blog post by Oxford Internet Institute fellow Taha Yasseri, however with a focus on the coverage in different Wikipedia language editions.[3]

### Network positions and contributions to online public goods: the case of the Chinese Wikipedia

A graph with nodes color-coded by betweenness centrality (from red=0 to blue=max).

In a forthcoming paper in the Journal of Management Information Systems (presented earlier at HICSS ’12[4]), Xiaoquan (Michael) Zhang and Chong (Alex) Wang use a natural experiment to demonstrate that changes to the position of individuals within the editor network of a wiki modify their editing behavior. The data for this study came from the Chinese Wikipedia. In October 2005, the Chinese government suddenly blocked access to the Chinese Wikipedia from mainland China, creating an unanticipated decline in the editor population. As a result, the remaining editors found themselves in a new network structure and, the authors claim, any changes in editor behavior that ensued are likely effects of this discontinuous “shock” to the network.

# Wikimedia Research Newsletter, November 2012

Vol: 2 • Issue: 11 • November 2012

Movie success predictions, readability, credentials and authority, geographical comparisons

With contributions by: Piotrus, Benjamin Mako Hill, Tbayer, DarTar, Adler.fa, Hfordsa, Drdee

### Early prediction of movie box-office revenues with Wikipedia data

An open-access preprint[1] has announced the results from a study attempting to predict early box-office revenues from Wikipedia traffic and activity data. The authors – a team of computational social scientists from Budapest University of Technology and Economics, Aalto University and the Central European University – submit that behavioral patterns on Wikipedia can be used for accurate forecasting, matching and in some cases outperforming the use of social media data for predictive modeling. The results, based on a corpus of 312 English Wikipedia articles on movies released in 2010, indicate that the joint editing activity and traffic measures on Wikipedia are strong predictors of box-office revenue for highly successful movies.

The authors contrast their early prediction approach with more popular real-time prediction/monitoring methods, and suggest that movie popularity can be accurately predicted well in advance, up to a month before the release. The study received broad press coverage and was featured in The Guardian, the MIT Technology Review and the Hollywood Reporter, among others. The authors observe that their approach, being “free of any language based analysis, e.g., sentiment analysis, could be easily generalized to non-English speaking movie markets or even other kinds of products”. The dataset used for this study, including the financial and Wikipedia activity data, is available among the supplementary materials of the paper.

### Readability of the English Wikipedia, Simple Wikipedia, and Britannica compared

$
4.71 \left (\frac{\mbox{characters}}{\mbox{words}} \right) + 0.5 \left (\frac{\mbox{words}}{\mbox{sentences}} \right) - 21.43
$

The automated readability index, one of the readability metrics used in the study[2]
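As a rough illustration of how the formula above can be applied, here is a minimal Python sketch. The tokenization rules (what counts as a word, a character or a sentence) are simplifying assumptions made for this example and are not taken from the study, which may have used different conventions.

```python
import re


def automated_readability_index(text: str) -> float:
    """Compute the automated readability index (ARI) for a piece of text."""
    # Assumption: a word is a run of letters, digits or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Assumption: sentences are separated by '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # Assumption: only letters and digits count as characters.
    characters = sum(1 for word in words for ch in word if ch.isalnum())
    return (
        4.71 * (characters / len(words))
        + 0.5 * (len(words) / len(sentences))
        - 21.43
    )


# Six three-letter words in two sentences:
# 4.71 * 3 + 0.5 * 3 - 21.43 = -5.80
score = automated_readability_index("The cat sat. The dog ran.")
```

Higher scores indicate text that is harder to read; very short, simple sentences such as the sample above can yield negative values.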

A study[2] by researchers at Kyoto University presents a detailed assessment of the readability of the English Wikipedia against Encyclopedia Britannica and the Simple English Wikipedia using a series of readability metrics and finds that Wikipedia “seems to lag behind the other encyclopedias in terms of readability and comprehensibility of its content”.

# In divisive times, Wikipedia brings political opponents together

Network of users communicating on Wikipedia article talk pages (Neff et al., p.22). Edges connecting two Democrats are colored blue, edges connecting two Republicans in red, and edges representing inter-party dialogue are colored in green.

Neutral Point of View – the requirement that articles must represent all significant viewpoints fairly – is one of the three core principles that Wikipedia is based on. Many of its readers value it, especially when seeking unbiased information in times of heated political battles.

In the run-up to the US presidential election, a group of six researchers from the University of Southern California and the Barcelona Media Foundation has published the results of a new study[1] showing that “despite the increasing political division of the U.S., there are still areas in which political dialogue is possible and happens” – namely, the talk pages of Wikipedia, where users of both political persuasions debate and collaborate to create encyclopedic coverage of political topics.

The research project–presented earlier this year at the 32nd INSNA (International Network for Social Network Analysis) Sunbelt conference and now documented in preprint form–conducted a quantitative analysis of the interactions of Wikipedia users who had proclaimed a political affiliation on their user page, in terms of the US political system. As the researchers write in the abstract:

“In contrast to previous analyses of other social media, we did not find strong trends indicating a preference to interact with members of the same political party within the Wikipedia community. … It seems that the shared identity of ‘being Wikipedian’ may be strong enough to triumph over other potentially divisive facets of personal identity, such as political affiliation.”

The paper’s title, “Jointly they edit,” was chosen in reference to the well-known phrase “divided we blog” coined in a 2005 paper that referred “to a trend of cyberbalkanization in the political blogosphere, with liberal and conservative blogs tending to link to other blogs with a similar political slant, and not to one another.” A similar divisive trend was found in the retweet networks on Twitter.

As a testament to what can be achieved in a fruitful collaboration between many editors including opposing political persuasions, Wikipedians have brought the articles about both contenders in tomorrow’s presidential election to “featured article” status, representing the highest quality rating on Wikipedia. The article Barack Obama has received more than 22,000 edits since it was started in March 2004, and its information is currently supported by 319 inline references. The article Mitt Romney, begun in January 2004, has been edited over 10,000 times and currently contains 400 inline references.

So no matter who gets the most electoral votes tomorrow, you can trust that many Wikipedians have worked together to ensure that his Wikipedia page will reflect a balanced political perspective.

### Reference

1. Neff J. G., Laniado D., Kappler K., Volkovich Y., Aragon P., Kaltenbrunner A. (2012). Jointly they edit: Examining the impact of community identification on political interaction in Wikipedia. arXiv:1210.6883

Tilman Bayer, Senior Operations Analyst

# Wikimedia Research Newsletter, October 2012

Vol: 2 • Issue: 10 • October 2012

WP governance informal; community as social network; efficiency of recruitment and content production; Rorschach news

With contributions by: Piotrus, Adler.fa, Bdamokos, Ragesoss, Tbayer, and Phoebe

### Wikipedia governance found to be mostly informal

A paper in the Journal of the American Society for Information Science and Technology, coming from the social control perspective and employing the repertory grid technique, has contributed interesting observations about the governance of Wikipedia.[1] The paper begins with a helpful if cursory overview of governance theories, moving towards the governance of open source communities and Wikipedia. That cursory treatment is not foolproof, though: for example, the authors mention “bazaar style governance”, but attribute it incorrectly: rather than the 2006 work they cite, the coining of this term dates to Eric S. Raymond’s 1999 The Cathedral and the Bazaar. The authors interviewed a number of Wikipedians and identified a number of formal and informal governance mechanisms. Only one formal mechanism was found important (the policies), while seven informal mechanisms were deemed important: collaboration among users, discussions on article talk pages, facilitation by experienced users, individuals acting as guardians of the articles, inviting individuals to participate, large numbers of editors, and participation by highly reputable users. Notably, the interviewed editors did not view elements such as administrator involvement, mediation or voting as important.

The paper concludes that “in the everyday practice of content creation, the informal mechanisms appear to be significantly more important than the formal mechanisms”, and notes that this likely means that the formal mechanisms are used much more sparingly than informal ones, most likely only in the small percentage of cases where the informal mechanisms fail to provide an agreeable solution for all the parties. It was stressed that not all editors are equal, and certain editors (and groups) have much more power than others, a fact that is quickly recognized by all editors. The authors note the importance of transparent interactions in spaces like talk pages, and note that “the reported use of interaction channels outside the Wikipedia platform (e.g., e-mail) is a cause for concern, as these channels limit involvement and reduce transparency.” Citing Ostrom’s governance principles, they note that “ensuring participation and transparency is crucial for maintaining the stability of self-governing communities.”

# Wikimedia Research Newsletter, September 2012

Vol: 2 • Issue: 9 • September 2012

“Rise and decline” of Wikipedia participation, new literature overviews, a look back at WikiSym 2012

With contributions by: Piotrus, Phoebe, DarTar, Benjamin Mako Hill, Ragesoss and Tbayer

### “The rise and decline” of the English Wikipedia

A paper to appear in a special issue of American Behavioral Scientist (summarized in the research index) sheds new light on the English Wikipedia’s declining editor growth and retention trends. The paper describes how “several changes that the Wikipedia community made to manage quality and consistency in the face of a massive growth in participation have lead to a more restrictive environment for newcomers”.[1] The number of active Wikipedia editors has been declining since 2007 and research examining data up to September 2009[2] has shown that the root of the problem has been the declining retention of new editors. The authors show this decline is mainly due to a decline among desirable, good-faith newcomers, and point to three factors contributing to the increasingly “restrictive environment” they face.

# Is this thing on? Giving new Wikipedians feedback post-edit

Figure 1. One of the messages used in the test (confirmation).

We recently tested a simple change in the user interface for registered Wikipedia editors. We’re happy to report results from a trial of post-edit feedback that led to an increase in the productivity of newcomers to the project, while still maintaining quality.

#### The problem

The user experience problem was fairly straightforward: Wikipedia fails to tell its new contributors that once you edit an article, your change is live and can be immediately seen by every single reader. Simple, consistent feedback to new contributors makes good sense from a usability standpoint. There is also evidence from the scholarly literature that delivering feedback after successful contributions can help newcomers feel motivated to continue participating.

#### Our first test of a solution

In this test, we examined the effect of a simple confirmation message or a thank you message on new English Wikipedia editors registered between July 30 and August 6. We randomly assigned newcomers to one of these two conditions, or to a control group, and we consistently delivered the same feedback message (or none, for the control group) after every edit for the first week of activity since registration.

The results indicate that receiving feedback upon completion of an edit has a positive effect on the volume of contributions by new editors, without producing any significant side-effect on the quality of their work or whether it was kept in the encyclopedia.

We focused our analysis on a sample of 8,571 new users with at least one edit during the test period, excluding to the best of our knowledge sockpuppets and other categories of spurious accounts. We measured the effects of feedback on the volume of contributions by analyzing the number of edits and edit size per participant in the different groups; we measured the impact of the test on quality by looking at the rate of reverts and blocks per participant in the different groups.

#### Impact on edit volume

Figure 2. Log-scale box plots of edit counts of new users presented with the confirmation message (left), no message (control group, center) or the gratitude message (right) after saving an edit.

We compared the edit count of contributors by condition over the first 2 weeks of activity and found an increase in mean edit count in the two experimental conditions of about 23.5% compared to the control. The difference was marginally significant in the confirmation condition and very close to significance (p=0.052) in the gratitude condition.

We also analyzed the size of contributions by editors in each condition, by measuring edit size as bytes added, bytes removed or net bytes changed. The results indicate that both experimental conditions significantly outperformed the control in net byte count changed per edit. The confirmation condition significantly outperformed the control for positive byte count per edit, while we found a marginally significant effect for gratitude. No significant difference was observed on the negative byte count per edit (or content removal). Therefore, receiving feedback has an effect on the size of contributions by new editors compared to the content added by editors in the control condition.

See our edit volume analysis for more details.

#### Impact on quality

Figure 3. Mean success rate for edits by new users in each condition: Control group (left), confirmation message (center), gratitude message (right)

While feedback may increase the volume of newcomer edits, it might do so at the cost of decreased quality. This is concerning since increasing the number of edits that will need to be reverted represents a burden to the current Wikipedians. To address this concern, we measured the proportion of newcomers who were eventually blocked from editing and the rate at which their contributions were rejected (reverted or deleted).

Analyzing the proportion of newcomers that were blocked since the beginning of the treatment, we found the experimental treatment had no meaningful effect on the rate at which newcomers were blocked from editing – the difference was about 7% for each group, not enough to be declared significant relative to the sample size.

We also examined the “success rate” for each user, measured as the proportion of edits that were not reverted or deleted in the first week since registration. We calculated the mean success rate per newcomer for each experimental condition and found no significant difference between either of the experimental conditions and the control (figure 3).
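To make the metric concrete, here is a minimal Python sketch of a per-user success rate averaged by condition. The input layout (a condition label plus one boolean per edit) and the function names are assumptions made for this example; they do not describe the code or data format actually used in the analysis.

```python
from collections import defaultdict
from statistics import mean


def success_rate(rejected_flags):
    """Proportion of one user's edits that were neither reverted nor deleted.

    `rejected_flags` holds one boolean per edit: True if the edit was rejected.
    """
    kept = sum(1 for rejected in rejected_flags if not rejected)
    return kept / len(rejected_flags)


def mean_success_rate_by_condition(users):
    """Average the per-user success rate within each experimental condition.

    `users` is an iterable of (condition, rejected_flags) pairs (hypothetical layout).
    Users without any edits are skipped, mirroring the "at least one edit" sample.
    """
    rates = defaultdict(list)
    for condition, rejected_flags in users:
        if rejected_flags:
            rates[condition].append(success_rate(rejected_flags))
    return {condition: mean(values) for condition, values in rates.items()}


# Made-up data for illustration only.
sample = [
    ("control", [False, True]),                     # 1 of 2 edits kept
    ("control", [False]),                           # 1 of 1 edits kept
    ("confirmation", [False, False, True, True]),   # 2 of 4 edits kept
]
result = mean_success_rate_by_condition(sample)
# result == {"control": 0.75, "confirmation": 0.5}
```

Averaging per user rather than pooling all edits keeps a handful of very prolific newcomers from dominating the comparison between conditions.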

These results suggest that the experimental treatment had no meaningful effect on the overall quality of newcomer contributions, and therefore on the burden imposed on Wikipedians.

See our newbie quality analysis for more details.

#### What’s next

The results of this first test were promising, and we’re currently working to implement an edit confirmation message for new contributors in the current editing interface, as well as in the upcoming visual editor. However, confirmation messages or messages of gratitude are just two of many different types of feedback that could motivate new contributors.

We’re currently testing the impact of letting people know when they reach milestones based on their cumulative edit count. Some Wikipedias already have community-created service awards based on edit count and tenure, so we’re extending these awards to a newer class of contributor, by letting them know when they’ve completed their first, fifth, 10th, 25th, 50th and 100th edits to the encyclopedia.

If you’re interested in participating in the research and analysis process for tests like these, please chime in and give us your feedback. We’ll be publishing open-licensed data for these experiments, when possible, on our open data repository.

Steven Walling, Associate Product Manager
Dario Taraborelli, Senior Research Analyst
on behalf of the Editor Engagement Experiments team

# What are readers looking for? Wikipedia search data now available

(Update 9/20 17:40 PDT)  It appeared that a small percentage of queries contained information unintentionally inserted by users. For example, some users may have pasted unintended information from their clipboards into the search box, causing the information to be displayed in the datasets. This prompted us to withdraw the files.

We are looking into the feasibility of publishing search logs at an aggregated level, but, until further notice, we do not plan on publishing this data in the near future.

Diederik van Liere, Product Manager Analytics

I am very happy to announce the availability of anonymous search log files for Wikipedia and its sister projects, as of today. Collecting data about search queries is important for at least three reasons:

1. it provides valuable feedback to our editor community, who can use it to detect topics of interest that are currently insufficiently covered.
2. we can improve our search index by benchmarking improvements against real queries.
3. we give outside researchers the opportunity to discover gems in the data.

Peter Youngmeister (Ops team) and Andrew Otto (Analytics team) have worked diligently over the past few weeks to start collecting search queries. Every day from today, we will publish the search queries for the previous day at: http://dumps.wikimedia.org/other/search/ (we expect to have a 3 month rolling window of search data available).

Each line in the log files is tab separated and it contains the following fields:

1. Server hostname
2. Timestamp (UTC)
3. Wikimedia project
4. URL encoded search query
5. Total number of results
6. Lucene score of best match
7. Interwiki result
8. Namespace (coded as integer)
10. Title of best matching article
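A line in this format could be parsed along the following lines. This is only a sketch under stated assumptions: the numeric types, the use of `+`/percent-encoding in the query, and the sample line are all made up for illustration, and the field between the namespace and the title is not described above, so the parser simply reads the title from the last column.

```python
from dataclasses import dataclass
from urllib.parse import unquote_plus


@dataclass
class SearchLogEntry:
    hostname: str
    timestamp: str          # kept as text; the exact timestamp format is not specified
    project: str
    query: str              # URL-decoded search query
    total_results: int      # assumption: integer count
    best_score: float       # assumption: Lucene score is a decimal number
    interwiki_result: str
    namespace: int
    best_title: str


def parse_search_log_line(line: str) -> SearchLogEntry:
    """Split one tab-separated log line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 tab-separated fields, got {len(fields)}")
    return SearchLogEntry(
        hostname=fields[0],
        timestamp=fields[1],
        project=fields[2],
        query=unquote_plus(fields[3]),
        total_results=int(fields[4]),
        best_score=float(fields[5]),
        interwiki_result=fields[6],
        namespace=int(fields[7]),
        best_title=fields[-1],  # title of best match is the last listed field
    )


# Hypothetical sample line, not taken from the real logs.
sample_line = "search1\t2012-09-19T00:00:01\tenwiki\tbarack%20obama\t42\t7.5\t\t0\tMain\tBarack Obama\n"
entry = parse_search_log_line(sample_line)
```

Reading the title from the end of the line rather than from a fixed index keeps the sketch tolerant of the one column that the list above does not name.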

The log files contain queries for all Wikimedia projects and all languages and are unsampled and anonymous. You can download a sample file. We collect data both from the search box on a wiki page after the visitor submits the query, and from queries submitted from Special:Search pages. The search log data does not contain queries from the autocomplete search functionality, because this generates too much data.

Anonymous means that there is nothing in the data that allows you to map a query to an individual user: there are no IP addresses, no editor names, and not even anonymous tokens in the dataset. We also discard queries that contain email addresses, credit card numbers and social security numbers.

It’s our hope that people will use this data to build innovative applications that highlight topics that Wikipedia is currently not covering, improve our Lucene parser or uncover other hidden gems within the data. We know that most people use external search engines to search Wikipedia because our own search functionality does not always give the same accuracy, and the new data could help to give it a little bit of much-needed TLC. If you’ve got search chops then have a look at our Lucene external contractor position.

We are making this data available under a CC0 license: this means that you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. But we do appreciate it if you cite us when you use this data source for your research, experimentation or product development.

Finally, please consider joining the Analytics mailing list or #wikimedia-analytics on Freenode (IRC). And of course you’re also very welcome to send me email directly.

Diederik van Liere, Product Manager Analytics

(Update 9/19 20:20 PDT) We’ve temporarily taken down this data to make additional improvements to the anonymization protocol related to the search queries.

# Improving the accuracy of the active editors metric

We are making a change to our active editor metric to increase accuracy, by eliminating double-counting and including Wikimedia Commons in the total number of active editors. The active editors metric is a core metric for both the Wikimedia Foundation and the Wikimedia communities and is used to measure the overall health of the different communities. The total number of active editors is defined as:

the number of editors with the same registered username across different Wikimedia projects who made at least 5 edits in countable namespaces in a given month and are not registered as a bot user.

This is a conservative definition, but helps us to assess the size of the core community of contributors who update, add to and maintain Wikimedia’s projects.

The de-duplication consists of two changes:

1. The total active editor count now includes Wikimedia Commons (increasing the count).
2. Editors with the same username on different projects are counted as a single editor (decreasing the count).

The net result of these two changes is a decrease in the number of total active editors, averaging 4.4% over the last 3 years.
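The effect of the two changes can be sketched in a few lines of Python. This is an illustrative model, not the production code behind the metric: it assumes the input already lists, per project, the usernames that met the 5-edit threshold in a given month (bots excluded), that the previous total was a plain sum over projects without Commons, and that Commons is keyed as `commonswiki` (a hypothetical name).

```python
def active_editor_totals(active_by_project, commons_key="commonswiki"):
    """Return (old_total, new_total) for one month.

    `active_by_project` maps a project name to the set of usernames that
    qualified as active editors on that project.
    """
    # Assumed old method: sum per-project counts (double-counts editors who
    # are active on several projects) and leave out Commons.
    old_total = sum(
        len(editors)
        for project, editors in active_by_project.items()
        if project != commons_key
    )
    # New method: include Commons and count each username once.
    new_total = len(set().union(*active_by_project.values()))
    return old_total, new_total


# Made-up usernames for illustration only.
month = {
    "enwiki": {"Alice", "Bob"},
    "dewiki": {"Bob", "Carol"},
    "commonswiki": {"Alice"},
}
old_total, new_total = active_editor_totals(month)
# old_total == 4 (Bob counted twice), new_total == 3 (Alice, Bob, Carol)
```

In this toy example, adding Commons contributes no new editor because Alice is already counted, while merging Bob's two accounts lowers the total, which is the same direction as the net decrease reported above.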

De-duplication of the active editor count only affects our total number of active editors across the different Wikimedia projects; the counts within a single project are unaffected. We’ve also begun work on a data glossary as a canonical reference point for all key metrics used by the Wikimedia Foundation.