# Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

# Wikimedia Research Newsletter, April 2013

Vol: 3 • Issue: 4 • April 2013

Sentiment monitoring; Wikipedians and academics favor the same papers; UNESCO and systemic bias; How ideas flow on Wikiversity

With contributions by: Piotr Konieczny, Oren Bochman, Taha Yasseri, Jonathan T. Morgan and Tilman Bayer

### Too good to be true? Detecting COI, Attacks and Neutrality using Sentiment Analysis

Traditional methods for detecting sentiment are less objective

Finn Årup Nielsen, Michael Etter and Lars Kai Hansen presented a technical report[1] on an online service that they created for real-time monitoring of Wikipedia articles about companies. It performs sentiment analysis of edits, filtered by company and editor. Sentiment analysis is a new applied linguistics technology used in tasks ranging from author profiling to detecting fake reviews on online retail sites. The visualization provided by this tool makes it easy to spot deviations from linguistic neutrality. However, as the authors point out, this analysis only gives a robust picture when used statistically and is more prone to mistakes when operating within a limited scope.

The service monitors recent changes using an IRC stream and detects company-related articles from a small hand-built list. It then retrieves the current version using the MediaWiki API and performs sentiment analysis using the AFINN sentiment-annotated word list. The project was developed by integrating a number of open source components such as NLTK and CouchDB. Unfortunately, the source code has not been made available and the service can only run queries on the shortlisted companies, which will limit the impact of this report on future Wikipedia research. However, it seems to have potential as a tool for detecting COI edits, which tend to tip neutrality either by adding excess praise or by adding attacks that push the content in the other direction. We hope the researchers will open-source this tool like their prior work on the AFINN dataset, or at least provide some UI to query articles not included in the original research.
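As a rough illustration of the word-list approach described above, the following sketch scores a text by summing the valences of the words it contains, then compares two revisions of an article. It is not the authors' code, which has not been released. The `LEXICON` entries and their values are made up for illustration and are not taken from the AFINN list, and the function names are hypothetical.

```python
import re

# Hypothetical miniature lexicon in the style of a valence word list.
# These words and scores are illustrative only, not real AFINN entries.
LEXICON = {
    "excellent": 3,
    "award": 3,
    "innovative": 2,
    "scandal": -3,
    "fraud": -4,
}


def score_text(text, lexicon=LEXICON):
    """Sum the valence of every known word in the text (unknown words count as 0)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def edit_sentiment_delta(old_text, new_text, lexicon=LEXICON):
    """Sentiment change introduced by an edit: positive means more praise."""
    return score_text(new_text, lexicon) - score_text(old_text, lexicon)


if __name__ == "__main__":
    before = "The company sells software."
    after = "The excellent, award-winning company sells software."
    print(edit_sentiment_delta(before, after))
```

A single edit scored this way is noisy, which matches the authors' caveat that the analysis is only robust when aggregated over many edits.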

### “A Comparative Study of Academic impact and Wikipedia Ranking”

A paper[2] with this title investigates the relation between the scientific reputation of scientific items (authors, papers, and keywords) and the impact of the same items on Wikipedia articles.

# Wikimedia Research Newsletter, March 2013

Vol: 3 • Issue: 3 • March 2013

“Ignore all rules” in deletions; anonymity and groupthink; how readers react when shown talk pages

With contributions by: Amir E. Aharoni, Piotr Konieczny, Taha Yasseri, Oren Bochman, Heather Ford, Tilman Bayer, Giovanni Luca Ciampaglia, Daniel Mietchen.

### Wikipedia’s “Ignore all rules” policy (IAR) is a double-edged sword in deletion arguments

A beetle larva ignoring the rules while negotiating deletion with the frog.[mediasource 1]

A paper presented at last month’s CSCW Conference, titled “Keeping eyes on the prize: officially sanctioned rule breaking in mass collaboration systems”,[1] observes that “Mass collaboration systems are often characterized as unstructured organizations lacking rule and order”, yet Wikipedia has a well-developed body of policies to support it as an organization. Rule breaking in bureaucracies is a slippery slope that can quickly lead to potentially dangerous exceptions, so Wikipedia has a mechanism called “Ignore all rules” (WP:IAR) for officially sanctioned rule breaking. The researchers examined IAR’s impact within the scope of deletion requests. The results show that the IAR policy has a meaningful influence on deliberation outcomes and that, rather than wreaking havoc, it provides a positive, functional governance mechanism.

This paper is another welcome addition to the growing literature on AfD, examining the effectiveness of rule breaking using WP:IAR within these discussions. It starts with an in-depth examination of rule breaking within collaborative environments. The following six hypotheses are then postulated:

1. Invoking WP:IAR in support of a vote correlates with an increased likelihood that the vote will be on the winning side.
2. This effect is expected to increase with the number of policies cited in the deletion proposal (since they may be contradicting each other).
3. Invoking IAR to override the deletion proposal’s policy citation tends to reduce the proposal’s likelihood of success.
4. When IAR is used together with another policy domain (e.g. Content/Conduct/Legal) as the proposal’s rationale, it will negate the proposal’s success.
5. Increased dissonance between policies arising in the discussion will increase the chance that the IAR argument will be successful.
6. IAR will increase in effectiveness as the policies invoked increase in complexity.

# Wikimedia Research Newsletter, February 2013

Vol: 3 • Issue: 2 • February 2013

Wikipedia not so novel after all, except to UK university lecturers; EPOV instead of NPOV

With contributions by: Piotr Konieczny, Taha Yasseri, Heather Ford, Sage Ross, Daniel Mietchen and Tilman Bayer.

### Wikipedia in historic context: “Stigmergic accumulation” is not new

Page with the entry Encyclopédie from Diderot and D’Alembert’s Encyclopédie. The work was the result of the collaboration of more than 100 contributors.

“Wikipedia and Encyclopedic Production”[1] by Jeff Loveland (a historian of encyclopedias) and Joseph Reagle situates Wikipedia within the context of encyclopedic production historically, arguing that the features that many claim to be unique about Wikipedia actually have roots in encyclopedias of the past. Loveland and Reagle criticize characterizations of Wikipedia that they believe to be ahistorical and exaggerated, laying special blame with authors who compare Wikipedia’s anonymous production to Encyclopedia Britannica’s production by named experts, and thus ignore the rich tradition of encyclopedic production through the centuries. The authors then set about characterizing the history of encyclopedic production as composed of three overlapping forms: compulsive collection, stigmergic accumulation, and corporate production.

‘Compulsive collection’ refers to the work of compiling encyclopedias that has traditionally been done by a few dedicated, tireless, detail-oriented individuals. Loveland and Reagle point out that, although Wikipedians share this compulsive behavior with past encyclopedists, the crucial distinction lies in the fact that the vast majority were motivated by money (even if this motive existed alongside more idealistic motivations) whereas Wikipedia editors are unpaid.

Loveland and Reagle use the term ‘stigmergic accumulation’ to refer to the process of production by accretion onto a previous text. Even those responsible for a singly authored encyclopedia were relying on predecessors, the authors argue, ‘building on their work and using the cumulative character of texts and knowledge as a ladder of sorts’. Examples of existing texts included the use of a previous edition of an encyclopedia that ran into multiple editions, and the practice of borrowing between different encyclopedias that was sometimes illegal but more often viewed as ‘piratical’ i.e. morally wrong.

The category of ‘corporate production’ is used by Loveland and Reagle to describe the process of encyclopedic editing by a group – groups that topped a thousand contributors in the 20th century. The editors of early encyclopedias like Diderot and D’Alembert’s Encyclopédie in the 1700s faced the challenge of coordinating the work of about 140 contributors and, much like Wikipedia, had to confront issues of consistency that result in debates about how important a subject must be to merit an article. In contrast to other encyclopedias, write Loveland and Reagle, Wikipedia settles these debates through community decision-making and in the open. The authors also note that previous encyclopedias didn’t always recruit on the basis of expertise and that some recognized that it would be cheaper and sometimes more accurate to have non-experts summarizing the works of experts.

# A year’s worth of Wikipedia research

Twelve years after its launch, Wikipedia continues to attract a large amount of attention from scholarly research trying to understand what made this one of the most remarkable collaborative efforts in history and what makes it work. Researchers have called Wikipedia “our Everest” (because of its complexity and cultural importance) or “the Drosophila (fruit fly) of social software” (because the project’s transparency and freely available data make it accessible and popular as a research subject).

Download the complete Volume 2 (PDF)

In 2011, we launched a monthly Wikimedia Research Newsletter with the aim of covering recent academic research about Wikipedia and other Wikimedia projects. Published jointly by the Wikimedia Research Committee and the Signpost (the English Wikipedia’s community-edited newspaper), it has established itself as a comprehensive outlet enabling both researchers and Wikipedians to stay on top of current research, aiming to facilitate exchange between these two communities.

Today we are announcing the release in the public domain of a curated corpus containing the bibliographic references of all 225 publications reviewed or covered in the second volume of the newsletter, forming a historical record of Wikipedia research in the year 2012. This corpus can be browsed online or downloaded, ready to be imported into reference managers or other literature collections. Papers in this dataset have been marked as either open access or closed access.

Last year, we published a similar dataset for volume 1 (2011). Together, these releases complement other efforts to catalogue the research literature on Wikipedia, in particular the WikiLit project which focuses on publications until June 2011, prior to the launch of the newsletter.

Follow @WikiResearch for fresh Wikimedia research news

A year ago we launched the @WikiResearch news feed on Twitter and Identi.ca, covering new preprints, papers or research-related blog posts, before they are reviewed more fully in the Newsletter. As of February 2013, it has gained 745 followers and continues to be actively updated.

We also started offering the newsletter in the form of an HTML email newsletter (in addition to the announcements of each new issue on the Wikiresearch-l mailing list, which only contain the table of contents). This experiment proved successful, too, with almost 100 subscribers to date (adding to the thousands of pageviews each issue receives when published as part of the Signpost, on Meta-wiki and on this blog). You can sign up to receive a copy of each new issue in your inbox as soon as it comes out.

The Newsletter is a collaborative effort and would not exist without the following 22 people who contributed reviews and summaries in 2012:

More than half of our contributors are researchers themselves, who have published about Wikipedia in peer-reviewed publications. We are also grateful for the help of several Signpost collaborators in copyediting and preparing the final publication every month.

Finally, thanks to everyone for reading the Wikimedia Research Newsletter, and please consider contributing by pointing us to new research we should cover, or by volunteering to review new publications.

The editors of the Wikimedia Research Newsletter:

Tilman Bayer, Senior Operations Analyst
Dario Taraborelli, Senior Research Analyst

# Wikimedia Research Newsletter, January 2013

Vol: 3 • Issue: 1 • January 2013

Lessons from the research literature on open collaboration; clicks on featured articles; credibility heuristics

With contributions by: Taha Yasseri, Piotrus, Aaron Shaw, Tbayer and Lui8E

### Lessons from the wiki research literature in “American Behavioral Scientist” special issue

A special issue of the American Behavioral Scientist is devoted to “open collaboration”.

• Consistent patterns found in Wikipedia and other open collaborations: In the introductory piece[1], researchers Andrea Forte and Cliff Lampe give an overview of this field, defined as the study of “distributed, collaborative efforts made possible because of changes in information and communication technology that facilitate cooperative activities” – with open source projects and Wikipedia among the most prominent examples. They point out that “[b]y now, thousands of scholars have written about open collaboration systems, many hundreds of thousands of people have participated in them, and millions of people use products of open collaboration every day.” Among their “lessons from the literature”, they name three “consistent patterns” found by researchers of open collaborations:
  • “Participation Is Unequal” (meaning that some participants contribute vastly more than others: “In Wikipedia, for example, it has long been shown that a few editors provide the bulk of contributions to the site.”)
  • “There Are Special Requirements for Socializing New Users”
  • “Users Are Massively Heterogeneous in Both How and Why They Participate”
• “Ignore All Rules” as “tension release mechanism”: The abstract of a paper titled “Rules and Roles vs. Consensus: Self-Governed Deliberative Mass Collaboration Bureaucracies”[2] explains “Wikipedia’s unusual policy, ignore all rules (IAR)” as a “tension release mechanism” that is “reconciling the tension between individual agency and collective goals” by “[supporting] individual agency when positions taken by participants might conflict with those reflected in established rules. Hypotheses are tested with Wikipedia data regarding individual agency, bureaucratic processes, and IAR invocation during the content exclusion process. Findings indicate that in Wikipedia each utterance matters in deliberations, rules matter in deliberations, and IAR citation magnifies individual influence but also reinforces bureaucracy.”
• Collaboration on articles about breaking news matures more quickly: “Hot Off the Wiki: Structures and Dynamics of Wikipedia’s Coverage of Breaking News Events”[3] analyzes “Wikipedia articles about over 3,000 breaking news events, [investigating] the structure of interactions between editors and articles”, finding that “breaking articles emerge into well-connected collaborations more rapidly than nonbreaking articles, suggesting early contributors play a crucial role in supporting these high-tempo collaborations.” (see also our earlier review of a similarly-themed paper by the same team: “High-tempo contributions: Who edits breaking news articles?”)

A fourth paper in this special issue, titled “The Rise and Decline of an Open Collaboration System: How Wikipedia’s Reaction to Popularity Is Causing Its Decline”, received considerable media attention this month, starting with an article in USA Today. It was already reviewed in the September issue of this newsletter.

# Wikimedia Research Newsletter, December 2012

Vol: 2 • Issue: 12 • December 2012

Wikipedia and Sandy Hook; SOPA blackout reexamined

With contributions by: Daniel Mietchen, Piotrus, Junkie.dolphin, Taha Yasseri, Benjamin Mako Hill, Aaron Shaw, Tbayer, DarTar and Ragesoss

### How Wikipedia deals with a mass shooting

Northeastern University researcher Brian Keegan analyzed the gathering of hundreds of Wikipedians to cover the Sandy Hook Elementary School shooting in the immediate aftermath of the tragedy. The findings are reported in a detailed blog post that was later republished by the Nieman Journalism Lab.[1] Keegan observes that the Sandy Hook shooting article reached a length of 50Kb within 24 hours of its creation, making it the fastest growing article by length in the first day among recent articles covering mass shootings on the English-language Wikipedia. The analysis compares the Sandy Hook page with six similar articles from a list of 43 articles on shooting sprees in the US since 2007. Among the analyses described in the study, of particular interest are the dynamics of dedicated vs occasional contributors as the article reaches maturity: while in the first few hours contributions are evenly distributed with a majority of single-edit editors, after hour 3 or 4 a number of dedicated editors show up and “begin to take a vested interest in the article, which is manifest in the rapid centralization of the article”. A plot of inter-edit time also shows the sustained frequency of revisions that these articles display days after their creation, with Sandy Hook averaging about 1 edit/minute around 24 hours after its first revision. The notebook and social network data produced by the author for the analysis are available on his website. The Nieman Journalism Lab previously covered the role that Wikipedia is playing as a platform for collaborative journalism, and why its format outperforms Wikinews, in an interview with Andrew Lih published in 2010.[2] The early revision history of the Sandy Hook shooting article was also covered in a blog post by Oxford Internet Institute fellow Taha Yasseri, though with a focus on the coverage in different Wikipedia language editions.[3]

### Network positions and contributions to online public goods: the case of the Chinese Wikipedia

A graph with nodes color-coded by betweenness centrality (from red=0 to blue=max).

In a forthcoming paper in the Journal of Management Information Systems (presented earlier at HICSS ’12[4]), Xiaoquan (Michael) Zhang and Chong (Alex) Wang use a natural experiment to demonstrate that changes to the position of individuals within the editor network of a wiki modify their editing behavior. The data for this study came from the Chinese Wikipedia. In October 2005, the Chinese government suddenly blocked access to the Chinese Wikipedia from mainland China, creating an unanticipated decline in the editor population. As a result, the remaining editors found themselves in a new network structure and, the authors claim, any changes in editor behavior that ensued are likely effects of this discontinuous “shock” to the network.

# Wikimedia Research Newsletter, November 2012

Vol: 2 • Issue: 11 • November 2012

Movie success predictions, readability, credentials and authority, geographical comparisons

With contributions by: Piotrus, Benjamin Mako Hill, Tbayer, DarTar, Adler.fa, Hfordsa, Drdee

### Early prediction of movie box-office revenues with Wikipedia data

An open-access preprint[1] has announced the results from a study attempting to predict early box-office revenues from Wikipedia traffic and activity data. The authors – a team of computational social scientists from Budapest University of Technology and Economics, Aalto University and the Central European University – submit that behavioral patterns on Wikipedia can be used for accurate forecasting, matching and in some cases outperforming the use of social media data for predictive modeling. The results, based on a corpus of 312 English Wikipedia articles on movies released in 2010, indicate that the joint editing activity and traffic measures on Wikipedia are strong predictors of box-office revenue for highly successful movies.

The authors contrast their early prediction approach with more popular real-time prediction/monitoring methods, and suggest that movie popularity can be accurately predicted well in advance, up to a month before the release. The study received broad press coverage and was featured in The Guardian, the MIT Technology Review and the Hollywood Reporter among others. The authors observe that their approach, being “free of any language based analysis, e.g., sentiment analysis, could be easily generalized to non-English speaking movie markets or even other kinds of products”. The dataset used for this study, including the financial and Wikipedia activity data, is available among the supplementary materials of the paper.

### Readability of the English Wikipedia, Simple Wikipedia, and Britannica compared

$
4.71 \left (\frac{\mbox{characters}}{\mbox{words}} \right) + 0.5 \left (\frac{\mbox{words}}{\mbox{sentences}} \right) - 21.43
$

The automated readability index, one of the readability metrics used in the study[2]
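The formula above can be applied directly once characters, words and sentences have been counted. The sketch below is a minimal illustration under simple assumptions of our own, not the counting rules used in the study: characters are letters and digits, words are whitespace-separated tokens, and sentences end at `.`, `!` or `?`. The function name is hypothetical.

```python
import re


def automated_readability_index(text):
    """Compute the automated readability index with naive counting rules.

    Assumptions (ours, not the study's): characters are alphanumeric,
    words are whitespace-separated, sentences end at '.', '!' or '?'.
    """
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    characters = sum(ch.isalnum() for ch in text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    return (
        4.71 * (characters / len(words))
        + 0.5 * (len(words) / len(sentences))
        - 21.43
    )


if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    # 18 characters, 6 words, 2 sentences
    print(round(automated_readability_index(sample), 2))
```

Longer words and longer sentences both raise the score, so a text written in short, simple sentences scores lower than dense prose.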

A study[2] by researchers at Kyoto University presents a detailed assessment of the readability of the English Wikipedia against Encyclopedia Britannica and the Simple English Wikipedia using a series of readability metrics and finds that Wikipedia “seems to lag behind the other encyclopedias in terms of readability and comprehensibility of its content”.

# Wikimedia Research Newsletter, October 2012

Vol: 2 • Issue: 10 • October 2012

WP governance informal; community as social network; efficiency of recruitment and content production; Rorschach news

With contributions by: Piotrus, Adler.fa, Bdamokos, Ragesoss, Tbayer, and Phoebe

### Wikipedia governance found to be mostly informal

A paper in the Journal of the American Society for Information Science and Technology, coming from the social control perspective and employing the repertory grid technique, has contributed interesting observations about the governance of Wikipedia.[1] The paper begins with a helpful if cursory overview of governance theories, moving towards the governance of open source communities and Wikipedia. That cursory treatment is not foolproof, though. For example, the authors mention “bazaar style governance” but attribute it incorrectly: rather than the 2006 work they cite, the coining of this term dates to Eric S. Raymond’s 1999 The Cathedral and the Bazaar. The authors have interviewed a number of Wikipedians and identified a number of formal and informal governance mechanisms. Only one formal mechanism, the policies, was found important, while seven informal mechanisms were deemed important: collaboration among users, discussions on article talk pages, facilitation by experienced users, individuals acting as guardians of the articles, inviting individuals to participate, large numbers of editors, and participation by highly reputable users. Notably, the interviewed editors did not view elements such as administrator involvement, mediation or voting as important.

The paper concludes that “in the everyday practice of content creation, the informal mechanisms appear to be significantly more important than the formal mechanisms”, and notes that this likely means that the formal mechanisms are used much more sparingly than informal ones, most likely only in the small percentage of cases where the informal mechanisms fail to provide an agreeable solution for all the parties. It was stressed that not all editors are equal, and certain editors (and groups) have much more power than others, a fact that is quickly recognized by all editors. The authors note the importance of transparent interactions in spaces like talk pages, and note that “the reported use of interaction channels outside the Wikipedia platform (e.g., e-mail) is a cause for concern, as these channels limit involvement and reduce transparency.” Citing Ostrom’s governance principles, they note that “ensuring participation and transparency is crucial for maintaining the stability of self-governing communities.”

# Wikimedia Research Newsletter, September 2012

Vol: 2 • Issue: 9 • September 2012

“Rise and decline” of Wikipedia participation, new literature overviews, a look back at WikiSym 2012

With contributions by: Piotrus, Phoebe, DarTar, Benjamin Mako Hill, Ragesoss and Tbayer

### “The rise and decline” of the English Wikipedia

A paper to appear in a special issue of American Behavioral Scientist (summarized in the research index) sheds new light on the English Wikipedia’s declining editor growth and retention trends. The paper describes how “several changes that the Wikipedia community made to manage quality and consistency in the face of a massive growth in participation have lead to a more restrictive environment for newcomers”.[1] The number of active Wikipedia editors has been declining since 2007 and research examining data up to September 2009[2] has shown that the root of the problem has been the declining retention of new editors. The authors show this decline is mainly due to a decline among desirable, good-faith newcomers, and point to three factors contributing to the increasingly “restrictive environment” they face.

# Wikimedia Research Newsletter, August 2012

Vol: 2 • Issue: 8 • August 2012

New influence graph visualizations; NPOV and history; ‘low-hanging fruit’

With contributions by: Piotrus, Ragesoss, Evan, DarTar, Tbayer and OrenBochman

### Wikipedia-based graphs visualize influences between thinkers, writers and musicians

A visualization of musical genres related to psychedelic music, based on DBpedia data.

In a blog post titled “Graphing the history of philosophy”,[1] Simon Raper of the company MindShare UK describes how he constructed an influence graph of all philosophers using the “Influenced by” and “Influenced” fields of Template:Infobox philosopher (example: Plato). This information was retrieved using DBpedia with a simple SPARQL query. After some cleanup, the result, consisting of triplets in the form <Philosopher A, Philosopher B, Weight> was processed using the open source graph visualization package Gephi to create an impressive overview of the philosophers within their respective spheres of influence.

Brendan Griffen extended the idea to “everyone on Wikipedia. Well, everyone with an infobox containing ‘influences’ and/or ‘influenced by’”, arriving at a huge, far more dense “Graph Of Ideas” including not only philosophers, but also novelists, fantasy and science fiction writers, and comedians.[2] In another blog post,[3] Griffen added transitive links as well – so that each person is considered to be influenced both directly and indirectly. The most connected people in the graph were ancient Greek thinkers, with Thales, Pythagoras and Zeno of Elea occupying the top three spots. Griffen remarks that this vindicates a statement in Bertrand Russell’s History of Western Philosophy (1945): “Western Philosophy begins with Thales”.
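To make the idea of direct and transitive influence concrete, here is a small sketch that builds a graph from triplets in the <A, B, Weight> form described above and counts everyone each person influences, directly or indirectly. It is an illustration only, not the code used in either blog post. The placeholder names `A` to `D`, the edge list and the function names are hypothetical, and the weight is carried along but not used in the ranking.

```python
from collections import defaultdict

# Hypothetical triplets: (influencer, influenced, weight).
# Placeholder names are used on purpose; this is not DBpedia data.
TRIPLES = [
    ("A", "B", 1),
    ("B", "C", 1),
    ("C", "D", 1),
    ("B", "D", 1),
]


def build_graph(triples):
    """Map each person to the set of people they directly influence."""
    graph = defaultdict(set)
    for influencer, influenced, _weight in triples:
        graph[influencer].add(influenced)
        graph.setdefault(influenced, set())  # make sure every person is a key
    return graph


def transitive_influence(graph, start):
    """Everyone reachable from `start`, i.e. influenced directly or indirectly."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        person = stack.pop()
        if person in seen or person == start:
            continue  # already visited, or a cycle back to the start
        seen.add(person)
        stack.extend(graph.get(person, ()))
    return seen


def rank_by_influence(graph):
    """People sorted by how many others they influence, most influential first."""
    counts = [(p, len(transitive_influence(graph, p))) for p in list(graph)]
    return sorted(counts, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for person, reach in rank_by_influence(build_graph(TRIPLES)):
        print(person, reach)
```

With transitive links counted, the earliest figures in a chain of influence accumulate the largest reach, which is consistent with ancient thinkers topping Griffen's ranking.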