# A year’s worth of Wikipedia research

Twelve years after its launch, Wikipedia continues to attract a great deal of attention from scholars trying to understand what made it one of the most remarkable collaborative efforts in history and what makes it work. Researchers have called Wikipedia “our Everest” (because of its complexity and cultural importance) or “the Drosophila (fruit fly) of social software” (because the project’s transparency and freely available data make it accessible and popular as a research subject).

In 2011, we launched a monthly Wikimedia Research Newsletter with the aim of covering recent academic research about Wikipedia and other Wikimedia projects. Published jointly by the Wikimedia Research Committee and the Signpost (the English Wikipedia’s community-edited newspaper), it has established itself as a comprehensive outlet enabling both researchers and Wikipedians to stay on top of current research, aiming to facilitate exchange between these two communities.

Today we are announcing the release into the public domain of a curated corpus containing the bibliographic references of all 225 publications reviewed or covered in the second volume of the newsletter, forming a historical record of Wikipedia research in the year 2012. This corpus can be browsed online or downloaded, ready to be imported into reference managers or other literature collections. Papers in this dataset have been marked as either open access or closed access.

Last year, we published a similar dataset for volume 1 (2011). Together, these releases complement other efforts to catalogue the research literature on Wikipedia, in particular the WikiLit project, which focuses on publications until June 2011, prior to the launch of the newsletter.

A year ago we launched the @WikiResearch news feed on Twitter and Identi.ca, covering new preprints, papers or research-related blog posts, before they are reviewed more fully in the Newsletter. As of February 2013, it has gained 745 followers and continues to be actively updated.

We also started offering the newsletter in the form of an HTML email (in addition to the announcements of each new issue on the Wikiresearch-l mailing list, which only contain the table of contents). This experiment proved successful, too, with almost 100 subscribers to date (adding to the thousands of pageviews each issue receives when published as part of the Signpost, on Meta-wiki and on this blog). You can sign up to receive a copy of each new issue in your inbox as soon as it comes out.

The Newsletter is a collaborative effort and would not exist without the following 22 people who contributed reviews and summaries in 2012:

More than half of our contributors are researchers themselves, who have published about Wikipedia in peer-reviewed publications. We are also grateful for the help of several Signpost collaborators in copyediting and preparing the final publication every month.

Finally, thanks to everyone for reading the Wikimedia Research Newsletter, and please consider contributing by pointing us to new research we should cover, or by volunteering to review new publications.

The editors of the Wikimedia Research Newsletter:

Tilman Bayer, Senior Operations Analyst
Dario Taraborelli, Senior Research Analyst

# Wikimedia Research Newsletter, January 2013

Vol: 3 • Issue: 1 • January 2013

Lessons from the research literature on open collaboration; clicks on featured articles; credibility heuristics

With contributions by: Taha Yasseri, Piotrus, Aaron Shaw, Tbayer and Lui8E

### Lessons from the wiki research literature in “American Behavioral Scientist” special issue

A special issue of the American Behavioral Scientist is devoted to “open collaboration”.

• Consistent patterns found in Wikipedia and other open collaborations: In the introductory piece[1], researchers Andrea Forte and Cliff Lampe give an overview of this field, defined as the study of “distributed, collaborative efforts made possible because of changes in information and communication technology that facilitate cooperative activities” – with open source projects and Wikipedia among the most prominent examples. They point out that “[b]y now, thousands of scholars have written about open collaboration systems, many hundreds of thousands of people have participated in them, and millions of people use products of open collaboration every day.” Among their “lessons from the literature”, they name three “consistent patterns” found by researchers of open collaborations:
• “Participation Is Unequal” (meaning that some participants contribute vastly more than others: “In Wikipedia, for example, it has long been shown that a few editors provide the bulk of contributions to the site.”)
• “There Are Special Requirements for Socializing New Users”
• “Users Are Massively Heterogeneous in Both How and Why They Participate”
• “Ignore All Rules” as “tension release mechanism”: The abstract of a paper titled “Rules and Roles vs. Consensus: Self-Governed Deliberative Mass Collaboration Bureaucracies”[2] explains “Wikipedia’s unusual policy, ignore all rules (IAR)” as a “tension release mechanism” that is “reconciling the tension between individual agency and collective goals” by “[supporting] individual agency when positions taken by participants might conflict with those reflected in established rules. Hypotheses are tested with Wikipedia data regarding individual agency, bureaucratic processes, and IAR invocation during the content exclusion process. Findings indicate that in Wikipedia each utterance matters in deliberations, rules matter in deliberations, and IAR citation magnifies individual influence but also reinforces bureaucracy.”
• Collaboration on articles about breaking news matures more quickly: “Hot Off the Wiki: Structures and Dynamics of Wikipedia’s Coverage of Breaking News Events”[3] analyzes “Wikipedia articles about over 3,000 breaking news events, [investigating] the structure of interactions between editors and articles”, finding that “breaking articles emerge into well-connected collaborations more rapidly than nonbreaking articles, suggesting early contributors play a crucial role in supporting these high-tempo collaborations.” (See also our earlier review of a similarly themed paper by the same team: “High-tempo contributions: Who edits breaking news articles?”)

A fourth paper in this special issue, titled “The Rise and Decline of an Open Collaboration System: How Wikipedia’s Reaction to Popularity Is Causing Its Decline”, attracted considerable media attention this month, starting with an article in USA Today. It was already reviewed in the September issue of the research report.

# Wikimedia Research Newsletter, December 2012

Vol: 2 • Issue: 12 • December 2012

Wikipedia and Sandy Hook; SOPA blackout reexamined

With contributions by: Daniel Mietchen, Piotrus, Junkie.dolphin, Taha Yasseri, Benjamin Mako Hill, Aaron Shaw, Tbayer, DarTar and Ragesoss

### How Wikipedia deals with a mass shooting

Northeastern University researcher Brian Keegan analyzed how hundreds of Wikipedians gathered to cover the Sandy Hook Elementary School shooting in the immediate aftermath of the tragedy. The findings are reported in a detailed blog post that was later republished by the Nieman Journalism Lab.[1] Keegan observes that the Sandy Hook shooting article reached a length of 50 KB within 24 hours of its creation, making it the fastest-growing article by length in the first day among recent articles covering mass shootings on the English-language Wikipedia. The analysis compares the Sandy Hook page with six similar articles from a list of 43 articles on shooting sprees in the US since 2007. Among the analyses described in the study, of particular interest are the dynamics of dedicated vs. occasional contributors as the article reaches maturity: while in the first few hours contributions are evenly distributed with a majority of single-edit editors, after hour 3 or 4 a number of dedicated editors show up and “begin to take a vested interest in the article, which is manifest in the rapid centralization of the article”. A plot of inter-edit time also shows the sustained frequency of revisions that these articles display days after their creation, with Sandy Hook averaging about 1 edit per minute around 24 hours after its first revision. The notebook and social network data produced by the author for the analysis are available on his website. The Nieman Journalism Lab previously covered the role that Wikipedia plays as a platform for collaborative journalism, and why its format outperforms Wikinews, in an interview with Andrew Lih published in 2010.[2] The early revision history of the Sandy Hook shooting article was also covered in a blog post by Oxford Internet Institute fellow Taha Yasseri, though with a focus on the coverage in different Wikipedia language editions.[3]
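The inter-edit time mentioned above is the gap between consecutive revision timestamps of an article. As a minimal sketch (not Keegan's notebook; the function name and the sample timestamps below are hypothetical and chosen only for illustration), it could be computed like this:

```python
from datetime import datetime


def inter_edit_minutes(timestamps):
    """Return the gaps, in minutes, between consecutive edits (oldest first)."""
    ordered = sorted(timestamps)
    return [
        (later - earlier).total_seconds() / 60
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Hypothetical revision times, for illustration only (not real article data).
edits = [
    datetime(2000, 1, 1, 18, 0),
    datetime(2000, 1, 1, 18, 1),
    datetime(2000, 1, 1, 18, 3),
    datetime(2000, 1, 1, 18, 6),
]
gaps = inter_edit_minutes(edits)
mean_gap = sum(gaps) / len(gaps)
```

A mean gap of one minute would correspond to the "about 1 edit per minute" rate reported for the Sandy Hook article; plotting such gaps over time would give an inter-edit time curve of the kind described.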

### Network positions and contributions to online public goods: the case of the Chinese Wikipedia

A graph with nodes color-coded by betweenness centrality (from red=0 to blue=max).

In a forthcoming paper in the Journal of Management Information Systems (presented earlier at HICSS ’12[4]), Xiaoquan (Michael) Zhang and Chong (Alex) Wang use a natural experiment to demonstrate that changes to the position of individuals within the editor network of a wiki modify their editing behavior. The data for this study came from the Chinese Wikipedia. In October 2005, the Chinese government suddenly blocked access to the Chinese Wikipedia from mainland China, creating an unanticipated decline in the editor population. As a result, the remaining editors found themselves in a new network structure and, the authors claim, any changes in editor behavior that ensued are likely effects of this discontinuous “shock” to the network.

# Wikimedia Research Newsletter, November 2012

Vol: 2 • Issue: 11 • November 2012

Movie success predictions, readability, credentials and authority, geographical comparisons

With contributions by: Piotrus, Benjamin Mako Hill, Tbayer, DarTar, Adler.fa, Hfordsa, Drdee

### Early prediction of movie box-office revenues with Wikipedia data

An open-access preprint[1] has announced the results from a study attempting to predict early box-office revenues from Wikipedia traffic and activity data. The authors – a team of computational social scientists from Budapest University of Technology and Economics, Aalto University and the Central European University – submit that behavioral patterns on Wikipedia can be used for accurate forecasting, matching and in some cases outperforming the use of social media data for predictive modeling. The results, based on a corpus of 312 English Wikipedia articles on movies released in 2010, indicate that the joint editing activity and traffic measures on Wikipedia are strong predictors of box-office revenue for highly successful movies.

The authors contrast their early prediction approach with more popular real-time prediction/monitoring methods, and suggest that movie popularity can be accurately predicted well in advance, up to a month before the release. The study received broad press coverage and was featured in The Guardian, the MIT Technology Review and the Hollywood Reporter, among others. The authors observe that their approach, being “free of any language based analysis, e.g., sentiment analysis, could be easily generalized to non-English speaking movie markets or even other kinds of products”. The dataset used for this study, including the financial and Wikipedia activity data, is available among the supplementary materials of the paper.

### Readability of the English Wikipedia, Simple Wikipedia, and Britannica compared

$
4.71 \left (\frac{\mbox{characters}}{\mbox{words}} \right) + 0.5 \left (\frac{\mbox{words}}{\mbox{sentences}} \right) - 21.43
$

The automated readability index, one of the readability metrics used in the study[2]

A study[2] by researchers at Kyoto University presents a detailed assessment of the readability of the English Wikipedia against Encyclopedia Britannica and the Simple English Wikipedia using a series of readability metrics and finds that Wikipedia “seems to lag behind the other encyclopedias in terms of readability and comprehensibility of its content”.
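To make the automated readability index formula shown above concrete, here is a minimal sketch in Python. The tokenization rules are assumptions of this sketch, not taken from the study: words are whitespace-separated tokens, characters are letters and digits only, and sentences end at `.`, `!` or `?`. The function name is likewise hypothetical.

```python
import re


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    # Assumption: words are whitespace-separated tokens.
    words = text.split()
    # Assumption: a sentence ends at one or more of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # Assumption: only letters and digits count as characters.
    characters = sum(ch.isalnum() for word in words for ch in word)
    return (
        4.71 * characters / len(words)
        + 0.5 * len(words) / len(sentences)
        - 21.43
    )


score = automated_readability_index("The cat sat. The dog ran.")
```

For this toy input there are 18 characters, 6 words and 2 sentences, so the score is 4.71 × 3 + 0.5 × 3 − 21.43 = −5.80. Real encyclopedia text, with longer words and sentences, yields higher values; different tokenization choices will shift the result somewhat.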

# Wikimedia Research Newsletter, October 2012

Vol: 2 • Issue: 10 • October 2012

WP governance informal; community as social network; efficiency of recruitment and content production; Rorschach news

With contributions by: Piotrus, Adler.fa, Bdamokos, Ragesoss, Tbayer, and Phoebe

### Wikipedia governance found to be mostly informal

A paper in the Journal of the American Society for Information Science and Technology, coming from the social control perspective and employing the repertory grid technique, has contributed interesting observations about the governance of Wikipedia.[1] The paper begins with a helpful if cursory overview of governance theories, moving towards the governance of open source communities and Wikipedia. That cursory treatment is not foolproof, though: for example, the authors mention “bazaar style governance”, but attribute it incorrectly. Rather than the 2006 work they cite, the coining of this term dates to Eric S. Raymond’s 1999 The Cathedral and the Bazaar. The authors interviewed a number of Wikipedians and identified several formal and informal governance mechanisms. Only one formal mechanism, the policies, was found important, while seven informal mechanisms were deemed important: collaboration among users, discussions on article talk pages, facilitation by experienced users, individuals acting as guardians of the articles, inviting individuals to participate, large numbers of editors, and participation by highly reputable users. Notably, the interviewed editors did not view elements such as administrator involvement, mediation or voting as important.

The paper concludes that “in the everyday practice of content creation, the informal mechanisms appear to be significantly more important than the formal mechanisms”, and notes that this likely means that the formal mechanisms are used much more sparingly than informal ones, most likely only in the small percentage of cases where the informal mechanisms fail to provide an agreeable solution for all the parties. The paper stresses that not all editors are equal, and that certain editors (and groups) have much more power than others, a fact that is quickly recognized by all editors. The authors note the importance of transparent interactions in spaces like talk pages, and warn that “the reported use of interaction channels outside the Wikipedia platform (e.g., e-mail) is a cause for concern, as these channels limit involvement and reduce transparency.” Citing Ostrom’s governance principles, they add that “ensuring participation and transparency is crucial for maintaining the stability of self-governing communities.”

# Wikimedia Research Newsletter, September 2012

Vol: 2 • Issue: 9 • September 2012

“Rise and decline” of Wikipedia participation, new literature overviews, a look back at WikiSym 2012

With contributions by: Piotrus, Phoebe, DarTar, Benjamin Mako Hill, Ragesoss and Tbayer

### “The rise and decline” of the English Wikipedia

A paper to appear in a special issue of American Behavioral Scientist (summarized in the research index) sheds new light on the English Wikipedia’s declining editor growth and retention trends. The paper describes how “several changes that the Wikipedia community made to manage quality and consistency in the face of a massive growth in participation have lead to a more restrictive environment for newcomers”.[1] The number of active Wikipedia editors has been declining since 2007 and research examining data up to September 2009[2] has shown that the root of the problem has been the declining retention of new editors. The authors show this decline is mainly due to a decline among desirable, good-faith newcomers, and point to three factors contributing to the increasingly “restrictive environment” they face.

# Wikimedia Research Newsletter, August 2012

Vol: 2 • Issue: 8 • August 2012

New influence graph visualizations; NPOV and history; ‘low-hanging fruit’

With contributions by: Piotrus, Ragesoss, Evan, DarTar, Tbayer and OrenBochman

### Wikipedia-based graphs visualize influences between thinkers, writers and musicians

A visualization of musical genres related to psychedelic music, based on DBpedia data.

In a blog post titled “Graphing the history of philosophy”,[1] Simon Raper of the company MindShare UK describes how he constructed an influence graph of all philosophers using the “Influenced by” and “Influenced” fields of Template:Infobox philosopher (example: Plato). This information was retrieved using DBpedia with a simple SPARQL query. After some cleanup, the result, consisting of triplets in the form <Philosopher A, Philosopher B, Weight> was processed using the open source graph visualization package Gephi to create an impressive overview of the philosophers within their respective spheres of influence.

Brendan Griffen extended the idea to “everyone on Wikipedia. Well, everyone with an infobox containing ‘influences’ and/or ‘influenced by’”, arriving at a huge, far denser “Graph Of Ideas” including not only philosophers, but also novelists, fantasy and science fiction writers, and comedians.[2] In another blog post,[3] Griffen added transitive links as well, so that each person is considered to be influenced both directly and indirectly. The most connected people in the graph were ancient Greek thinkers, with Thales, Pythagoras and Zeno of Elea occupying the top three spots. Griffen remarks that this vindicates a statement in Bertrand Russell’s History of Western Philosophy (1945): “Western Philosophy begins With Thales”.
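The idea of adding transitive links can be illustrated with a small sketch. This is not Griffen's code: the function name, the edge format (pairs of influencer and influenced, with the weight omitted) and the placeholder names are assumptions made for illustration. It collects, for every person, everyone they influence directly or through a chain of intermediaries, and tolerates cycles such as mutual influence.

```python
from collections import defaultdict


def transitive_influence(edges):
    """Map each influencer to everyone they influence, directly or indirectly."""
    graph = defaultdict(set)
    for influencer, influenced in edges:
        graph[influencer].add(influenced)

    reach = {}
    for start in list(graph):
        seen = set()
        stack = list(graph[start])
        # Depth-first walk along "influenced" links.
        while stack:
            person = stack.pop()
            if person in seen or person == start:
                continue  # already visited, or a cycle back to the start
            seen.add(person)
            stack.extend(graph.get(person, ()))
        reach[start] = seen
    return reach


# Placeholder names, not data from the blog posts.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
reach = transitive_influence(edges)
ranking = sorted(reach, key=lambda person: (-len(reach[person]), person))
```

Ranking people by the size of their transitive reach naturally favors early figures at the head of long influence chains, which is consistent with ancient Greek thinkers topping Griffen's list.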

# Wikimedia Research Newsletter, July 2012

Vol: 2 • Issue: 7 • July 2012

Conflict dynamics, collaboration and emotions; digitization vs. copyright; WikiProject field notes; quality of medical articles; role of readers; Best Wiki Paper Award

With contributions by: Daniel Mietchen, Junkie.dolphin, Jodi.a.schneider, Adler.fa, OrenBochman, DarTar, Benjamin Mako Hill and Tbayer

### Modeling social dynamics in a collaborative environment

A draft of a letter, submitted for publication, has been posted on ArXiv.[1] The letter reports research on modeling the process of collaborative editing in Wikipedia and similar open-collaboration writing projects. The work builds on previous research by some of its authors on conflict detection in Wikipedia. The authors explore a simple agent-based model of opinion dynamics, in which editors influence each other either by direct communication or by successively editing a shared medium, such as a Wikipedia page. According to the authors, the model, although highly idealized, exhibits a rich behavior that can reproduce, albeit only qualitatively, some key characteristics of conflicts over real-world Wikipedia pages. The authors show that, for a fixed editorial pool with one “mainstream” and two opposing “extremist” groups, consensus is always reached. However, depending on the values of the model’s input parameters, achieving consensus may take an extremely long time, and the consensus does not always conform to the initial mainstream view. In the case of a dynamic group, where new editors replace existing ones, consensus may be achieved through a phase of conflict, depending on the rate of new editors joining the editorial pool and on the degree of controversy over the article’s topic.

# Wikimedia Research Newsletter, June 2012

Vol: 2 • Issue: 6 • June 2012

Edit war patterns, deleters vs. the 1%, never used cleanup tags, authorship inequality, higher quality from central users, and mapping the wikimediasphere

With contributions by: Tbayer, Piotrus, Evan and Daniel Mietchen

### Dynamics of edit wars

Controversy about Michael Jackson as quantified on the basis of reverted edits to his Wikipedia article. A: Jackson is acquitted on all counts after five month trial. B: Jackson makes his first public appearance since the trial to accept eight records from the Guinness World Records in London, including Most Successful Entertainer of All Time. C: Jackson issues Thriller 25. D: Jackson dies in LA.

“Dynamics of Conflicts in Wikipedia”[1] develops an interesting “measure of controversiality”, something that might be of interest to editors at large if it were a more widely popularized and dynamically updated statistic. The authors look at the patterns of edit warring on Wikipedia articles, finding that edit warriors are usually prone to reaching consensus, and the rare cases of never-ending warring involve articles that continuously attract new editors who have not yet joined the consensus.

Regarding methodology, the authors’ decision to filter out articles with under 100 edits as “evidently conflict-free” is a bit problematic, as articles with fewer than 100 edits have been subject to clear, if not over-long, edit warring (a recent example: Concerns and controversies related to UEFA Euro 2012). One could also wish that the “memory effects” – a term mentioned only in the abstract and lead, which the authors suggest is significant to understanding the conflict dynamics – were explained somewhere in the article (the term “memory” itself appears four times in the body and does not seem to be operationalized anywhere).

A press release accompanying the paper is titled “Wikipedia ‘edit wars’ show dynamics of conflict emergence and resolution”, while an MSNBC tech news headline summarized it as “Wikipedia is editorial warzone, says study”.

# Wikimedia Research Newsletter, May 2012

Vol: 2 • Issue: 5 • May 2012

Supporting interlanguage collaboration; detecting reverts; Wikipedia’s discourse, semantic and leadership networks, and Google’s Knowledge Graph

With contributions by: Jodi.a.schneider, Piotrus, Tbayer and Angelika Adam

### Discourse on Wikipedia sometimes irrational and manipulative, but still emancipating, democratic and productive

An article[1] in the sociology journal The Information Society looks at interactions between Wikipedia editors and the project’s governance, visible in the articles on stem cells and transhumanism, and in the analysis of Wikipedia’s discussion of userboxes, all through the prism of Jürgen Habermas’s universal pragmatics and Mikhail Bakhtin’s dialogism.

The authors focus on the qualitative analysis of language used by editors, to argue that Wikipedia has elements of a democracy, and is an example of a Web 2.0–empowering discourse tool. They stress that some forms of discourse found online (including on Wikipedia) may be highly irrational, something that some previous arguments that Web 2.0 is a democratic space have often ignored, but they argue that this is in fact not as much of a hindrance as previously expected. Cimini and Burr remark that discourse can develop between Wikipedians of widely differing points of view, and that some editors will engage in “repeated, strategic, and often highly manipulative attempts” to assert personal authority. Such discussions may be very lively, involving “personal, emotional, or humour-based arguments”, yet the authors argue that such comments may not be a hindrance; instead, “on many occasions, there is thus a clearer exposition of views that is achieved, in spite of, or perhaps because of, these personal [and] sometimes vulgar methods of argumentation.”

In the end, the authors are positive about the success of Wikipedia’s deliberation in reaching consensus, although they say that it can be “fleeting and transitory” on occasion. Unfortunately, the paper does not touch on Wikipedia policies such as Wikipedia:Civility and Wikipedia:No personal attacks, which would certainly have added to their analysis.