# Wikimedia blog

News from the Wikimedia Foundation and about the Wikimedia movement

# Wikimédia France Research Award 2013: And the winner is…

(This is a guest post by Carol Ann O’Hare of the French Wikimedia chapter.)

Wikimedia France is pleased to announce the first winner of the Wikimedia France Research Award:

Can History Be Open Source? Wikipedia and the Future of the Past, by Roy Rosenzweig, published in The Journal of American History in 2006.

This choice was made from thirty scientific publications on Wikimedia projects and free knowledge, submitted directly by the Wikimedia community. From these publications, a jury of researchers working on these topics selected five finalists. All Wikimedians, along with the jury members, were then encouraged to give their opinion and vote among the five finalists to determine the most relevant paper. This kind of open submission and voting process, involving an entire community of non-experts, is unique for such a research award.

“Thought paper/essay that contrasts with classical scientific articles, but a very stimulating read.”

“Rosenzweig was a pioneer in digital history, incorporating new digital media and technology with history to explore new possibilities to reach a larger and diverse public audience.”

These are comments from jury members and Wikimedians about this publication, which has had a significant impact in the field of digital history – almost 160 citations in other scientific publications, according to Google Scholar.

Roy Rosenzweig was a history professor at George Mason University (Virginia) who presented this paper on Wikipedia from the perspective of a historian. In his publication, he focuses not just on factual accuracy, but also on the quality of prose and the historical context of entry subjects.

In detail, Rosenzweig adds to a growing body of research trying to determine the accuracy of Wikipedia through a comparative analysis with other online history references. He compares entries in Wikipedia with Microsoft’s online resource Encarta and with American National Biography Online (ANBO). Where Encarta is aimed at a mass audience, American National Biography Online is a more specialized history resource. Rosenzweig takes a sample of 52 entries from the 18,000 found in ANBO and compares them with entries in Encarta and Wikipedia. In coverage, Wikipedia contained more of the topics from the sample than Encarta. Although the articles did not reach the length of those in ANBO, Wikipedia’s articles were longer than the entries in Encarta. Further, in terms of accuracy, Wikipedia and Encarta were essentially on par with each other, which confirms the similar conclusion that the Nature study reached in its comparison of Wikipedia and Encyclopedia Britannica.

Rosenzweig then discusses the effects of collaborative writing in more qualitative terms. He notes that collaborative writing often leads to less compelling prose: multiple styles of writing, competing interests and motivations, and varying levels of writing ability are all factors in the quality of a text. Wikipedia entries may be for the most part factually correct, but they are often not that well written, or not historically relevant in terms of what receives emphasis. Due to piecemeal authorship, the articles often fail to cohere with the larger historical conversation. ANBO’s entries, by contrast, are well crafted, as they are often authored by well-known historians.

However, the quality of writing needs to be balanced against accessibility. ANBO is subscription-based, whereas Wikipedia is free, which shows how access to a resource shapes its purpose. Because Wikipedia is largely the product of amateur historians, Rosenzweig comments on the tension created when professional historians engage with it. He notes that it tends to be full of interesting trivia, but that the seasoned historian will question its historic significance. The professional historian also has great concern for citation and the sourcing of references, which is not as rigorously enforced on Wikipedia.

Because of Wikipedia’s widespread and growing use, it challenges the authority of the professional historian, and therefore cannot be ignored. The tension raises questions about the professional historian’s obligation to Wikipedia. To this point, Roy Rosenzweig notes there is an obligation and need to provide the public with quality information in Wikipedia or some other venue. He concludes by looking forward and describing what the professional historian can learn from open collaborative production models.

You can view the full publication (in English) here: http://chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=42 and on the Research Award’s dedicated website: http://researchaward.wikimedia.fr/en

Roy Rosenzweig died in 2007. Wikimédia France has decided to award the €2,500 prize to the Center for History and New Media, founded by Roy Rosenzweig in 1994.

In launching this international research award, Wikimédia France wanted to highlight research works dedicated to Wikipedia in particular, and to provide greater visibility for these works among the entire Wikimedia community. A new edition of the prize will take place in 2014.

Carol Ann O’Hare
Wikimedia France

# Wikimedia Research Newsletter, April 2013

Vol: 3 • Issue: 4 • April 2013

Sentiment monitoring; Wikipedians and academics favor the same papers; UNESCO and systemic bias; How ideas flow on Wikiversity

With contributions by: Piotr Konieczny, Oren Bochman, Taha Yasseri, Jonathan T. Morgan and Tilman Bayer

### Too good to be true? Detecting COI, Attacks and Neutrality using Sentiment Analysis

Traditional methods for detecting sentiment are less objective

Finn Årup Nielsen, Michael Etter and Lars Kai Hansen presented a technical report[1] on an online service they created to conduct real-time monitoring of Wikipedia articles about companies. It performs sentiment analysis of edits, filtered by company and editor. Sentiment analysis is a relatively new applied-linguistics technology used in tasks ranging from author profiling to detecting fake reviews on online retail sites. The visualization provided by this tool makes it easy to spot deviations from linguistic neutrality. However, as the authors point out, the analysis only gives a robust picture when used statistically, and is more prone to mistakes when operating within a limited scope.

The service monitors recent changes using an IRC stream and detects company-related articles from a small hand-built list. It then retrieves the current version using the MediaWiki API and performs sentiment analysis using the AFINN sentiment-annotated word list. The project was developed by integrating a number of open source components such as NLTK and CouchDB. Unfortunately, the source code has not been made available, and the service can only run queries on the shortlisted companies, which limits the impact of this report on future Wikipedia research. However, it seems to have potential as a tool for detecting COI edits, which tend to skew neutrality with excess praise, and attacks, which tip the content in the other direction. We hope the researchers will open-source this tool like their prior work on the AFINN data-set, or at least provide some UI to query articles not included in the original research.
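The AFINN approach the service relies on is essentially word-list valence scoring. As a rough illustration, an edit's text can be scored by averaging the valences of the lexicon words it contains; the mini-lexicon below is an invented subset for the example, not the real AFINN scores.

```python
import re

# Illustrative mini-lexicon in the AFINN style (word -> valence from
# -5 to +5); the real AFINN list scores a few thousand words, and
# these particular values are made up for the example.
LEXICON = {
    "excellent": 3, "innovative": 2, "praise": 3,
    "fraud": -4, "scandal": -3, "lawsuit": -2,
}

def sentiment_score(text):
    """Mean valence of the lexicon words found in `text` (0.0 if none)."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

print(sentiment_score("An innovative company, winning praise."))  # → 2.5
print(sentiment_score("Hit by a fraud scandal and a lawsuit."))   # → -3.0
```

A monitoring tool like the authors' can then flag articles whose scores drift far from zero, since encyclopedic prose should stay close to neutral.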

### “A Comparative Study of Academic impact and Wikipedia Ranking”

A paper[2] with this title investigates the relation between the scientific reputation of scientific items (authors, papers, and keywords) and the impact of the same items on Wikipedia articles. (more…)

# Wikimedia Research Newsletter, March 2013

Vol: 3 • Issue: 3 • March 2013

“Ignore all rules” in deletions; anonymity and groupthink; how readers react when shown talk pages

With contributions by: Amir E. Aharoni, Piotr Konieczny, Taha Yasseri, Oren Bochman, Heather Ford, Tilman Bayer, Giovanni Luca Ciampaglia, Daniel Mietchen.

### Wikipedia’s “Ignore all rules” policy (IAR) is a double edged sword in deletion arguments

A beetle larva ignoring the rules while negotiating deletion with the frog.[mediasource 1]

A paper presented at last month’s CSCW Conference, titled “Keeping eyes on the prize: officially sanctioned rule breaking in mass collaboration systems”[1] observes that “Mass collaboration systems are often characterized as unstructured organizations lacking rule and order”, yet Wikipedia has a well-developed body of policies to support it as an organization. Rule breaking in bureaucracies is a slippery slope that quickly leads to potentially dangerous exceptions, so Wikipedia has a mechanism called “Ignore all rules” (WP:IAR) for officially sanctioned rule breaking. The researchers considered IAR’s impact within the scope of deletion requests. The results show that the IAR policy meaningfully influences deliberation outcomes and that, rather than wreaking havoc, it provides a positive, functional governance mechanism.

This paper is another welcome addition to the growing literature on AfD, examining the effectiveness of rule breaking via WP:IAR within these discussions. It starts with an in-depth examination of rule breaking within collaborative environments. Six hypotheses are then postulated:

1. Invocation of WP:IAR in support of a vote correlates with an increased likelihood that the vote will be on the winning side.
2. This effect is expected to increase with the number of policies cited in the deletion proposal (since they may be contradicting each other).
3. Invoking IAR to override the deletion proposal’s policy citation tends to reduce the proposal’s likelihood of success.
4. When IAR is used together with another policy domain (e.g. Content/Conduct/Legal) as the proposal’s rationale, it will negate the proposal’s success.
5. Increased dissonance between policies arising in the discussion will increase the chance that the IAR argument will be successful.
6. IAR will increase in effectiveness as the policies invoked increase in complexity.

# Wikimedia Research Newsletter, February 2013

Vol: 3 • Issue: 2 • February 2013

Wikipedia not so novel after all, except to UK university lecturers; EPOV instead of NPOV

With contributions by: Piotr Konieczny, Taha Yasseri, Heather Ford, Sage Ross, Daniel Mietchen and Tilman Bayer.

### Wikipedia in historic context: “Stigmergic accumulation” is not new

Page with the entry Encyclopédie from Diderot and D’Alembert’s Encyclopédie. The work was the result of the collaboration of more than 100 contributors.

“Wikipedia and Encyclopedic Production”[1] by Jeff Loveland (a historian of encyclopedias) and Joseph Reagle situates Wikipedia within the context of encyclopedic production historically, arguing that the features that many claim to be unique about Wikipedia actually have roots in encyclopedias of the past. Loveland and Reagle criticize characterizations of Wikipedia that they believe to be ahistorical and exaggerated, laying special blame with authors who compare Wikipedia’s anonymous production to Encyclopedia Britannica’s production by named experts, and thus ignore the rich tradition of encyclopedic production through the centuries. The authors then set about characterizing the history of encyclopedic production as composed of three overlapping forms: compulsive collection, stigmergic accumulation, and corporate production.

‘Compulsive collection’ refers to the work of compiling encyclopedias that has traditionally been done by a few dedicated, tireless, detail-oriented individuals. Loveland and Reagle point out that, although Wikipedians share this compulsive behavior with past encyclopedists, the crucial distinction is that the vast majority of past encyclopedists were motivated by money (even if this motive existed alongside more idealistic motivations), whereas Wikipedia editors are unpaid.

Loveland and Reagle use the term ‘stigmergic accumulation’ to refer to the process of production by accretion onto a previous text. Even those responsible for a singly authored encyclopedia were relying on predecessors, the authors argue, ‘building on their work and using the cumulative character of texts and knowledge as a ladder of sorts’. Examples of building on existing texts included the use of a previous edition of an encyclopedia that ran to multiple editions, and the practice of borrowing between different encyclopedias, which was sometimes illegal but more often simply viewed as ‘piratical’, i.e. morally wrong.

The category of ‘corporate production’ is used by Loveland and Reagle to describe the process of encyclopedic editing by a group – groups that topped a thousand contributors in the 20th century. Editors of early encyclopedias like Diderot and D’Alembert’s Encyclopédie in the 1700s faced the challenge of coordinating the contributions of about 140 contributors, much as Wikipedia has had to confront issues of consistency that result in debates about how important a subject must be to merit an article. In contrast to other encyclopedias, write Loveland and Reagle, Wikipedia settles these debates through community decision-making and in the open. The authors also note that previous encyclopedias didn’t always recruit on the basis of expertise, and that some recognized it would be cheaper and sometimes more accurate to have non-experts summarize the works of experts.

# A year’s worth of Wikipedia research

Twelve years after its launch, Wikipedia continues to attract a large amount of attention from scholarly research trying to understand what made this one of the most remarkable collaborative efforts in history and what makes it work. Researchers have called Wikipedia “our Everest” (because of its complexity and cultural importance) or “the Drosophila (fruit fly) of social software” (because the project’s transparency and freely available data make it accessible and popular as a research subject).

In 2011, we launched a monthly Wikimedia Research Newsletter with the aim of covering recent academic research about Wikipedia and other Wikimedia projects. Published jointly by the Wikimedia Research Committee and the Signpost (the English Wikipedia’s community-edited newspaper), it has established itself as a comprehensive outlet enabling both researchers and Wikipedians to stay on top of current research, aiming to facilitate exchange between these two communities.

Today we are announcing the release into the public domain of a curated corpus containing the bibliographic references of all 225 publications reviewed or covered in the second volume of the newsletter, forming a historical record of Wikipedia research in the year 2012. This corpus can be browsed online or downloaded, ready to be imported into reference managers or other literature collections. Papers in this dataset have been marked as either open access or closed access.

Last year, we published a similar dataset for volume 1 (2011). Together, these releases complement other efforts to catalogue the research literature on Wikipedia, in particular the WikiLit project which focuses on publications until June 2011, prior to the launch of the newsletter.

Follow @WikiResearch for fresh Wikimedia research news

A year ago we launched the @WikiResearch news feed on Twitter and Identi.ca, covering new preprints, papers or research-related blog posts, before they are reviewed more fully in the Newsletter. As of February 2013, it has gained 745 followers and continues to be actively updated.

The Newsletter is a collaborative effort and would not exist without the following 22 people who contributed reviews and summaries in 2012:

More than half of our contributors are researchers themselves, who have published about Wikipedia in peer-reviewed publications. We are also grateful for the help of several Signpost collaborators in copyediting and preparing the final publication every month.

Consider contributing by pointing us to new research we should cover, or by volunteering to review new publications.

The editors of the Wikimedia Research Newsletter:

Tilman Bayer, Senior Operations Analyst
Dario Taraborelli, Senior Research Analyst

# Suggesting tasks for new Wikipedians

If you had just signed up to become a Wikipedia contributor, what kind of experience would you like to have? Would you know exactly where to get started, or would you prefer some suggestions?

For most of Wikipedia’s 12-year history, we have done very little to proactively introduce new participants to tasks that are interesting and easy. Right after account creation, for instance, we merely suggest that you check out your preferences. If you look around, you can find guides like Wikipedia:Tutorial. Most of this documentation is focused on the rules and mechanics of how to contribute, rather than suggesting real tasks to try immediately.

Naturally, the kind of people who have tended to thrive in this environment already know what they want to contribute, or are deeply motivated to go and find it. Unless you’ve spotted an error or a missing piece of information, there is little pointing you in the right direction. That lack of direction is a big part of why only about a quarter of all newly-registered accounts complete an edit.

This phenomenon is far from unique to the site, and in fact it would be surprising to hear of any site where 100% of signups become devoted content contributors. However, when considering the enormous workload we face, the sheer waste of human capital is staggering. In English Wikipedia alone, there are…

• more than 200,000 “citation needed” tags
• 3,000 articles that need basic copyediting
• over 14,000 pages that need more wiki links

The list goes on, and these are just the items that have been explicitly added to the backlog. Wikipedia is in fact bursting at the seams with small problems that need fixing.

So how do we match the thousands of people who sign up every day, eager and willing to help, with tasks that are easy to do? That’s the question we’re attempting to solve with our work onboarding new Wikipedians, at the Wikimedia Foundation’s Editor Engagement Experiments team.

# Vote for the most exciting paper from nine years of research about Wikipedia

(This is a guest post by Carol Ann O’Hare of Wikimedia France.)

The impact of collaborative writing on the quality of Wikipedia content, new methods for monitoring contributions in order to fight vandalism, how the nature and quality of content depends on contributors’ status and the area covered, etc. These topics concern the Wikimedians who write and use Wikipedia… but also more and more researchers!

By launching an international award for research on Wikimedia projects and free knowledge, Wikimédia France wants to highlight these research works, encourage them, and especially make them understandable and accessible to the Wikimedia community.

Starting in July, the first step was to ask the community of researchers who study Wikimedia projects to nominate the scientific papers they consider the most influential and important from the years 2003 to 2011. We collected more than 30 proposals, each satisfying the selection criteria: available under open access and published in a peer-reviewed publication. Thanks to a jury composed of researchers working on these topics, we were able to select five finalist papers among these. You can find summaries and full texts linked below:

To decide the winner, Wikimédia France wishes to encourage all Wikimedians to give their opinion and vote for the paper that seems the most stimulating and relevant.

Voting will close on Monday, March 11. The announcement of the winning paper is scheduled for the end of March. The authors will receive a grant of €2,500, which they can allocate freely, provided it is dedicated to supporting open knowledge research.

Carol Ann O’Hare
Wikimedia France

# Wikimedia Research Newsletter, January 2013

Vol: 3 • Issue: 1 • January 2013

Lessons from the research literature on open collaboration; clicks on featured articles; credibility heuristics

With contributions by: Taha Yasseri, Piotrus, Aaron Shaw, Tbayer and Lui8E

### Lessons from the wiki research literature in “American Behavioral Scientist” special issue

A special issue of the American Behavioral Scientist is devoted to “open collaboration”.

• Consistent patterns found in Wikipedia and other open collaborations: In the introductory piece[1], researchers Andrea Forte and Cliff Lampe give an overview of this field, defined as the study of “distributed, collaborative efforts made possible because of changes in information and communication technology that facilitate cooperative activities” – with open source projects and Wikipedia among the most prominent examples. They point out that “[b]y now, thousands of scholars have written about open collaboration systems, many hundreds of thousands of people have participated in them, and millions of people use products of open collaboration every day.” Among their “lessons from the literature”, they name three “consistent patterns” found by researchers of open collaborations:
• “Participation Is Unequal” (meaning that some participants contribute vastly more than others: “In Wikipedia, for example, it has long been shown that a few editors provide the bulk of contributions to the site.”)
• “There Are Special Requirements for Socializing New Users”
• “Users Are Massively Heterogeneous in Both How and Why They Participate”
• “Ignore All Rules” as “tension release mechanism”: The abstract of paper titled “Rules and Roles vs. Consensus: Self-Governed Deliberative Mass Collaboration Bureaucracies” [2] explains “Wikipedia’s unusual policy, ignore all rules (IAR)” as a “tension release mechanism” that is “reconciling the tension between individual agency and collective goals” by “[supporting] individual agency when positions taken by participants might conflict with those reflected in established rules. Hypotheses are tested with Wikipedia data regarding individual agency, bureaucratic processes, and IAR invocation during the content exclusion process. Findings indicate that in Wikipedia each utterance matters in deliberations, rules matter in deliberations, and IAR citation magnifies individual influence but also reinforces bureaucracy.”
• Collaboration on articles about breaking news matures more quickly: “Hot Off the Wiki: Structures and Dynamics of Wikipedia’s Coverage of Breaking News Events”[3] analyzes “Wikipedia articles about over 3,000 breaking news events, [investigating] the structure of interactions between editors and articles”, finding that “breaking articles emerge into well-connected collaborations more rapidly than nonbreaking articles, suggesting early contributors play a crucial role in supporting these high-tempo collaborations.” (see also our earlier review of a similarly-themed paper by the same team: “High-tempo contributions: Who edits breaking news articles?“)

A fourth paper in this special issue, titled “The Rise and Decline of an Open Collaboration System: How Wikipedia’s Reaction to Popularity Is Causing Its Decline”, found considerable media attention this month, starting with an article in USA Today. It was already reviewed in the September issue of this research report.

# Wikimedia Research Newsletter, December 2012

Vol: 2 • Issue: 12 • December 2012

Wikipedia and Sandy Hook; SOPA blackout reexamined

With contributions by: Daniel Mietchen, Piotrus, Junkie.dolphin, Taha Yasseri, Benjamin Mako Hill, Aaron Shaw, Tbayer, DarTar and Ragesoss

### How Wikipedia deals with a mass shooting

Northeastern University researcher Brian Keegan analyzed the gathering of hundreds of Wikipedians to cover the Sandy Hook Elementary School shooting in the immediate aftermath of the tragedy. The findings are reported in a detailed blog post that was later republished by the Nieman Journalism Lab.[1] Keegan observes that the Sandy Hook shooting article reached a length of 50Kb within 24 hours of its creation, making it the fastest-growing article by length in its first day among recent English Wikipedia articles covering mass shootings. The analysis compares the Sandy Hook page with six similar articles from a list of 43 articles on shooting sprees in the US since 2007. Of particular interest among the analyses described in the study are the dynamics of dedicated vs. occasional contributors as the article reaches maturity: while in the first few hours contributions are evenly distributed, with a majority of single-edit editors, after hour 3 or 4 a number of dedicated editors show up and “begin to take a vested interest in the article, which is manifest in the rapid centralization of the article”. A plot of inter-edit time also shows the sustained frequency of revisions that these articles display days after their creation, with Sandy Hook averaging about one edit per minute around 24 hours after its first revision. The notebook and social network data produced by the author for the analysis are available on his website. The Nieman Journalism Lab had previously covered, in a 2010 interview with Andrew Lih, the role that Wikipedia plays as a platform for collaborative journalism and why its format outperforms Wikinews.[2] The early revision history of the Sandy Hook shooting article was also covered in a blog post by Oxford Internet Institute fellow Taha Yasseri, though with a focus on the coverage in different Wikipedia language editions.[3]
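The inter-edit time behind Keegan's frequency plot is straightforward to compute from revision timestamps. A minimal sketch, using made-up timestamps in the ISO format the MediaWiki API returns (a real analysis would pull thousands of revisions):

```python
from datetime import datetime

# Hypothetical revision timestamps in MediaWiki's ISO format.
timestamps = [
    "2012-12-14T18:00:00Z", "2012-12-14T18:01:30Z",
    "2012-12-14T18:02:00Z", "2012-12-14T18:05:00Z",
]

def inter_edit_seconds(ts):
    """Seconds elapsed between consecutive revisions."""
    parsed = [datetime.strptime(t, "%Y-%m-%dT%H:%M:%SZ") for t in ts]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]

gaps = inter_edit_seconds(timestamps)
print(gaps)                       # → [90.0, 30.0, 180.0]
mean_gap = sum(gaps) / len(gaps)  # 100.0 seconds between edits
```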

### Network positions and contributions to online public goods: the case of the Chinese Wikipedia

A graph with nodes color-coded by betweenness centrality (from red=0 to blue=max).

In a forthcoming paper in the Journal of Management Information Systems (presented earlier at HICSS ’12[4]), Xiaoquan (Michael) Zhang and Chong (Alex) Wang use a natural experiment to demonstrate that changes to the position of individuals within the editor network of a wiki modify their editing behavior. The data for this study came from the Chinese Wikipedia. In October 2005, the Chinese government suddenly blocked access to the Chinese Wikipedia from mainland China, creating an unanticipated decline in the editor population. As a result, the remaining editors found themselves in a new network structure and, the authors claim, any changes in editor behavior that ensued are likely effects of this discontinuous “shock” to the network. (more…)
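Betweenness centrality, the network-position measure shown in the figure, quantifies how often a node lies on shortest paths between other nodes. The sketch below is a generic implementation of Brandes' algorithm for unweighted graphs on a toy example, not the authors' code:

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness centrality for an unweighted,
    undirected graph (Brandes' algorithm). `adj` maps each node
    to a list of its neighbours."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = dict.fromkeys(adj, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of non-increasing distance.
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair is counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

# In a path a–b–c, only b sits on a shortest path between the others.
print(betweenness({"a": ["b"], "b": ["a", "c"], "c": ["b"]}))
# → {'a': 0.0, 'b': 1.0, 'c': 0.0}
```

When a block removes many editors from the network, the remaining editors' centrality values shift, which is the kind of positional change the study exploits.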

# Wikimedia Research Newsletter, November 2012

Vol: 2 • Issue: 11 • November 2012

Movie success predictions, readability, credentials and authority, geographical comparisons

With contributions by: Piotrus, Benjamin Mako Hill, Tbayer, DarTar, Adler.fa, Hfordsa, Drdee

### Early prediction of movie box-office revenues with Wikipedia data

An open-access preprint[1] has announced the results from a study attempting to predict early box-office revenues from Wikipedia traffic and activity data. The authors – a team of computational social scientists from Budapest University of Technology and Economics, Aalto University and the Central European University – submit that behavioral patterns on Wikipedia can be used for accurate forecasting, matching and in some cases outperforming the use of social media data for predictive modeling. The results, based on a corpus of 312 English Wikipedia articles on movies released in 2010, indicate that the joint editing activity and traffic measures on Wikipedia are strong predictors of box-office revenue for highly successful movies.

The authors contrast their early prediction approach with more popular real-time prediction/monitoring methods, and suggest that movie popularity can be accurately predicted well in advance, up to a month before the release. The study received broad press coverage and was featured in The Guardian, the MIT Technology Review and the Hollywood Reporter among others. The authors observe that their approach, being “free of any language based analysis, e.g., sentiment analysis, could be easily generalized to non-English speaking movie markets or even other kinds of products”. The dataset used for this study, including the financial and Wikipedia activity data is available among the supplementary materials of the paper.
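The paper's models are not reproduced here, but the basic idea of relating box-office revenue to pre-release activity can be sketched as a one-variable least-squares regression; both the feature choice and the numbers below are invented for illustration.

```python
# Hypothetical pre-release Wikipedia page views (thousands per week)
# paired with opening box-office revenue (millions of dollars).
views = [120, 300, 50, 800, 210]
revenue = [15, 40, 6, 110, 28]

def fit_ols(x, y):
    """Least-squares fit of y ≈ a + b*x for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

a, b = fit_ols(views, revenue)

def predict(v):
    """Projected revenue (in $M) for v thousand weekly views."""
    return a + b * v

print(round(predict(400), 1))  # projected revenue at 400k weekly views
```

The actual study combines several activity measures (edits, editors, traffic) rather than a single predictor, but the fitting principle is the same.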

### Readability of the English Wikipedia, Simple Wikipedia, and Britannica compared

$4.71 \left(\frac{\mbox{characters}}{\mbox{words}}\right) + 0.5 \left(\frac{\mbox{words}}{\mbox{sentences}}\right) - 21.43$

The automated readability index, one of the readability metrics used in the study[2]
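The formula translates directly into code. The sketch below counts characters, words and sentences naively (splitting sentences on terminal punctuation), which is a simplification of the tokenization the study would use:

```python
import re

def automated_readability_index(text):
    """Automated readability index; higher scores mean harder text.
    Characters are counted as letters/digits and sentences are split
    naively on terminal punctuation."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z0-9]+", text)
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43

# Very simple prose scores low (even negative); dense prose scores high.
print(round(automated_readability_index("The cat sat. The dog ran."), 2))
```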

A study[2] by researchers at Kyoto University presents a detailed assessment of the readability of the English Wikipedia against Encyclopedia Britannica and the Simple English Wikipedia using a series of readability metrics and finds that Wikipedia “seems to lag behind the other encyclopedias in terms of readability and comprehensibility of its content”. (more…)