Wikimedia Research Newsletter

Vol: 4 • Issue: 9 • September 2014

99.25% of Wikipedia birthdates accurate; focused Wikipedians live longer; merging WordNet, Wikipedia and Wiktionary

With contributions by: Scott Hale, Piotr Konieczny, Maximilian Klein, Andrew Krizhanovsky, Tilman Bayer and Pine

“Reliability of user-generated data: the case of biographical data in Wikipedia”

“Third Volume of a 1727 edition of Plutarch’s Lives of the Noble Greeks and Romans printed by Jacob Tonson”; caption quoted from the Wikipedia article Biography

Review by User:Maximilianklein

0.75% of Wikipedia birthdates are inaccurate, reported Robert Viseur at WikiSym 2014.[1] Those inaccuracies are “low, although higher than the 0.21% observed for the baseline reference sources”. Given that biographies represent 15% of English Wikipedia,[supp 1] the third largest category after “arts” and “culture”, their accuracy is important. The method was to find biographies that appeared both in Wikipedia and in 9 reference databases, which are sadly not named due to the wishes of an “anonymous sponsor” of the paper (Red flag or Belgian bureaucracy?). Of the 938 such articles found, those whose birthdates did not match across all 10 databases – 14.4% – were manually investigated. Some errors were due to coincidental names, which illustrates the value of authority control in collecting data. One capping anecdote is that most of the mistakes behind Wikipedia’s 0.75% had been corrected in the time between data collection and manual investigation. However, one may need to account for sample bias: these were biographies that existed in 10 separate databases, that is, those of well-known personalities. Therefore the predictive power of the study remains limited, but at least we know that some objective data on Wikipedia has the same order of magnitude error rate as other “reliable sources”.
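
To make the percentages above more tangible, here is a minimal Python sketch that converts them into approximate article counts. It assumes that both the 14.4% and the 0.75% figures are relative to the 938 sampled biographies, which the review does not state explicitly; the helper name `count_from_rate` is made up for this illustration.

```python
def count_from_rate(sample_size: int, rate_percent: float) -> int:
    """Approximate number of items that a percentage of a sample corresponds to."""
    return round(sample_size * rate_percent / 100)

# Figures quoted in the review (assumed to share the same denominator).
SAMPLE_SIZE = 938             # biographies found in Wikipedia and all 9 reference databases
MISMATCH_RATE = 14.4          # percent whose birthdates did not match across all sources
WIKIPEDIA_ERROR_RATE = 0.75   # percent of Wikipedia birthdates found to be inaccurate
BASELINE_ERROR_RATE = 0.21    # percent observed for the baseline reference sources

manually_checked = count_from_rate(SAMPLE_SIZE, MISMATCH_RATE)
wikipedia_errors = count_from_rate(SAMPLE_SIZE, WIKIPEDIA_ERROR_RATE)
ratio = WIKIPEDIA_ERROR_RATE / BASELINE_ERROR_RATE

print(f"manually investigated: about {manually_checked} articles")
print(f"Wikipedia birthdate errors: about {wikipedia_errors} articles")
print(f"Wikipedia error rate is about {ratio:.1f} times the baseline rate")
```

Under that assumption, the headline percentage rests on only a handful of erroneous articles, and the ratio between the two error rates stays below ten, which is what “same order of magnitude” means here.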

Focused Wikipedians stay active longer

Group photo of Wikimedians at Wikimania 2012

A new preprint[2] by three Dublin-based computer scientists contributes to the debate around editor retention. The authors use techniques such as topic modeling and non-negative matrix factorization to categorize Wikipedians into several profiles (“e.g. content experts, social networkers”). Those profiles, or user roles, are based on the namespaces that editors are most active in. Analyzing the behavior of about half a million Wikipedia editors, the authors find that short-term editors seem to lack interest in any one particular aspect of Wikipedia, editing various namespaces briefly before leaving the project. Long-term editors are more likely to focus on one or two namespaces (usually mainspace, plus article talk or user talk pages), and diversify to other namespaces only after some time; in other words, the namespace distribution of edits over time “predicts an editor’s departure from the community”. The authors note that “we show that understanding patterns of change in user behavior can be of practical importance for community management and maintenance”.

Unfortunately, the paper is heavy on jargon and statistical models, and provides little practical data (or at least, that data is not presented well). For example, the categorization of editors into seven groups is very interesting, but no descriptive data is presented that would allow us to compare the number of editors in each group. Further, the paper promises to use those profiles to predict editor lifecycles, but such models don’t seem to be present in the paper. In the end, this reviewer finds this paper to be an interesting idea that hopefully will develop into some research with meaningful findings – for now, however, it seems more of a theoretical analysis with no practical applications.
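
The idea that an editor’s namespace distribution signals how focused they are can be illustrated with a small Python sketch. This is not the authors’ method (they use topic modeling and non-negative matrix factorization); it substitutes Shannon entropy as a simple, assumed stand-in for “focus”, and the function names and the toy edit lists are hypothetical.

```python
import math
from collections import Counter

def namespace_distribution(edit_namespaces):
    """Share of an editor's edits that fall in each namespace."""
    if not edit_namespaces:
        raise ValueError("editor has no edits")
    counts = Counter(edit_namespaces)
    total = sum(counts.values())
    return {ns: n / total for ns, n in counts.items()}

def focus_entropy(edit_namespaces):
    """Shannon entropy in bits of the namespace distribution.

    Lower values mean edits are concentrated in fewer namespaces;
    0 means every edit went to a single namespace.
    """
    shares = namespace_distribution(edit_namespaces).values()
    return -sum(p * math.log2(p) for p in shares)

# Toy data: one namespace label per edit (hypothetical editors).
focused = ["main"] * 8 + ["talk"] * 2
scattered = ["main", "talk", "user", "user talk",
             "file", "template", "help", "category"]

print(namespace_distribution(focused))
print(focus_entropy(focused) < focus_entropy(scattered))
```

In this toy example the editor who works mostly in mainspace and article talk gets a lower entropy than the one who touches eight namespaces once each. Tracking such a number over time would be one simple way to look for the pattern the paper describes.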

“WordNet-Wikipedia-Wiktionary: construction of a three-way alignment”

A Wiktionary logo

Reviewed by Andrew Krizhanovsky

The authors of this paper,[3] presented at the International Conference on Language Resources and Evaluation (LREC 2014), integrated two previously constructed alignments, WordNet-Wikipedia and WordNet-Wiktionary, into a three-way alignment WordNet-Wikipedia-Wiktionary. This integration results in lower accuracy but greater coverage than the two-way alignments.

Wiktionary does not provide a convenient and consistent means of directly addressing individual lexical items or their associated senses. Third-party tools such as the JWKTL (Java-based Wiktionary Library) API can overcome this problem.

Since the WordNet–Wikipedia alignment is for nouns only, the resulting synonym sets in the conjoint three-way alignment consist entirely of nouns. However, the full three-way alignment contains all parts of speech (adjectives, nouns, adverbs, verbs, etc.).

Larger synonym sets in the source data (WordNet and Wiktionary) result in more incorrect mappings in the resulting alignment (this is strange from the average person’s point of view and shows that the alignment algorithm is not perfect yet).

Informal examination shows that the conjoint alignment is generally correct, but that existing errors in the source alignments are magnified (a snowball effect).
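
The difference between the full and the conjoint alignment can be sketched in a few lines of Python. This is only an illustration under the assumption that the two source alignments are joined on their shared WordNet synsets; the paper’s actual procedure may differ, and the identifiers and the function name `three_way_alignment` are invented for the example.

```python
def three_way_alignment(wn_to_wikipedia, wn_to_wiktionary):
    """Join two WordNet-keyed alignments on their WordNet synset IDs.

    Returns (full, conjoint):
      full     - every synset aligned to at least one other resource
      conjoint - only synsets aligned to both Wikipedia and Wiktionary
    """
    full = {}
    for synset in sorted(set(wn_to_wikipedia) | set(wn_to_wiktionary)):
        full[synset] = (wn_to_wikipedia.get(synset),
                        wn_to_wiktionary.get(synset))
    conjoint = {s: pair for s, pair in full.items() if None not in pair}
    return full, conjoint

# Hypothetical toy alignments; the WordNet-Wikipedia one contains nouns only.
wn_to_wikipedia = {"dog.n.01": "Dog", "bank.n.01": "Bank"}
wn_to_wiktionary = {"dog.n.01": "dog (noun, sense 1)",
                    "run.v.01": "run (verb, sense 1)"}

full, conjoint = three_way_alignment(wn_to_wikipedia, wn_to_wiktionary)
print(sorted(full))      # nouns and verbs
print(sorted(conjoint))  # nouns only
```

Because a conjoint entry needs a correct link in both source alignments, a wrong link in either one yields a wrong three-way entry. That is one way to picture the snowball effect mentioned above.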


Measures of edit quality

A work-in-progress paper[4] reviews measures of edit quality on Wikipedia and reports the results of a pilot project that compares the “Persistent Word Revisions” (PWR)[supp 2] metric of edit quality with ratings by Amazon Mechanical Turk workers. PWR measures how much of an edit is preserved through subsequent revisions to the article. The paper only evaluates “a small pool of 63 total [Mechanical Turk] ratings of 10 [article] revisions” and therefore has no significant results. Nonetheless, the validation on a much larger set of edits promised in the paper should be useful to future researchers. It will also be useful to know how the distribution of PWR scores compares with other measures of article quality, such as the quality assessments given by WikiProjects or nominations for Good Article or Featured Article status. A comparison with Adler et al.’s WikiTrust scores could also be valuable.
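
As a rough illustration of the idea behind PWR, the following Python sketch counts, for each later revision, how many of the words added by an edit are still present. It is a deliberate simplification and not the metric as defined in the cited work: it compares bags of words rather than tracking word positions through diffs, and it treats a word that disappears as gone for good. The function names are hypothetical.

```python
from collections import Counter

def added_words(parent: str, revision: str) -> Counter:
    """Words (as a multiset) that appear in `revision` but not in `parent`."""
    return Counter(revision.split()) - Counter(parent.split())

def persistent_word_revisions(parent, revision, later_revisions):
    """Sum over later revisions of the number of added words still present."""
    remaining = added_words(parent, revision)
    total = 0
    for text in later_revisions:
        # Keep only the added words that survive in this later revision.
        remaining = remaining & Counter(text.split())
        total += sum(remaining.values())
    return total

parent = "the cat sat"
revision = "the black cat sat quietly"   # the edit adds "black" and "quietly"
later = ["the black cat sat", "a black cat slept", "dogs bark"]

print(persistent_word_revisions(parent, revision, later))  # 2
```

Here “quietly” is removed immediately and “black” survives two later revisions, so the edit scores 2. A real implementation would need proper diffing and would have to handle reverts.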

“A Wiki Framework for the Sweble Engine”

This master’s thesis[5] builds on previous work by Professor Dirk Riehle’s research group at the University of Erlangen-Nuremberg, which had constructed a formal parser for MediaWiki wikitext. The thesis adds a web application that allows editing wikis based on this parser.

How quickly are drug articles updated after FDA warnings?

A short article[6] in the New England Journal of Medicine examined how quickly safety warnings by the US Food and Drug Administration (FDA) for 22 prescription drugs were incorporated into the corresponding Wikipedia articles. The authors “found that 41% of Wikipedia pages pertaining to the drugs with new safety warnings were updated within 2 weeks … The Wikipedia pages for drugs that were intended for treatment of highly prevalent diseases (affecting more than 1 million people in the United States) were more likely to be updated quickly (58% were updated within 2 weeks) than were those for drugs designed to treat less-prevalent conditions (20% were updated within 2 weeks …).” See also the discussion at WikiProject Medicine.

“Spiral of silence” in German Wikipedia’s image filter discussions

A paper titled “The Dispute over Filtering ‘indecent’ Images in Wikipedia”[7] examines disputes in 2010 and 2011 about controversial content on Wikipedia, and about the Wikimedia Foundation’s proposal for an opt-in image filter which would have allowed users to hide sexual or violent media for themselves (see the Signpost summary by this reviewer). The author finds that several of German sociologist Jürgen Habermas’ criteria for public discourse apply to the lengthy discussions on the German Wikipedia about this topic (highlighting one talk page with 120 major threads that fill 175 pages in a PDF). “However, [Habermas’] criteria of rationality and objectivity seem to be less applicable. Compared to other areas of dispute in Wikipedia, the German discussions were civilized – but emotional.” The paper invokes the “spiral of silence” theory of public opinion to explain the German Wikipedia’s huge opposition to the Wikimedia Foundation’s plans: “the climate of opinion in the online discussions put supporters of the image filter under heavy pressure to conform or to be silent”. Finally, the paper reports on the results of a small web-based experiment where 163 participants were randomly shown one of three versions of the article de:Furunkel (boil): either without images, or with a “neutral image”, or “with a somewhat disgusting image of an infected boil.” The author states that “The most interesting results for the Wikipedia community is that the disgusting image enhances the perceived quality of the article: It is perceived to be more fascinating (p=.023) and more worth reading (p=.032) than an article without any image.”

Other recent publications

A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.


  1. Viseur, Robert (2014). “Reliability of User-Generated Data: the Case of Biographical Data in Wikipedia”. WikiSym 2014. Retrieved 24 September 2014.
  2. Qin, Xiangju (29 July 2014). “A latent space analysis of editor lifecycles in Wikipedia”. Proc. of 5th International Workshop on Mining Ubiquitous and Social Environments (MUSE) at ECML/PKDD 2014.
  3. Miller, Tristan; Iryna Gurevych (May 2014). “WordNet-Wikipedia-Wiktionary: construction of a three-way alignment”. Proceedings of the 9th International Conference on Language Resources and Evaluation.
  4. Biancani, Susan (2014). “Measuring the Quality of Edits to Wikipedia”. WikiSym 2014. Retrieved 24 September 2014.
  5. Wang, Liping (2014). “A Wiki Framework for the Sweble Engine”. Master’s thesis, Friedrich-Alexander University Erlangen-Nürnberg.
  6. Hwang, Thomas J. (2014). “Drug Safety in the Digital Age”. New England Journal of Medicine 370 (26): 2460–2462. doi:10.1056/NEJMp1401767. ISSN 0028-4793. PMID 24963564.
  7. Roessing, Thomas. “The Dispute over Filtering ‘indecent’ Images in Wikipedia”. Masaryk University Journal of Law and Technology, issue 2/2013.
  8. Britt, Brian C. (January 2014). “Evolution and revolution of organizational configurations on wikipedia: A longitudinal network analysis”. Purdue University. Closed access
  9. Rijt, Arnout van de (28 April 2014). “Field experiments of success-breeds-success dynamics”. Proceedings of the National Academy of Sciences: 201316836. doi:10.1073/pnas.1316836111. ISSN 0027-8424. PMID 24778230.
  10. Lee, Kyungho (2014). “How collective intelligence emerges: knowledge creation process in Wikipedia from microscopic viewpoint”. Proceedings of the 2014 International Working Conference on Advanced Visual Interfaces. AVI ’14. New York, NY, USA: ACM. pp. 373–374. doi:10.1145/2598153.2600040. ISBN 978-1-4503-2775-6. Closed access
  11. Temple, Norman J. (2014). “How accurate are Wikipedia articles in health, nutrition, and medicine? / Les articles de Wikipédia dans les domaines de la santé, de la nutrition et de la médecine sont-ils exacts ?”. Canadian Journal of Information and Library Science 38 (1): 37–52. ISSN 1920-7239. Closed access
  12. Robert, Joanne. “Community and the dynamics of spatially distributed knowledge production. The case of Wikipedia”. In: The social dynamics of innovation networks, edited by Roel Rutten, Paul Benneworth, Dessy Irawati, Frans Boekema, p. 179ff.
  13. DeDeo, Simon (8 July 2014). “Group minds and the case of Wikipedia”.
  14. Mesgari, Mostafa; Okoli, Chitu; Mehdi, Mohamad; Nielsen, Finn Årup; Lanamäki, Arto (2014). “‘The sum of all human knowledge’: A systematic review of scholarly research on the content of Wikipedia”. Journal of the Association for Information Science and Technology. ISSN 2330-1635. (In press)
Supplementary references and notes:
  1. “What’s in Wikipedia?”
  2. Halfaker, A., Kittur, A., Kraut, R., & Riedl, J. (2009). “A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia”. WikiSym ’09. Retrieved 24 September 2014.

This newsletter is brought to you by the Wikimedia Research Committee and The Signpost