Wikimedia Research Newsletter


Vol: 6 • Issue: 05 • May 2016

English as Wikipedia’s lingua franca; deletion rationales; schizophrenia controversies

With contributions by: Morten Warncke-Wang, Piotr Konieczny, Federico Leva, Steve Jankowski and Tilman Bayer

English still the lingua franca of Wikipedia

Reviewed by Morten Warncke-Wang

Primary and non-primary editors show different engagement levels (figure 4 from the paper)

Many of the more active Wikipedia contributors are multilingual. In the April 2011 Wikipedia Editors Survey,[supp 1] 72% of respondents said they read Wikipedia content in more than one language, and 51% said they contributed to multiple Wikipedias. Research has estimated that approximately 15% of active Wikipedians are multilingual.[supp 2] These contributors are important as they can enable knowledge transfer between different language editions of Wikipedia, yet little is known about who they are and what they do.

A recent paper published in PLOS ONE by researchers at KAIST and OII, titled “Understanding Editing Behaviors in Multilingual Wikipedia”[1], adds to our knowledge of multilingual contributors by investigating their engagement level, topic interests, and language proficiency. The paper uses a dataset spanning a month of Wikipedia contributions in July–August 2013 and defines a multilingual editor as one who makes contributions to multiple language editions. Overall the dataset contains 12,577 multilingual editors, of which 77.3% are bilingual, 11.4% trilingual, and 4.1% quadrilingual.

Out of Wikipedia’s (now) 288 language editions, the paper focuses on three: English, German, and Spanish. These three languages were chosen because the paper utilizes natural language processing to estimate language proficiency, and the tools available in those languages are sufficiently developed. The multilingual editors are divided into two groups: primary editors, consisting of the contributors who make most of their edits to a certain language edition, and non-primary editors. These two groups are then compared in terms of their engagement, topic interests, and language proficiency.

To measure editor engagement, consecutive edits by the same editor to the same article are collapsed into edit sessions.[supp 3] T-tests are used to compare primary and non-primary editors on several measures: number of edits per session, session length, amount of content added (number of characters or tokens such as words), and whether non-visible changes are made. The results show that primary editors are more engaged as they commit more edits, have longer sessions, add more content, and are more likely to make visible edits compared to non-primary editors.
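The collapsing step described above can be sketched in a few lines. This is only an illustration, not the paper's implementation: the `(article, timestamp, editor)` tuple layout, the function name `collapse_sessions`, and the toy data are all assumptions made for this example, and integer timestamps stand in for real ones.

```python
from itertools import groupby
from operator import itemgetter


def collapse_sessions(edits):
    """Collapse consecutive edits by the same editor to the same article.

    `edits` is an iterable of (article, timestamp, editor) tuples; this
    layout is a hypothetical one chosen for the sketch.
    """
    sessions = []
    # Order edits by article, then by time, so runs are contiguous.
    ordered = sorted(edits, key=itemgetter(0, 1))
    for article, article_edits in groupby(ordered, key=itemgetter(0)):
        # Within one article history, each run of the same editor is a session.
        for editor, run in groupby(article_edits, key=itemgetter(2)):
            run = list(run)
            sessions.append({
                "article": article,
                "editor": editor,
                "n_edits": len(run),
                "start": run[0][1],
                "end": run[-1][1],
            })
    return sessions


# Toy data: alice edits "Berlin" twice, is interrupted by bob, then returns.
edits = [
    ("Berlin", 1, "alice"),
    ("Berlin", 2, "alice"),
    ("Berlin", 3, "bob"),
    ("Berlin", 4, "alice"),
    ("Madrid", 1, "bob"),
    ("Madrid", 5, "bob"),
]
sessions = collapse_sessions(edits)
for s in sessions:
    print(s["article"], s["editor"], s["n_edits"], s["end"] - s["start"])
```

Per-session measures such as the number of edits or the session length (`end - start`) could then be compared between the two editor groups, for example with a t-test as in the paper.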

Editor interests are identified using a combination of LDA and DBSCAN to create a set of 20 topic clusters for each language. These topic clusters are then labelled by humans, resulting in cluster labels such as “Science” and “Global Sports”. Primary and non-primary editors are found to be generally interested in the same topics, but some significant differences show up. For instance, non-primary editors are contributing more to articles about cities in English, soccer in German, and plants in Spanish. Primary editors are, on the other hand, more interested in, for example, computers in English and German, and politicians and entertainment in Spanish.

Lastly, the paper studies the language complexity of contributions by primary and non-primary editors. Several measures of language complexity from the literature are used, for example entropy of parts-of-speech unigrams, bigrams, and trigrams, as well as whether articles (in English: the, a, an) are used correctly. Because different topics use language differently – for instance fact-oriented topics such as sports show lower language complexity compared to more conceptual topics such as history – both intra-topic and inter-topic complexity are controlled for. Primary editors are found to use more diverse terms and edit more complex parts of articles compared to non-primary editors across all three languages. However, English differs from German and Spanish when it comes to linguistic proficiency of the edits made. In German and Spanish, primary editors display higher linguistic proficiency compared to the non-primary editors, whereas in English there is no noticeable difference.
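To make the entropy measure concrete, here is a minimal sketch of Shannon entropy over part-of-speech n-grams. It assumes the text has already been tagged; the tag names and the function `ngram_entropy` are hypothetical, and the paper's exact formulation may differ.

```python
import math
from collections import Counter


def ngram_entropy(tags, n):
    """Shannon entropy (in bits) of the n-gram distribution of a tag sequence."""
    ngrams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    total = len(ngrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hand-assigned tags for a sentence like "The dog chased the cat".
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
for n in (1, 2, 3):
    print(n, round(ngram_entropy(tags, n), 3))
```

A higher value means the n-grams are spread more evenly over many patterns, which is read here as more varied syntax. A sequence that repeats one pattern scores zero.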

Taken together, the results indicate how language continues to be a barrier to entry, seeing how non-primary editors are less engaged and make less complex contributions. The findings also point to how English continues to be a hub language in Wikipedia: It has the lowest proportion of primary editors with 32.9%, compared to German’s 49.9%. (In this context, the authors mention a 2012 WikiSym paper[supp 4], co-authored by this reviewer, which found that English was by far the most-used language to translate from – as measured by translation template usage – and discussed how English Wikipedia thereby could be used as a hub.) At the same time, multilingual Wikipedians are important in helping move content across languages, as exemplified by the Wikimedia Foundation’s development of a tool to recommend articles for translation.[supp 5] As mentioned in the paper’s conclusion, when it comes to multilingual Wikipedians there are still many questions left, although this paper makes significant contributions by answering some of them.

A new algorithmic tool for analyzing rationales on articles for deletion

Reviewed by Steve Jankowski

This article[2] is a report on one component of a longitudinal study of how “rationales” are utilized by Wikipedians in articles for deletion (AfD) discussions to direct collaboration. In order to arrive at conclusions about the role of rationales in decision-making processes, the author has approached the research object from a number of angles. Previously the researcher had conducted an exploratory content analysis of rationales. This research was subsequently followed by interviews of Wikipedians. The current research describes the process of developing an algorithmic tool that will be able to analyze large data sets for “directive rationales”. The author admits that AfD discussions are predisposed to this kind of analysis due to the predictable order of comments that describe an action and a rationale for the action. Decision-making of this sort substantially differs from the style of discussion on the rest of Wikipedia’s talk pages. Regardless of this limitation, the author concludes that further research into rationales will provide insights into how they function to connect policies with practices. Given the breadth of research methods of the project, it will be interesting to see what conclusions the author comes to when the project concludes.

Controversy goes online: schizophrenia genetics on Wikipedia

Reviewed by Piotr Konieczny

This paper[3] addresses the area of scientific knowledge creation online, as well as the notion of controversy, by examining the editing history and discussion about English Wikipedia pages on schizophrenia and its subpage, causes of schizophrenia. The specific controversy the authors focused on is that of the genetic basis for schizophrenia (a topic which the authors note is still debated by scholars and on which there is no consensus). The authors commend the neutrality of the lead of the Wikipedia article (“The causes of schizophrenia have been the subject of much debate, with various factors proposed and discounted or modified…”) and ask “How are such statements constructed, or in other words, what is the work which goes into making these claims?” The authors used a dataset from August 2006 to October 2011 (20,000 words of talk text and 13,000 words of article text) to investigate how this topic is presented and contested in Wikipedia.

The authors make a number of interesting observations. They observe that editors are not equal, and in addition to the usual admin>user>anon>bot hierarchy, they note that “‘who you are’ is important when it comes to editing the schizophrenia article…”. Many editors self-identified as living with schizophrenia or as medical experts. The talk pages are policed to keep the discussion focused on the article’s contents, and anecdotes and personal experience stories are discouraged, or even removed from the pages. WP:V and WP:OR are certainly enforced as well, and Wikipedians will be pleased to note their observation that “Priority is always given to the published scientific literature.” However, there are also a number of problems. Not all contributors have access to paywalled, quality content, and some seemingly rely only on article abstracts.

Some low-quality references slip through the net, and standards are not enforced consistently (“Attention to the reference list in the schizophrenia article at the time of our study revealed numerous citations that were not reviews”, but original research academic papers about “breakthroughs” – this is mentioned in the context of a talk page argument that “such papers should be avoided until their findings are confirmed”). The authors also note that they found at least “one reference to another Wikipedia article and also to a schizophrenia forum discussion”. The article’s structure is a result of years of minor edits with little attention to the big picture, resulting in an occasionally illogical and incoherent layout with some contradictions or clearly obsolete but not updated sections, which leads the authors to summarize the state of the article as “a rather ad hoc assemblage of resources” and “a chronological patchwork of studies that nonetheless does have the effect of synthesising knowledge”. Despite those problems, they conclude that the Wikipedia article, and the creation process behind it, is similar to an academic review article. Also, despite Wikipedia’s claims that it is simply describing the state of things, rather than creating new arguments or points of view, the authors do think that the Wikipedia article is also an active voice in ongoing discussions, and note that some editors on the talk page see the purpose of the article as educating the public as well as some experts.

There are some unfortunate omissions (though to some degree understandable given academic publishing word limits). The authors do not discuss in detail whether some users, such as experts, seem to pull more weight in the discussions, or whether removal of personal stories impacts the friendliness of the discussion. Despite these omissions, the paper is an interesting analysis of knowledge creation on Wikipedia, as well as another contribution to the ongoing discussion about the reliability and quality of Wikipedia. On that note, Schizophrenia is a Featured Article, following a 2003 nomination that by today’s FA standards is more like a joke. Given the criticism of the article’s 2011 version as voiced by this paper, the community may want to consider a Featured Article Review here.

Evaluating link-based recommendations for Wikipedia

Reviewed by Morten Warncke-Wang

Co-citation graphs (networks of who cites whom) are frequently used to recommend books and articles, but how well do links between Wikipedia articles work for this purpose? A paper[4] to be published at the upcoming Joint Conference on Digital Libraries evaluates this by comparing the performance of co-citation with and without proximity analysis against the commonly used “More Like This” (MLT) text-based approach found in Apache Lucene. The paper’s main finding is that co-citation with proximity analysis (CPA) performs comparably to MLT, but that the two methods have different strengths: MLT is good at identifying closely related articles, while CPA is better at finding broader ones and will identify more popular articles that typically are of higher quality. These results suggest a hybrid approach might be best suited for finding related articles in Wikipedia, something the authors plan to study in future work.
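As a rough illustration of the difference between plain co-citation and co-citation with proximity analysis, the sketch below scores pairs of articles that are linked from the same source article. The input layout, the function name `cocitation_scores`, and in particular the `1 / distance` weighting are assumptions made for this example; the paper's actual weighting scheme may differ.

```python
from collections import defaultdict
from itertools import combinations


def cocitation_scores(link_positions, use_proximity):
    """Score article pairs that are linked from the same source article.

    `link_positions` maps a source article to a list of
    (target_article, word_offset) pairs (a hypothetical layout).
    Without proximity, every co-linked pair counts 1 per source; with
    proximity, the count is weighted by 1 / distance between the links.
    """
    scores = defaultdict(float)
    for links in link_positions.values():
        for (a, pos_a), (b, pos_b) in combinations(links, 2):
            if a == b:
                continue
            pair = tuple(sorted((a, b)))
            if use_proximity:
                scores[pair] += 1.0 / max(1, abs(pos_a - pos_b))
            else:
                scores[pair] += 1.0
    return dict(scores)


# Toy data: "Berlin" and "Germany" are linked close together in both sources,
# while "Football" is linked far away from them.
link_positions = {
    "S1": [("Berlin", 10), ("Germany", 12), ("Football", 200)],
    "S2": [("Berlin", 5), ("Football", 300), ("Germany", 6)],
}
plain = cocitation_scores(link_positions, use_proximity=False)
prox = cocitation_scores(link_positions, use_proximity=True)
print(plain[("Berlin", "Germany")], plain[("Berlin", "Football")])
print(prox[("Berlin", "Germany")], prox[("Berlin", "Football")])
```

In this toy data plain co-citation cannot tell the pairs apart, while the proximity-weighted variant ranks the pair whose links appear close together much higher.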


“Bridging the gap between Wikipedia and academia”

Reviewed by Piotr Konieczny

This paper[5] in JASIST from April this year is a brief opinion piece summarizing perceptions of Wikipedia in academia. It provides a short literature review of works that discuss this subject, summarizes the research on Wikipedia’s reliability (still a concern among many scholars), notes the spread of the use of Wikipedia as a teaching assignment in colleges, acknowledges the general widespread use of Wikipedia by the public, and in the paper’s own words, calls “for a peaceful coexistence”. A more detailed take on those very subjects is presented by the very same journal in March[6] (disclaimer: the latter article is written by this reviewer).

Troublesome tools: how can Wikipedia editing enhance student teachers’ digital skills?

Reviewed by Federico Leva

Yet another small university class (<20 first-year university students) has independently tried Wikipedia editing and tells its story.[7]

The students were told to edit an article and succeeded; while doing so, they improved their information literacy, digital literacy and trust in the Wikipedia system. On the other hand, the exercise itself was not sufficient to make them understand in depth the dynamics and principles of Wikipedia, nor to integrate them into the community.

In the opinion of this reviewer, the article makes for a nice blog post to be shared with university professors in other Nordic countries as well as in similar disciplines. The experience also confirms that university professors can and should use Wikipedia as a teaching tool, but can improve results if they contact expert Wikimedians (usually via a local Wikimedia chapter) to actually introduce the students to the spirit and dynamics of the Wikimedia projects.

Wikipedia as a tool for 21st-century teaching and learning

Reviewed by Federico Leva

Short opinion piece from the University of Wisconsin-River Falls supporting the usage of Wikipedia as a teaching tool to improve information literacy.[8] Under the guise of a literature review, the author mentions four past experiments with the use of wikis in the classroom, published between 2006 and 2009.

Factors that influence the teaching use of Wikipedia in higher education

Reviewed by Federico Leva

According to this 2012 survey of 800 professors belonging to the Universitat Oberta de Catalunya, professors mostly agree with the usage of Wikipedia as an “open repository” to disseminate research, and a growing number of them approve of its usage as a teaching tool. At the time of the survey, however, most professors were still waiting to be convinced by their colleagues.[9] See also the longer review of the paper’s preprint version[supp 6] in our December 2014 issue: “Use of Wikipedia in higher education influenced by peer opinions and perception of Wikipedia’s quality”

Analysing temporal evolution of interlingual Wikipedia article pairs

Reviewed by Morten Warncke-Wang

A paper[10] to be published at the forthcoming SIGIR 2016 conference as part of their demonstration track describes MultiWiki (demo available online), a tool that calculates similarities and differences between pairs of articles in different Wikipedia languages. The tool then visualises these using a timeline, a map, and by displaying article texts side-by-side. Visualising similarities and differences between Wikipedia languages is not a new idea[supp 7] [supp 8], but this tool is the first to show textual alignment.

Other recent publications

A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.

  • “Accidental technologist: how can libraries improve Wikipedia?”[11] From the abstract: “Wikipedia and libraries got off to a strained start. Perhaps this is only my perception, but it appeared that Wikipedia was used as a defenseless punching bag in much information literacy instruction.”
  • “The implications of Wikipedia for contemporary science education: using social network analysis techniques for automatic organisation of knowledge”[12] From the abstract: “In this paper we analyze a complete copy of the Spanish Wikipedia. We apply Social Networks Analysis Techniques and, more precisely, Communities Detection Techniques, in order to identify clusters of articles with similar content. […] We conclude that science articles are about 11.66 % of Spanish Wikipedia articles and that the most important clusters of scientific articles do not always coincide with classical Science disciplines.”
  • “‘Did i say something wrong?’ A word-level analysis of Wikipedia articles for deletion discussions”[13] [sic] From the abstract: “This thesis focuses on gaining linguistic insights into textual discussions on a word level. It was of special interest to distinguish messages that constructively contribute to a discussion from those that are detrimental to them. Thereby, we wanted to determine whether ‘I’- and ‘You’-messages are indicators for either of the two discussion styles. […] we used Wikipedia Articles for Deletion (short: AfD) discussions together with the records of blocked users and developed a fully automated creation of an annotated data set. […] We found that ‘You’-messages were a strong indicator for disruptive messages which matches their attributed effects on communication. However, we found ‘I’-messages to be indicative for disruptive messages as well which is contrary to their attributed effects.”
  • “Wikipedia: the difference between information acquisition and learning knowledge”[14] From the abstract: “This paper attempts to define Wikipedia in an information literacy context by providing an analysis of learning knowledge and Wikipedia’s structure.”
  • “Mapping bilateral information interests using the activity of Wikipedia editors”[15] From the abstract: “… we devise a scalable statistical model that identifies countries with similar information interests and measures the countries’ bilateral similarities. From the similarities we connect countries in a global network and find that countries can be mapped into 18 clusters with similar information interests. Through regression we find that language and religion best explain the strength of the bilateral ties and formation of clusters.”
  • “Extracting semantics from unconstrained navigation on Wikipedia”[16] From the abstract: “… we adapt a state of the art approach to extract semantic relatedness on Wikipedia paths. We apply this approach to transitions derived from two unconstrained navigation datasets as well as transitions from WikiGame and compare the results based on two common gold standards. […] Overall, we show that unconstrained navigation data on Wikipedia is suited for extracting semantics.”
  • “LlamaFur: learning latent category matrix to find unexpected relations in Wikipedia”[17] From the abstract: “we focus on finding ‘unexpected links’ in hyperlinked document corpora when documents are assigned to categories.”
  • “The lexicographic process of the German Wiktionary”[18] (in German, see also author’s website and our coverage of previous publications on Wiktionary by the same authors)
  • “Digital divisions of labor and informational magnetism: mapping participation in Wikipedia”[19] From the abstract: “Our regression analysis shows that the availability of broadband is a clear factor in the propensity of people to participate on Wikipedia. However, the relationship is not a linear one. As a country approaches levels of connectivity above about 450,000 broadband Internet connections, the ability of broadband access to positively affect participation keeps increasing. Complicating this issue is the fact that participation from the world’s economic peripheries tends to focus on editing about the world’s cores rather than their own local regions.”
  • “Gender biases in cyberspace: a two-stage model, the new arena of Wikipedia and other websites”[20] From the abstract: “This Article innovatively argues that the virtual world excludes women in two stages: first, by controlling websites and filtering out women; and second, by exposing women who survived the first stage to a hostile environment. Wikipedia, as well as other cyber-space environments, demonstrates the execution of the model, which results in the exclusion of women from the virtual sphere with all the implications thereof.”
  • “Centralizing content and distributing labor: a community model for curating the very long tail of microbial genomes”[21] From the abstract: “We are developing a microbial specific data model, based on Wikidata’s semantic web compatibility, which represents bacterial species, strains and the gene and gene products that define them. Currently, we have loaded 43 694 gene and 37 966 protein items for 21 species of bacteria …”



  1. Suin Kim, Sungjoon Park, Scott A. Hale, Sooyoung Kim, Jeongmin Byun, Alice H. Oh (2016-05-12). “Understanding Editing Behaviors in Multilingual Wikipedia”. PLOS ONE. 
  2. Lu, Xiao (2016). Hidden Gems in the Wikipedia Discussions: The Wikipedians’ Rationales. The Workshops of the Tenth International AAAI Conference on Web and Social Media. pp. 96–97. 
  3. Wyatt, Sally; Harris, Anna; Kelly, Susan E. (2016-02-12). “Controversy goes online: Schizophrenia genetics on Wikipedia”. Science & Technology Studies 29 (1). ISSN 2243-4690. 
  4. Malte Schwarzer, Moritz Schubotz, Norman Meuschke, Corinna Breitinger, Volker Markl, Bela Gipp (2016). Evaluating Link-based Recommendations for Wikipedia (PDF). JCDL. 
  5. Jemielniak, Dariusz; Aibar, Eduard (2016-03-01). “Bridging the gap between wikipedia and academia”. Journal of the Association for Information Science and Technology. doi:10.1002/asi.23691. ISSN 2330-1643.  Closed access
  6. Konieczny, Piotr (2016-04-01). “Teaching with Wikipedia in a 21st-century classroom: Perceptions of Wikipedia and its educational benefits”. Journal of the Association for Information Science and Technology. doi:10.1002/asi.23616. ISSN 2330-1643.  Closed access
  7. Brox, Hilde (2016-04-05). “Troublesome tools: How can Wikipedia editing enhance student teachers’ digital skills?”. Acta Didactica Norge 10 (2): 329–346. ISSN 1504-9922. [NPOV] was new to many of them. Some say they are surprised to find that there are so many rules and norms to consider before the text is up to standards. One respondent expressed astonishment that “there are even standards for how to write numbers in percentage!” Others are surprised to find any rules at all, having heard about the inaccuracies and biases of Wikipedia’s content: “I used to think anything goes.”’ … The students were positive about their discovery of the Wikipedia community, which for many changed some of their attitudes to the site. … For those who mention trust, they related it to one or both of the following factors: (a) to the discovery of the qualifications of many Wikipedians (“lots of educated people”) or (b) to the control mechanism available and that there are people who “check the pages” and “remove unwanted content” … The initial skepticism expressed in the questionnaire has thus changed, leaving Wikipedia “a place I can partly trust on par with other sources, as it is surveilled by a kind of administrators”.” 
  8. Christensen, T.B. (2015). Wikipedia as a Tool for 21st Century Teaching and Learning. International Journal for Digital Society, 6 (2), pp. 1055–1060.
  9. Meseguer-Artola, Antoni; Aibar, Eduard; Lladós, Josep; Minguillón, Julià; Lerga, Maura (2016-05-01). “Factors that influence the teaching use of Wikipedia in higher education”. Journal of the Association for Information Science and Technology 67 (5): 1224–1232. doi:10.1002/asi.23488. ISSN 2330-1643.  Closed access
  10. Gottschalk, Simon; Demidova, Elena (2016). Analysing Temporal Evolution of Interlingual Wikipedia Article Pairs (PDF). SIGIR. 
  11. Phetteplace, Eric (2015). “Accidental Technologist: How Can Libraries Improve Wikipedia?”. Reference and User Services Association 55 (2). 
  12. Figuerola, Carlos G.; Groves, Tamar; Quintanilla, Miguel Angel (2015). “The Implications of Wikipedia for Contemporary Science Education: Using Social Network Analysis Techniques for Automatic Organisation of Knowledge”. Proceedings of the 3rd International Conference on Technological Ecosystems for Enhancing Multiculturality. TEEM ’15. New York, NY, USA: ACM. pp. 403–410. doi:10.1145/2808580.2808641. ISBN 978-1-4503-3442-6.  Closed access
  13. Ruster, Michael (2016-03-15). “‘Did i say something wrong?’ A word-level analysis of Wikipedia articles for deletion discussions”. Campus Koblenz: Universität Koblenz-Landau.  (thesis)
  14. Ochola, J. Evans; Persson, Dorothy M.; Schumacher, Lisa A.; Lingo, Mitchell D. (2015-12-14). “Wikipedia: the difference between information acquisition and learning knowledge”. First Monday 20 (12). doi:10.5210/fm.v20i12.4875. ISSN 1396-0466. 
  15. Karimi, Fariba; Bohlin, Ludvig; Samoilenko, Anna; Rosvall, Martin; Lancichinetti, Andrea (2015-12-15). “Mapping bilateral information interests using the activity of Wikipedia editors”. Palgrave Communications 1: 15041. doi:10.1057/palcomms.2015.41. ISSN 2055-1045. 
  16. Niebler, Thomas; Schlör, Daniel; Becker, Martin; Hotho, Andreas (2015-12-16). “Extracting Semantics from Unconstrained Navigation on Wikipedia”. KI – Künstliche Intelligenz: 1–6. doi:10.1007/s13218-015-0417-5. ISSN 0933-1875.  Closed access
  17. Boldi, Paolo; Monti, Corrado (2016-03-31). “LlamaFur: Learning Latent Category Matrix to Find Unexpected Relations in Wikipedia (Long version)”. arXiv:1603.09540 [cs]. 
  18. Meyer, Christian M.; Gurevych, Iryna (January 2016). Der lexikographische Prozess im deutschen Wiktionary (The lexicographic process of the German Wiktionary). OPAL 2015. p. 82. doi:10.14618/opal_01-2016.  (in German)
  19. Graham, Mark; Straumann, Ralph K.; Hogan, Bernie (2015-09-07). Digital Divisions of Labor and Informational Magnetism: Mapping Participation in Wikipedia. Rochester, NY: Social Science Research Network. 
  20. Yanisky-Ravid, Shlomit; Mittelman, Amy (2016-01-01). “Gender Biases in Cyberspace: A Two-Stage Model, the New Arena of Wikipedia and Other Websites”. Fordham Intellectual Property, Media and Entertainment Law Journal 26 (2): 381. 
  21. Putman, Tim E.; Burgstaller-Muehlbacher, Sebastian; Waagmeester, Andra; Wu, Chunlei; Su, Andrew I.; Good, Benjamin M. (2016-01-01). “Centralizing content and distributing labor: a community model for curating the very long tail of microbial genomes”. Database 2016: baw028. doi:10.1093/database/baw028. ISSN 1758-0463. PMID 27022157. 
Supplementary references:
  1. “Editor Survey Report – April 2011.pdf” (PDF). 
  2. Hale, Scott A. (2014). Multilinguals and Wikipedia Editing (PDF). WebSci ’14. 
  3. Geiger, R. Stuart; Halfaker, Aaron (2013). Using edit sessions to measure participation in Wikipedia (PDF). CSCW. 
  4. Warncke-Wang, Morten; Uduwage, Anuradha; Dong, Zhenhua; Riedl, John (2012). “In Search of the ur-Wikipedia: Universality, Similarity, and Translation in the Wikipedia Inter-language Link Network” (PDF). Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration. WikiSym ’12. New York, NY, USA: ACM. doi:10.1145/2462932.2462959. ISBN 9781450316057. 
  5. Zia, Leila; Taraborelli, Dario (2016-04-27). “Find, Prioritize, and Recommend: An article recommendation system to fill knowledge gaps across Wikipedia”. 
  6. Meseguer Artola, Antoni; Aibar Puentes, Eduard; Lladós Masllorens, Josep; Minguillón Alfonso, Julià; Lerga Felip, Maura (2014-12-11). “Factors that influence the teaching use of Wikipedia in Higher Education” (Article). 
  7. Patti Bao, Brent Hecht, Samuel Carton, Mahmood Quaderi, Michael Horn, Darren Gergle (2012). Omnipedia: Bridging the Wikipedia Language Gap (PDF). CHI. 
  8. Massa, Paulo; Scrinzi, Federico (2012). Manypedia: Comparing Language Points of View of Wikipedia Communities (PDF). WikiSym. 


This newsletter is brought to you by the Wikimedia Research Committee and The Signpost