Wikimedia Research Newsletter

Vol: 4 • Issue: 8 • August 2014

A Wikipedia-based Pantheon; new Wikipedia analysis tool suite; how AfC hamstrings newbies

With contributions by: Federico Leva, Piotr Konieczny, Maximilian Klein, and Pine

Wikipedia in all languages used to rank global historical figures of all time

A research group at MIT led by Cesar A. Hidalgo published[1] a global “Pantheon” (probably the same project already mentioned in our December 2012 issue), where Wikipedia biographies are used to identify and “score” thousands of global historical figures of all time, together with a previous compilation of persons having written sources about them. The work was also covered in several news outlets. We won’t summarise here all the details, strengths and limits of their method, which can already be found in the well-written document above.

Many if not most of the headaches encountered by the research group lie in the work needed to aggregate said scores by geographical areas. It’s easy to get the city of birth of a person from Wikipedia, but it’s hard to tell to what ancient or modern country that city corresponds, for any definition of “country”. (Compare our recent review of a related project by a different group of researchers that encountered the same difficulties: “Interactions of cultures and top people of Wikipedia from ranking of 24 language editions”.) The MIT research group has to manually curate a local database; in an ideal world, they’d just fetch from Wikidata via an API. Aggregation by geographical area, for this and other reasons, seems of lesser interest than the place-agnostic person rank.

The most interesting point is that a person is considered historically relevant when they are the subject of an article in 25 or more editions of Wikipedia. This method of assessing an article’s importance is often used by editors, but only as an unscientific approximation. It’s a useful finding that it proved valuable for research as well, though with acknowledged issues. The study is also one of the rare occasions on which researchers bother to investigate Wikipedia in all languages at the same time, and we hope there will be follow-ups. For instance, it could be interesting to know which people with an otherwise high “score” were not included due to the 25+ languages filter, which could then be further tweaked based on the findings. As an example of possible distortions, Wikipedia has a dozen subdomains for local languages of Italy, but having an article in 10 languages of Italy is no more an achievement of “global” coverage than having an article in one.
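To make the filter and the distortion concrete, here is a minimal sketch in Python. It is not the researchers’ code: the `LANGUAGE_GROUPS` mapping, the function names and the idea of collapsing related editions into one group are assumptions made purely for illustration.

```python
from typing import Dict, Set

# Hypothetical mapping from Wikipedia language codes to a coarser group,
# so that several regional editions count once. Illustrative only.
LANGUAGE_GROUPS: Dict[str, str] = {
    "it": "italy", "scn": "italy", "nap": "italy",
    "vec": "italy", "lmo": "italy", "pms": "italy",
}


def edition_count(editions: Set[str]) -> int:
    """Raw number of language editions that have the biography (L)."""
    return len(editions)


def grouped_count(editions: Set[str]) -> int:
    """Number of distinct groups; codes without a group count as themselves."""
    return len({LANGUAGE_GROUPS.get(code, code) for code in editions})


def passes_filter(editions: Set[str], threshold: int = 25,
                  grouped: bool = False) -> bool:
    """Apply the 'article in at least `threshold` editions' rule."""
    count = grouped_count(editions) if grouped else edition_count(editions)
    return count >= threshold
```

With such a helper, a biography present in 20 unrelated editions plus six editions from one group would pass the raw 25+ filter but fail the grouped one, which is the kind of borderline case a follow-up study could list.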

The group then proceeded to calculate a “historical cultural production index” for those persons, based on pageviews of the respective biographies (PV). This reviewer would rather call it a “historical figures modern popularity index”. While the recentism bias of the Internet (which Wikipedia acknowledges and tries to counter) is acknowledged for selection, most of the recentism in this work is in the ranking, because of the use of pageviews. As WikiStats shows, 20% of requests come from a country (the US) with only 5% of the world population, or some 0.3% of the total population in history (assumed as ~108 billion). Therefore there is an error/bias of probably two orders of magnitude in the “score” for “USA” figures; perhaps three, if we add that five years of pageviews are used as a sample for the whole current generation. L* is an interesting attempt to correct the “languages count” for a person (L) in the cases where visits are amassed in single languages/countries; but a similar correction would be needed for PV as well.
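The arithmetic behind this estimate can be spelled out in a few lines of Python. The constants are the approximate figures quoted above. The `effective_count` function is only an assumed, entropy-based illustration of how a count can be discounted when views pile up in one language; it is not taken from the paper and may differ from the authors’ definition of L*.

```python
import math
from typing import Sequence

# Approximate figures quoted in the review above.
US_SHARE_OF_REQUESTS = 0.20      # share of Wikipedia requests from the US
US_SHARE_OF_LIVING = 0.05        # US share of today's world population
US_SHARE_OF_ALL_HUMANS = 0.003   # US share of everyone who ever lived (~108 billion)


def overrepresentation(request_share: float, population_share: float) -> float:
    """Ratio of a group's share of requests to its share of a population."""
    return request_share / population_share


def effective_count(pageviews: Sequence[float]) -> float:
    """Entropy-based 'effective number' of languages (an assumed definition).

    Equals the number of languages when views are spread evenly and
    approaches 1 when almost all views come from a single edition.
    """
    total = sum(pageviews)
    if total <= 0:
        raise ValueError("at least one language must have pageviews")
    shares = [v / total for v in pageviews if v > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return math.exp(entropy)


vs_living = overrepresentation(US_SHARE_OF_REQUESTS, US_SHARE_OF_LIVING)
vs_history = overrepresentation(US_SHARE_OF_REQUESTS, US_SHARE_OF_ALL_HUMANS)
print(f"vs. living population: about {vs_living:.0f}x")
print(f"vs. all humans in history: about {vs_history:.0f}x "
      f"(10^{math.log10(vs_history):.1f})")
```

The ratio against the historical population is the one that approaches two orders of magnitude. The same kind of discounting that `effective_count` applies to language counts could in principle be applied to pageviews by country.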

From the perspective of Wikipedia editors, it’s a pity that Wikipedia is the main source for such a rank, because this means that Wikipedians can’t use it to fill gaps: the distribution of topic coverage across languages is complex and far from perfect; while content translation tools will hopefully help make it more even, prioritisation is needed. It would be wonderful to have a rank of notably missing biographies per language edition of Wikipedia, especially for under-represented groups, which could then be forwarded to the local editors and featured prominently to attract contributions. This is a problem often worked on, from ancient times to recent tools, but we really lack something based on third-party sources. We have good tools to identify languages where a given article is missing, but we first need a list (of lists) of persons with any identifier, be it an authority record, a Wikidata entry, an English name or anything else that we can then map ourselves.

The customary complaint about inconsistent inclusion criteria can also be found: «being a player in a second division team in Chile is more likely to pass the notoriety criteria required by Wikipedia Editors than being a faculty at MIT», observe the MIT researchers. However, the fact that nobody has bothered to write an article on a subject doesn’t mean that the project as a whole is not interested in having that article; articles about sports people are just easier to write, and the project needs and wants more volunteers for everything. Hidalgo replied that he had some examples of deletions in mind; we have not reviewed them, but it’s also possible that the articles were deleted for their state rather than for the subject itself, a difference to which “victims” of deletion often fail to pay attention.

WikiBrain: Democratizing computation on Wikipedia

– by Maximilianklein

When analyzing any Wikipedia version, getting the underlying data can be a hard engineering task, beyond the difficulty of the research itself. Developed by researchers from Macalester College and the University of Minnesota, WikiBrain aims to “run a single program that downloads, parses, and saves Wikipedia data on commodity hardware.”[2] Wikipedia dump downloaders and parsers have long existed, but WikiBrain is more ambitious in that it tries to be even friendlier by introducing three main primitives: a multilingual concept network, semantic relatedness algorithms, and geospatial data integration. With those elements, the authors hope that Wikipedia research will become a mix-and-match affair.

Waldo Tobler’s First Law of Geography – “everything is related to everything else, but near things are more related than distant things” – can be shown true for Wikipedia articles in just a few lines of code with WikiBrain.

The first primitive is the multilingual concept network. Since the release of Wikidata, the Universal Concepts that all language versions of Wikipedia represent have mostly come to be defined by the Wikidata item that each language version links to. “Mostly” is a key word here, because there are still some edge cases, like the English Wikipedia’s distinction between the concepts of “high school” and “secondary school”, which other editions do not make. WikiBrain will give you the Wikidata graph of multilingual concepts by default, and the power to tweak this as you wish.

The next primitive is semantic relatedness (SR), which is the process of quantifying how close two articles are by their meaning. There have been literally hundreds of SR algorithms proposed over the last two decades. Some rely on Wikipedia’s links and categories directly. Others require a text corpus, for which Wikipedia can be used. Most modern SR algorithms can be built one way or another with Wikipedia. WikiBrain supplies the ability to use five state-of-the-art SR algorithms, or their ensemble method – a combination of all 5.
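As an illustration of the link-based family, here is a minimal Python sketch of a relatedness score computed from shared inlinks, in the style of the measure proposed by Milne and Witten. It is not WikiBrain’s implementation, and the function names and toy data are assumptions. The `mean_ensemble` helper is a plain unweighted average used as a stand-in; the ensemble described in the paper may combine scores differently.

```python
import math
from typing import Sequence, Set


def link_relatedness(links_a: Set[int], links_b: Set[int],
                     total_articles: int) -> float:
    """Relatedness in [0, 1] from the overlap of two articles' inlink sets.

    links_a, links_b: ids of articles linking to each of the two articles.
    total_articles: number of articles in the whole collection.
    """
    common = len(links_a & links_b)
    if common == 0:
        return 0.0
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    if total_articles <= larger:
        raise ValueError("total_articles must exceed both inlink counts")
    # Normalised distance: 0 when the sets coincide, growing as overlap shrinks.
    distance = (math.log(larger) - math.log(common)) / (
        math.log(total_articles) - math.log(smaller))
    return max(0.0, 1.0 - distance)


def mean_ensemble(scores: Sequence[float]) -> float:
    """Unweighted average of several SR scores (a simplifying assumption)."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)
```

Two articles with identical inlink sets score 1.0, articles with no shared inlinks score 0.0, and partial overlap lands in between.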

Already at this point the authors give an example of how to mix these primitives: in just a few lines of code, one can find the articles in all languages that are closest to the English article on “jazz” and are also tagged as a film in Wikidata.

The last primitive is a suite of tools for spatial computation, so that extracting location data out of Wikipedia and Wikidata can become a standardized process. Incorporated are some classic solutions to the “geoweb scale problem”: regardless of an entity’s footprint in space, it is represented by a point. That is a problem one shouldn’t have to think about, and indeed, WikiBrain will solve it for you under the covers.

To demonstrate the power of WikiBrain, the authors then provide a case study wherein they replicate previous research that took “thousands of lines of code”, and do it in “just a few” using WikiBrain’s high-level syntax. The case study is cherry-picked, as it is previous research by one of the paper’s listed authors; of course it’s easy to reconstruct one’s own previous research in a framework you custom-built. The case study is an empirical test of Tobler’s first law of geography using Wikipedia articles. Essentially, one compares the SR of articles with their geographic closeness, and the two are verified to be positively linked.
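The shape of such a test can be sketched in plain Python, without WikiBrain. The sketch computes great-circle distances between pairs of geotagged articles and correlates them with SR scores. The coordinates are approximate city locations, and the SR values are made-up placeholders rather than results from the paper.

```python
import math
from typing import Sequence, Tuple

EARTH_RADIUS_KM = 6371.0
Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


PARIS = (48.8566, 2.3522)
# (other place, placeholder SR score against the Paris article) -- invented values
SAMPLE = [
    ((48.8049, 2.1204), 0.81),     # Versailles
    ((51.5074, -0.1278), 0.55),    # London
    ((40.7128, -74.0060), 0.32),   # New York
    ((-33.8688, 151.2093), 0.12),  # Sydney
]

distances = [haversine_km(PARIS, place) for place, _ in SAMPLE]
sr_scores = [score for _, score in SAMPLE]
correlation = pearson(distances, sr_scores)
print(f"correlation between distance and SR: {correlation:.2f}")
```

A negative correlation between distance and SR corresponds to the positive link between closeness and relatedness that Tobler’s law predicts. A real replication would of course use many article pairs and measured SR values.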

Does the world need an easier, simpler, more off-the-shelf Wikipedia research tool? Yes, of course. Is WikiBrain it? Maybe or maybe not, depending on who you are. The software described in the paper is still version 0.3. There are notes explaining the upcoming features of edit history parsing, article quality ranking, and user data parsing. The project and its examples are written in Java, a language choice that targets a specific demographic of researchers and alienates others. That makes WikiBrain a good tool for Java programmers who do not know how to parse offline dumps and have an interest in multilingual concept alignment, semantic relatedness, or spatial relatedness. Everyone else will have to make do with one of the other 20+ alternative parsers and write their own glue code. That’s OK though; frankly, the idea of making one research tool to “rule them all” is too audacious and commandeering for the open-source ecosystem. Still, that doesn’t mean that WikiBrain can’t find its userbase and supporters.

Newcomer productivity and pre-publication review

It’s time for another interesting paper on newcomer retention[3] from authors with a proven track record of tackling this issue. This time they focus on the Articles for Creation (AfC) mechanism. The authors conclude that instead of improving the success of newcomers, AfC in fact further decreases their productivity. The authors note that once AfC was fully rolled out around mid-2011, it began to be widely used: the percentage of newcomers using it went up from <5% to ~25%. At the same time, the percentage of newbie articles surviving on Wikipedia went down from ~25% to ~15%. The authors hypothesize that the AfC process is unfriendly to newcomers due to the following issues: 1) it’s too slow, and 2) it hides drafts from potential collaborators.

Regarding the first hypothesis, the authors find that the AfC review process is not subject to insurmountable delays; they conclude that “most drafts will be submitted for review quickly and that reviews will happen in a timely manner”. In fact, two-thirds of reviews take place within a day of submission (a figure that positively surprised this reviewer, though a current AfC status report suggests the situation has worsened since: “Severe backlog: 2599 pending submissions”). In either case, the authors find that about a third of newcomers using the AfC system fail to understand that they need to finalize the process by submitting their drafts for review at all – a likely indication that the AfC instructions need revising, and that the AfC regulars may want to implement a system for identifying stalled drafts, which in some cases may be ready for mainspace despite having never been officially “submitted” (due to their newbie creator not knowing about this step or not carrying it out properly).

However, the authors do stand by their second hypothesis: they conclude that the AfC articles suffer from not receiving collaborative help that they would get if they were mainspaced. They discuss a specific AfC, for the article Dwight K. Shellman, Jr/Dwight Shellman. This article has been tagged as potentially rescuable, and has been languishing in that state for years, hidden in the AfC namespace, together with many other similarly backlogged articles, all stuck in low-visibility limbo and prevented from receiving proper Wikipedia-style collaboration-driven improvements (or deletion discussions) as an article in the mainspace would receive.

The researchers identify a number of other factors that reduce the functionality of the AfC process. As in many other aspects of Wikipedia, negative feedback dominates. Reviewers are rarely thanked for anything, but are more likely to be criticized for passing an article deemed problematic by another editor, which leads to the mentality that “rejecting articles is safest” (as newbies are less likely to complain about their article’s rejection than experienced editors are about passing one). AfC also suffers from the same “one reviewer” problem as GA: the reviewer may not always be qualified to carry out the review, yet the newbies have little knowledge of how to ask for a second opinion. The authors specifically discuss a case of reviewers not familiar with the specific notability criteria: “[despite being notable] an article about an Emmy-award winning TV show from the 1980’s was twice declined at AfC, before finally being published 15 months after the draft was started”. Presumably, if this article had not been submitted for review, it would never have been deleted from the mainspace.

The authors are critical of the interface of the AfC process, concluding that its instructions are too unfriendly to newbies: “Newcomers do not understand the review process, including how to submit articles for review and the expected timeframe for reviews” and “Newcomers cannot always find the articles they created. They may recreate drafts, so that the same content is created and reviewed multiple times. This is worsened by having multiple article creation spaces (Main, userspace, Wikipedia talk, and the recently-created Draft namespace)”.

The researchers conclude that AfC works well as a filtering process for the encyclopedia; however, “for helping and training newcomers [it] seems inadequate”. AfC succeeds in protecting content under the (recently established) speedy deletion criterion G13, in theory allowing newbies to keep fixing it – but many do not take this opportunity. Nor can the community deal with this, and thus the authors call for the creation of “a mechanism for editors to find interesting drafts”. That said, this reviewer wants to point out that the G13 backlog, while quite interesting (thousands of articles almost ready for main space …), is not the only backlog Wikipedia has to deal with – something the writers overlook. The G13 backlog is likely partially a result of imperfect AfC design that could be improved, but all such backlogs are also an artifact of the lack of active editors affecting Wikipedia projects on many levels.

In either case, AfC regulars should carefully examine the authors’ suggestions. This reviewer finds the following ideas in particular worth pursuing. 1) Determine which drafts need collaboration and make them more visible to potential editors. Here the authors suggest using a recent academic model that should help automatically identify valuable articles, and then feeding those articles to SuggestBot. 2) Support newcomers’ first contributions – almost a dead horse at this point, but we know we are not doing enough to be friendly to newcomers. In particular, the authors note that we need to create better mechanisms for newcomers to get help on their drafts, and to improve the article creation advice – especially the Article Wizard. (As a teacher who has introduced hundreds of newcomers to Wikipedia, this reviewer can attest that the current outreach to newbies on those levels is grossly inadequate.)

A final comment to the community in general: was AfC intended to help newcomers, or was it intended from the start to reduce the strain on new page patrollers by sandboxing the drafts in the first place? One of the roles of AfC is to prevent problematic articles from appearing in the mainspace, and it does seem that in this role it is succeeding quite well. The English Wikipedia community has rejected the flagged revisions-like tool, but allowed its implementation on a voluntary basis for newcomers, who in turn may often not realize that by choosing the AfC process, friendly on the surface, they are in fact slow-tracking themselves and inviting extraordinary scrutiny. This leads to a larger question that is worth considering: we, the Wikipedia community of active editors, have declined to have our edits classified as second-tier and hidden from the public until they are reviewed, but we are fine pushing this onto the newbies. To what degree is this contributing to the general trend of Wikipedia being less and less friendly to newcomers? Is the resulting quality control worth turning away potential newbies? Would we be here if years ago our first experience with Wikipedia had been through AfC?


PLOS Biology is an open-access peer-reviewed scientific journal covering all aspects of biology. Publication began on October 13, 2003.
(“PLoS Biology cover April 2009” by PLoS, under CC-BY-2.5)

15% of PLOS Biology articles are cited on Wikipedia

A conference paper titled “An analysis of Wikipedia references across PLOS publications”[4] asked the following research questions: “1) To what extent are scholarly articles referenced in Wikipedia, and what content is particularly likely to be mentioned?” and “2) How do these Wikipedia references correlate with other article-level metrics such as downloads, social media mentions, and citations?”. To answer this, the authors analyzed which PLOS articles are referenced on Wikipedia. They found that as of March 2014, about 4% of PLOS articles were mentioned on Wikipedia, which they conclude is “similar to mentions in science blogs or the post-publication peer review service, F1000Prime”. About half of the articles mentioned on Wikipedia are also mentioned on Facebook, suggesting that being cited on Wikipedia is related to being picked up by other social media. Most Wikipedia citations come from PLOS Genetics, PLOS Biology and other biology/medicine related PLOS outlets, with PLOS One accounting for only 3% of the total, though there are indications this is changing over time. 15% of all articles from PLOS Biology have been cited on Wikipedia, the highest ratio among the studied journals. Unfortunately, this is very much a descriptive paper, and the authors stop short of trying to explain or predict anything. The authors also observe that “By far the most referenced PLOS article is a study on the evolution of deep-sea gastropods (Welch, 2010) with 1249 references, including 541 in the Vietnamese Wikipedia.”

“Big data and small: collaborations between ethnographers and data scientists”

Ethnography is often seen as the least quantitative branch of social science, and this[5] essay-like article’s style is a good illustration. This is, essentially, a self-reflective story of a Wikipedia research project. The author, an ethnographer, recounts her collaboration with two big data scholars in a project dealing with a large Wikipedia dataset. The results of their collaboration are presented here and have been briefly covered by our Newsletter in Issue 8/13. This article can be seen as an interesting companion to the prior, Wikipedia-focused piece, explaining how it was created, though it fails to answer questions of interest to the community, such as “why did the authors choose Wikipedia as their research ground” or about their experiences (if any) editing Wikipedia.

“Emotions under discussion: gender, status and communication in online collaboration”

Researchers investigated[6] “how emotion and dialogue differ depending on the status, gender, and the communication network of the ~12,000 editors who have written at least 100 comments on the English Wikipedia’s article talk pages.” Researchers found that male administrators tend to use an impersonal and neutral tone. Non-administrator females used more relational forms of communication. Researchers also found that “editors tend to interact with other editors having similar emotional styles (e.g., editors expressing more anger connect more with one another).” Authors of this paper will present their research at the September Wikimedia Research and Data showcase.


  2. Sen, Shilad. “WikiBrain: Democratizing computation on Wikipedia”. OpenSym ’14: 1–19. doi:10.1145/2641580.2641615. Open access
  3. Schneider, Jodi; Bluma S. Gelley; Aaron Halfaker. “Accept, decline, postpone: How newcomer productivity is reduced in English Wikipedia by pre-publication review”. OpenSym ’14, August 27–29, 2014, Berlin.
  4. Fenner, Martin; Jennifer Lin (June 6, 2014), “An analysis of Wikipedia references across PLOS publications”, altmetrics14 workshop at WebSci, doi:10.6084/m9.figshare.1048991 
  5. Ford, Heather (1 July 2014). “Big data and small: collaborations between ethnographers and data scientists”. Big Data & Society 1 (2): 2053951714544337. doi:10.1177/2053951714544337. ISSN 2053-9517.
  6. Laniado, David; Carlos Castillo; Mayo Fuster Morell; Andreas Kaltenbrunner (2014-08-20). “Emotions under Discussion: Gender, Status and Communication in Online Collaboration”. PLoS ONE 9 (8): e104880. doi:10.1371/journal.pone.0104880. 

This newsletter is brought to you by the Wikimedia Research Committee and The Signpost