AI-generated Wikipedia articles give rise to debate about research ethics

At the International Joint Conference on Artificial Intelligence (IJCAI) – one of the prime AI conferences, if not the pre-eminent one – Banerjee and Mitra from Penn State published the paper “WikiWrite: Generating Wikipedia Articles Automatically”.[1]

The system described in the paper looks for red links in Wikipedia and classifies them based on their context. To find section titles, it then looks for similar existing articles. With these titles, the system searches the web for information, and eventually uses content summarization and a paraphrasing algorithm. The researchers uploaded 50 of these automatically created articles to Wikipedia, and found that 47 of them survived. Some were heavily edited after upload, others not so much.
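
As a rough illustration of the pipeline described above, here is a minimal Python sketch. It is not the authors' implementation: the function names, the toy category data, and the keyword-based classifier are all assumptions made for this example, and the web search, summarization, and paraphrasing steps are passed in as placeholder callables.

```python
from collections import Counter
from typing import Callable

# Hypothetical stand-in for "similar existing articles": for each category,
# the section titles of a few existing articles (toy data, not from the paper).
EXISTING_ARTICLES = {
    "writer": [["Early life", "Career", "Works"], ["Early life", "Works", "Awards"]],
    "river": [["Course", "History"], ["Course", "Ecology"]],
}


def classify_red_link(context: str) -> str:
    """Guess a category for a red link from the sentence it appears in.

    A naive keyword match is assumed here purely for illustration.
    """
    words = context.lower().split()
    if any(w in words for w in ("novel", "author", "poet")):
        return "writer"
    return "river"


def common_section_titles(category: str, min_count: int = 2) -> list[str]:
    """Return section titles used by at least min_count similar articles."""
    counts = Counter(
        title for sections in EXISTING_ARTICLES[category] for title in sections
    )
    return [title for title, n in counts.items() if n >= min_count]


def draft_article(
    title: str,
    context: str,
    search: Callable[[str], list[str]],
    summarize: Callable[[list[str]], str],
    paraphrase: Callable[[str], str],
) -> dict[str, str]:
    """Assemble a draft: classify, pick sections, then search/summarize/paraphrase."""
    category = classify_red_link(context)
    draft = {}
    for section in common_section_titles(category):
        snippets = search(f"{title} {section}")  # web search is a placeholder
        draft[section] = paraphrase(summarize(snippets))
    return draft


if __name__ == "__main__":
    # Placeholder callables; a real system would plug in actual components.
    result = draft_article(
        "Jane Doe",
        "Jane Doe published her first novel in 1950",
        search=lambda query: [f"Text found for '{query}'."],
        summarize=lambda snippets: snippets[0],
        paraphrase=lambda text: text,
    )
    print(result)
```

Passing the three content steps in as callables keeps the sketch self-contained while making clear that those components, which do the real work in the system the paper describes, are not reproduced here.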


While I was enthusiastic about the results, I was surprised by the suboptimal quality of the articles I reviewed – three that were mentioned in the paper. After a brief discussion with the authors, a wider discussion was initiated on the Wiki-research mailing list. This was followed by an entry on the English Wikipedia administrators’ noticeboard (which includes a list of all accounts used for this particular research paper). The discussion led to the removal of most of the remaining articles.

The discussion concerned the ethical implications of the research, and using Wikipedia for such an experiment without the consent of Wikipedia contributors or readers. The first author of the paper was an active member of the discussion; he showed a lack of awareness of these issues, and appeared to learn a lot from the discussion. He promised to take these lessons to the relevant research community – a positive outcome.

In general, this sets an example for engineers and computer scientists, who often show a lack of awareness of certain ethical issues in their research. Computer scientists are typically trained to think about bits and complexities, and rarely discuss in depth how their work impacts human lives. Whether it’s social networks experimenting with the mood of their users, current discussions of biases in machine-learned models, or the experimental upload of automatically created content in Wikipedia without community approval, computer science has generally not reached the level of awareness that some other sciences have of the possible effects of their research on human subjects, at least as far as this reviewer can tell.

Even in Wikipedia, there’s no clear-cut, succinct Wikipedia policy I could have pointed the researchers to. The use of sockpuppets was a clear violation of policy, but an incidental component of the research. WP:POINT was a stretch to cover the situation at hand. In the end, what we can suggest to researchers is to check back with the Wikimedia Research list. A lot of people there have experience with designing research plans with the community in mind, and it can help to avoid uncomfortable situations.

See also our 2015 review of a related paper coauthored by the same authors: “Bot detects theatre play scripts on the web and writes Wikipedia articles about them” and other similarly themed papers they have published since then: “WikiKreator: Automatic Authoring of Wikipedia Content”[2], “WikiKreator: Improving Wikipedia Stubs Automatically”[3], “Filling the Gaps: Improving Wikipedia Stubs”[4]. DV

Ethics researcher: Vandal fighters should not be allowed to see whether an edit was made anonymously

A paper[5] in the journal Ethics and Information Technology examines the “system of surveillance” that the English Wikipedia has built up over the years to deal with vandalism edits. The author, Paul B. de Laat from the University of Groningen, presents an interesting application of a theoretical framework by US law scholar Frederick Schauer that focuses on the concepts of rule enforcement and profiling. While providing justification for the system’s efficacy and largely absolving it of some of the objections that are commonly associated with the use of profiling in e.g. law enforcement, de Laat ultimately argues that in its current form, it violates an alleged “social contract” on Wikipedia by not treating anonymous and logged-in edits equally. Although generally well-informed about both the practice and the academic research of vandalism fighting, the paper unfortunately fails to connect to an existing debate about very much the same topic – potential biases of artificial intelligence-based anti-vandalism tools against anonymous edits – that was begun last year[6] by the researchers developing ORES (an edit review tool that was just made available to all English Wikipedia users, see this week’s Technology report) and most recently discussed in the August 2016 WMF research showcase.

The paper first gives an overview of the various anti-vandalism tools and bots in use, recapping an earlier paper[7] where de Laat had already asked whether these are “eroding Wikipedia’s moral order” (following an even earlier 2014 paper in which he had argued that new-edit patrolling “raises a number of moral questions that need to be answered urgently”). There, de Laat’s concerns included the fact that some stronger tools (rollback, Huggle, and STiki) are available only to trusted users and “cause a loss of the required moral skills in relation to newcomers”, and that there is a lack of transparency about how the tools operate (in particular when more sophisticated artificial intelligence/machine learning algorithms such as neural networks are used). The present paper expands on a separate but related concern, about the use of “profiling” to pre-select which recent edits will be subject to closer human review. The author emphasizes that on Wikipedia this usually does not mean person-based offender profiling (building profiles of individuals committing vandalism), citing only one exception in the form of a 2015 academic paper – cf. our review: “Early warning system identifies likely vandals based on their editing behavior”. Rather, “the anti-vandalism tools exemplify the broader type of profiling” that focuses on actions. Based on Schauer’s work, the author asks the following questions:

  1. “Is this profiling profitable, does it bring the rewards that are usually associated with it?”
  2. “is this profiling approach towards edit selection justified? In particular, do any of the dimensions in use raise moral objections? If so, can these objections be met in a satisfactory fashion, or do such controversial dimensions have to be adapted or eliminated?”

Image caption: But snakes are much more dangerous! According to Schauer, while general rules are always less fair than case-by-case decisions, their existence can be justified by other arguments.

To answer the first question, the author turns to Schauer’s work on rules, in a brief summary that is worth reading for anyone interested in Wikipedia policies and guidelines – although de Laat instead applies the concept to the “procedural rules” implicit in vandalism profiling (such as that anonymous edits are more likely to be worth scrutinizing). First, Schauer “resolutely pushes aside the argument from fairness: decision-making based on rules can only be less just than deciding each case on a particularistic basis”. (For example, a restaurant’s “No Dogs Allowed” rule will unfairly exclude some well-behaved dogs, while not prohibiting much more dangerous animals such as snakes.) Instead, the existence of rules has to be justified by other arguments, of which Schauer presents four:

  • Rules “create reliability/predictability for those affected by the rule: rule-followers as well as rule-enforcers”.
  • Rules “promote more efficient use of resources by rule-enforcers” (e.g. in the case of a speeding car driver, traffic police and judges can apply a simple speed limit instead of having to prove in detail that an instance of driving was dangerous).
  • Rules, if simple enough, reduce the problem of “risk-aversion” by enforcers, who are much more likely to make mistakes and face repercussions if they have to make case-by-case decisions.
  • Rules create stability, which however also presents “an impediment to change; it entrenches the status-quo. If change is on a society’s agenda, the stability argument turns into an argument against having (simple) rules.”

The author cautions that these four arguments have to be reinterpreted when applying them to vandalism profiling, because it consists of “procedural rules” (which edits should be selected for inspection) rather than “substantive rules” (which edits should be reverted as vandalism, which animals should be disallowed from the restaurant). While in the case of substantive rules, their absence would mean having to judge everything on a case-by-case basis, the author asserts that procedural rules arise in a situation where the alternative would be to not judge at all in many cases, because “we have no means at our disposal to check and pass judgment on all of them; a selection of a kind has to be made. So it is here that profiling comes in”. With that qualification, Schauer’s second argument provides justification for “Wikipedian profiling [because it] turns out to be amazingly effective”, starting with the autonomous bots that auto-revert with an (aspired) 1:1000 false-positive rate.

De Laat also interprets “the Schauerian argument of reliability/predictability for those affected by the rule” in favor of vandalism profiling. Here, though, he fails to explain the benefits of vandals being able to predict which kind of edits will be subject to scrutiny. This also calls into question his subsequent remark that “it is unfortunate that the anti-vandalism system in use remains opaque to ordinary users”. The remaining two of Schauer’s four arguments are judged as less pertinent. But overall the paper concludes that it is possible to justify the existence of vandalism profiling rules as beneficial via Schauer’s theoretical framework.

Image caption: Police traffic stops: A good analogy for anti-vandalism patrol on Wikipedia? (Photo by böhringer, CC BY-SA 3.0)

Next, de Laat turns to question 2, on whether vandalism profiling is also morally justified. Here he relies on later work by Schauer, from a 2003 book, “Profiles, Probabilities, and Stereotypes”, that studies such matters as profiling by tax officials (selecting which taxpayers have to undergo an audit), airport security (selecting passengers for screening) and by police officers (e.g. selecting cars for traffic stops). While profiling of some kind is a necessity for all these officials, the particular characteristics (dimensions) used for profiling can be highly problematic (see e.g. Driving While Black). For de Laat’s study of Wikipedia profiling, “two types of complications are important: (1) possible ‘overuse’ of dimension(s) (an issue of profile effectiveness) and (2) social sensibilities associated with specific dimension(s) (a social and moral issue).” Overuse can mean relying on stereotypes that have no basis in reality, or over-reliance on some dimensions that, while having a non-spurious correlation with the deviant behavior, are over-emphasized at the expense of other relevant characteristics because they are more visible or salient to the profiler. E.g. while Schauer considers that it may be justified for “airport officials looking for explosives [to] single out for inspection the luggage of younger Muslim men of Middle Eastern appearance”, it would be an over-use if “officials ask all Muslim men and all men of Middle Eastern origin to step out of line to be searched”, thus reducing their effectiveness by neglecting other passenger characteristics. This is also an example of the second type of complication in profiling, where the selected dimensions are socially sensitive – indeed, for the specific case of luggage screening in the US, “the factors of race, religion, ethnicity, nationality, and gender have expressly been excluded from profiling” since 1997.

Applying this to the case of Wikipedia’s anti-vandalism efforts, de Laat first observes that complication (1) (overuse) is not a concern for fully automated tools like ClueBotNG – obviously their algorithm applies the existing profile directly without a human intervention that could introduce this kind of bias. For Huggle and STiki, however, “I see several possibilities for features to be overused by patrollers, thereby spoiling the optimum efficacy achievable by the profile embedded in those tools.” This is because both tools do not just use these features in their automatic pre-selection of edits to be reviewed, but also reveal to the human patroller, in the edit review interface, at least whether an edit was anonymous. (The paper examines this in detail for both tools, also observing that Huggle presents more opportunities for this kind of overuse, while STiki is more restricted. However, there seems to have been no attempt to study empirically whether this overuse actually occurs.)

Regarding complication (2), whether some of the features used for vandalism profiling are socially sensitive, de Laat highlights that they include some amount of discrimination by nationality: IP edits geolocated to the US, Canada, and Australia have been found to contain vandalism more frequently and are thus more likely to be singled out for inspection. However, he does not consider this concern “strong enough to warrant banning the country-dimension and correspondingly sacrifice some profiling efficacy”, chiefly because there do not appear to be a lot of nationalistic tensions within the English Wikipedia community that could be stirred up by this.

In contrast, de Laat argues that “the targeting of contributors who choose to remain anonymous … is fraught with danger since anons already constitute a controversial group within the Wikipedian community.” Still, he acknowledges the “undisputed fact” that the ratio of vandalism is much higher among anonymous edits. Also, he rejects the concern that they might be more likely to be the victim of false positives:

normally [IP editors] do not experience any harm when their edits are selected and inspected as a result of anon-powered profiling; they will not even notice that they were surveilled since no digital traces remain of the patrolling. … The only imaginable harm is that patrollers become over focussed on anons and indulge in what I called above ‘overinspection’ of such edits and wrongly classify them as vandalism … As a consequence, they might never contribute to Wikipedia again. … Nevertheless, I estimate this harm to be small. At any rate, the harm involved would seem to be small in comparison with the harassment of racial profiling—let alone that an ‘expressive harm hypothesis’ applies.

With this said, de Laat still makes the controversial call “that the anonymous-dimension should be banned from all profiling efforts” – including removing it from the scoring algorithms of Huggle, STiki and ClueBotNG. Instead of concerns about individual harm,

my main argument for the ban is a decidedly moral one. From the very beginning the Wikipedian community has operated on the basis of a ‘social contract’ that makes no distinction between anons and non-anons – all are citizens of equal stature. … In sum, the express profiling of anons turns the anonymity dimension from an access condition into a social distinction; the Wikipedian community should refrain from institutionalizing such a line of division. Notice that I argue, in effect, that the Wikipedian community has only two choices: either accept anons as full citizens or not; but there is no morally defensible social contract in between.

Sadly, while the paper is otherwise rich in citations and details, it completely fails to provide evidence for the existence of this alleged social contract. While it is true that “the ability of almost anyone to edit (most) articles without registration” forms part of Wikipedia’s founding principles (a principle that this reviewer strongly agrees with), the “equal stature” part seems to be de Laat’s own invention – there is a long list of things that, by longstanding community consensus, require the use of an account (which after all is freely available to everyone, without even requiring an email address). Most of these restrictions – say, the inability to create new articles or being prevented from participating in project governance during admin or arbcom votes – seem much more serious than the vandalism profiling that is the topic of de Laat’s paper. TB


Conferences and events

Other recent publications

A list of other recent publications that could not be covered in time for this issue—contributions are always welcome for reviewing or summarizing newly published research. This month, the list mainly gathers research about the extraction of specific content from Wikipedia.

  • “Large SMT Data-sets Extracted from Wikipedia”[8] From the abstract: “The article presents experiments on mining Wikipedia for extracting SMT [statistical machine translation] useful sentence pairs in three language pairs. … The optimized SMT systems were evaluated on unseen test-sets also extracted from Wikipedia. As one of the main goals of our work was to help Wikipedia contributors to translate (with as little post editing as possible) new articles from major languages into less resourced languages and vice-versa, we call this type of translation experiments ‘in-genre’ translation. As in the case of ‘in-domain’ translation, our evaluations showed that using only ‘in-genre’ training data for translating same genre new texts is better than mixing the training data with ‘out-of-genre’ (even) parallel texts.”
  • “Recognizing Biographical Sections in Wikipedia”[9] From the abstract: “Thanks to its coverage and its availability in machine-readable format, [Wikipedia] has become a primary resource for large scale research in historical and cultural studies. In this work, we focus on the subset of pages describing persons, and we investigate the task of recognizing biographical sections from them: given a person’s page, we identify the list of sections where information about her/his life is present [as opposed to nonbiographical sections, e.g. ‘Early Life’ but not ‘Legacy’ or ‘Selected writings’].”
  • “Extraction of lethal events from Wikipedia and a semantic repository”[10] From the abstract and conclusion: “This paper describes the extraction of information on lethal events from the Swedish version of Wikipedia. The information searched includes the persons’ cause of death, origin, and profession. […] We also extracted structured semantic data from the Wikidata store that we combined with the information retrieved from Wikipedia … [The resulting] data could not support the existence of the Club 27”.
  • “Learning Topic Hierarchies for Wikipedia Categories”[11] (from frequently used section headings in a category, e.g. “eligibility”, “endorsements” or “results” for Category:Presidential elections)
  • “‘A Spousal Relation Begins with a Deletion of engage and Ends with an Addition of divorce’: Learning State Changing Verbs from Wikipedia Revision History.”[12] From the abstract: “We propose to learn state changing verbs [such as ‘born’, ‘died’, ‘elected’, ‘married’] from Wikipedia edit history. When a state-changing event, such as a marriage or death, happens to an entity, the infobox on the entity’s Wikipedia page usually gets updated. At the same time, the article text may be updated with verbs either being added or deleted to reflect the changes made to the infobox. … We observe in our experiments that when state-changing verbs are added or deleted from an entity’s Wikipedia page text, we can predict the entity’s infobox updates with 88% precision and 76% recall.”
  • “Extracting Representative Phrases from Wikipedia Article Sections”[13] From the abstract: “Since [Wikipedia’s] long articles are taking time to read, as well as section titles are sometimes too short to capture comprehensive summarization, we aim at extracting informative phrases that readers can refer to.”
  • “Accurate Fact Harvesting from Natural Language Text in Wikipedia with Lector”[14] From the abstract: “Many approaches have been introduced recently to automatically create or augment Knowledge Graphs (KGs) with facts extracted from Wikipedia, particularly its structured components like the infoboxes. Although these structures are valuable, they represent only a fraction of the actual information expressed in the articles. In this work, we quantify the number of highly accurate facts that can be harvested with high precision from the text of Wikipedia articles […]. Our experimental evaluation, which uses Freebase as reference KG, reveals we can augment several relations in the domain of people by more than 10%, with facts whose accuracy are over 95%. Moreover, the vast majority of these facts are missing from the infoboxes, YAGO and DBpedia.”
  • “Extracting Scientists from Wikipedia”[15] From the abstract: “[We] describe a system that gathers information from Wikipedia articles and existing data from Wikidata, which is then combined and put in a searchable database. This system is dedicated to making the process of finding scientists both quicker and easier.”
  • “LeadMine: Disease identification and concept mapping using Wikipedia”[16] From the abstract: “LeadMine, a dictionary/grammar based entity recognizer, was used to recognize and normalize both chemicals and diseases to MeSH [Medical Subject Headings] IDs. The lexicon was obtained from 3 sources: MeSH, the Disease Ontology and Wikipedia. The Wikipedia dictionary was derived from pages with a disease/symptom box, or those where the page title appeared in the lexicon.”
  • “Finding Member Articles for Wikipedia Lists”[17] From the abstract: “… for a given Wikipedia article and list, we determine whether the article can be added to the list. Its solution can be utilized on automatic generation of lists, as well as generation of categories based on lists, to help self-organization of knowledge structure. In this paper, we discuss building classifiers for judging on whether an article belongs to a list or not, where features are extracted from various components including list titles, leading sections, as well as texts of member articles. … We report our initial evaluation results based on Bayesian and other classifiers, and also discuss feature selection.”
  • “Study of the content about documentation sciences in the Spanish-language Wikipedia”[18] (in Spanish). From the English abstract: “This study explore how [Wikipedia] addresses the documentation sciences, focusing especially on pages that discuss the discipline, not only the page contents, but the relationships between them, their edit history, Wikipedians who participated and all aspects that can influence on how the image of this discipline is projected” [sic]. TB


  1. Banerjee, Siddhartha; Mitra, Prasenjit. “WikiWrite: Generating Wikipedia Articles Automatically”.
  2. Banerjee, Siddhartha; Mitra, Prasenjit (October 2015). “WikiKreator: Automatic Authoring of Wikipedia Content”. AI Matters 2 (1): 4–6. doi:10.1145/2813536.2813538. ISSN 2372-3483. (Closed access)
  3. Banerjee, Siddhartha; Mitra, Prasenjit (July 2015). “WikiKreator: Improving Wikipedia Stubs Automatically”. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China: Association for Computational Linguistics. pp. 867–877.
  4. Banerjee, Siddhartha; Mitra, Prasenjit (2015). “Filling the Gaps: Improving Wikipedia Stubs”. Proceedings of the 2015 ACM Symposium on Document Engineering. DocEng ’15. New York, NY, USA: ACM. pp. 117–120. doi:10.1145/2682571.2797073. ISBN 9781450333078. (Closed access)
  5. Laat, Paul B. de (30 April 2016). “Profiling vandalism in Wikipedia: A Schauerian approach to justification”. Ethics and Information Technology: 1–18. doi:10.1007/s10676-016-9399-8. ISSN 1388-1957.
  6. See e.g. Halfaker, Aaron (December 6, 2015). “Disparate impact of damage-detection on anonymous Wikipedia editors”. Socio-technologist.
  7. Laat, Paul B. de (2 September 2015). “The use of software tools and autonomous bots against vandalism: eroding Wikipedia’s moral order?”. Ethics and Information Technology 17 (3): 175–188. doi:10.1007/s10676-015-9366-9. ISSN 1388-1957.
  8. Tufiş, Dan; Ion, Radu; Dumitrescu, Ştefan; Ştefănescu, Dan (26 May 2014). “Large SMT Data-sets Extracted from Wikipedia” (PDF). Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). TUFI 14.103. ISBN 978-2-9517408-8-4.
  9. Aprosio, Alessio Palmero; Tonelli, Sara (17 September 2015). “Recognizing Biographical Sections in Wikipedia”. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal. pp. 811–816.
  10. Norrby, Magnus; Nugues, Pierre (2015). “Extraction of lethal events from Wikipedia and a semantic repository” (PDF). Workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015. Vilnius, Lithuania.
  11. Hu, Linmei; Wang, Xuzhong; Zhang, Mengdi; Li, Juanzi; Li, Xiaoli; Shao, Chao; Tang, Jie; Liu, Yongbin (2015-07-26). “Learning Topic Hierarchies for Wikipedia Categories” (PDF). Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers). Beijing, China. pp. 346–351.
  12. Nakashole, Ndapa; Mitchell, Tom; Wijaya, Derry (2015). “‘A Spousal Relation Begins with a Deletion of engage and Ends with an Addition of divorce’: Learning State Changing Verbs from Wikipedia Revision History” (PDF). Proceedings of EMNLP 2015. Lisbon, Portugal. pp. 518–523.
  13. Liu, Shan; Iwaihara, Mizuho (2016). “Extracting Representative Phrases from Wikipedia Article Sections”. DEIM Forum 2016. C3-6.
  14. Cannaviccio, Matteo; Barbosa, Denilson; Merialdo, Paolo (2016). “Accurate Fact Harvesting from Natural Language Text in Wikipedia with Lector”. Proceedings of the 19th International Workshop on Web and Databases. WebDB ’16. New York, NY, USA: ACM. doi:10.1145/2932194.2932203. ISBN 9781450343107. (Closed access)
  15. Ekenstierna, Gustaf Harari; Lam, Victor Shu-Ming. “Extracting Scientists from Wikipedia”. Digital Humanities 2016. From Digitization to Knowledge 2016: Resources and Methods for Semantic Processing of Digital Works/Texts, Proceedings of the Workshop, July 11, 2016, Krakow, Poland.
  16. Lowe, Daniel M.; O’Boyle, Noel M.; Sayle, Roger A. “LeadMine: Disease identification and concept mapping using Wikipedia” (PDF). Proceedings of the fifth BioCreative challenge evaluation workshop. BCV 2015. pp. 240–246.
  17. Sun, Shuang; Iwaihara, Mizuho (2016). “Finding Member Articles for Wikipedia Lists”. DEIM Forum 2016. C3-3.
  18. Martín Curto, María del Rosario (2016-04-15). “Estudio sobre el contenido de las Ciencias de la Documentación en la Wikipedia en español”. Bachelor’s thesis, University of Salamanca, 2014.

Wikimedia Research Newsletter
Vol: 6 • Issue: 8 • August 2016
This newsletter is brought to you by the Wikimedia Research Committee and The Signpost