Wikimedia Research Newsletter
Vol: 5 • Issue: 9 • September 2015

Wiktionary special; newbies, conflict and tolerance; Is Wikipedia’s search function inferior?

With contributions by: Federico Leva, Panda10, Piotr Konieczny, Trey Jones and Tilman Bayer

“Teaching Philosophy by Designing a Wikipedia Page”

Wikipedia research is still not often seen in book form. Here’s one of the rare exceptions: a book chapter on “Teaching Philosophy by Designing a Wikipedia Page”.[1] It is an essay in which the author describes his experiences in teaching a class with a “write a Wikipedia article” assignment, specifically starting the Collective intentionality page. The students worked in teams, each tasked with improving a different part of the article (from separate parts of the literature review to ensuring that the article conforms to different elements of Wikipedia’s manual of style). The end result was quite successful: a well-written new Wikipedia entry (see the revision as of the instructor’s last edit to the article in January 2013), and the students appear to have assessed the experience positively, particularly with regard to having an impact on the real world (i.e. creating a publicly visible Wikipedia article). The author concludes that the students benefit both from contributing to public knowledge and from learning how public knowledge is created.

Unfortunately, it appears that (as is still too often the case) the author (Graham Hubbs of the University of Idaho, presumably User:Phil) was not aware of the Wikipedia:Education Program, as no entry for the course was created at Wikipedia:School and university projects. It may therefore be wise for the editors associated with the Wiki Education Foundation (some of whom, I hope, are reading this) to pursue this and contact the author – as someone who was quite happy with his first experiment in teaching with Wikipedia, he may be happy to learn we offer extensive support for this (at least, as far as the US goes). On a final note, I do observe, sadly, that neither the instructor nor any of the students seem to have kept editing Wikipedia after the course was over (apart from a single edit), which seems to be all too common with educational assignments in general.

Wikipedia Search Isn’t Necessarily Third BESt

How tall is Claudia Schiffer? And how to find out on Wikipedia?

Review by Trey Jones (WMF Discovery department)

What’s the best way to use Wikipedia to answer questions like, “How tall is Claudia Schiffer?” or “Who has Tom Cruise been married to?”—and what tools can make this easier?

In their paper, “Expressivity and Accuracy of By-Example Structured Queries on Wikipedia,”[2] Atzori and Zaniolo seek to compare their query-answering system (“the By-Example Structured (BESt) Query paradigm implemented on the SWiPE system through the Wikipedia interface”) against “Xser, a state-of-the-art Question Answering system”, and against “plain keyword search provided by the Wikipedia Search Engine.” Their results on a standard set of question answering tasks from QALD put SWiPE on top, with F-measure scores for SWiPE, Xser, and Wikipedia at 0.88, 0.72, and a dismal 0.18, respectively.
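For readers unfamiliar with the F-measure used in these scores, here is a minimal sketch of how it combines precision and recall for a single question. The function name and the placeholder answers are hypothetical, and how the benchmark aggregates per-question scores into one figure is not described in this review, so only the per-question step is shown.

```python
def f_measure(returned, gold, beta=1.0):
    """Return (precision, recall, F_beta) for one question.

    returned: answers produced by the system under test
    gold:     reference answers for the question
    """
    returned, gold = set(returned), set(gold)
    if not returned or not gold:
        return 0.0, 0.0, 0.0
    hits = len(returned & gold)
    precision = hits / len(returned)  # share of returned answers that are correct
    recall = hits / len(gold)         # share of correct answers that were returned
    if hits == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical placeholder answers, not taken from the benchmark.
print(f_measure({"answer_1", "answer_2"}, {"answer_1", "answer_3"}))
```

With the default `beta=1.0` this is the harmonic mean of precision and recall, so a system scores highly only when it does well on both.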

Their approach is based on transforming Wikipedia infoboxes into editable templates that serve as a front end for SPARQL queries run against RDF triples (subject–predicate–object expressions) stored in DBpedia. It is a novel approach that suggests a number of other avenues for improving search and discovery on Wikipedia and elsewhere. However, their methods and results are incommensurable both to Xser and to Wikipedia’s native keyword search.

In an earlier paper on SWiPE,[supp 1] the authors describe the need for custom (“page-dependent”) mappings from any given infobox element to the appropriate internal representations for mapping to SPARQL/RDF. These mappings appear to have all been created manually. Given these behind-the-scenes mappings from infoboxes to RDF elements, a user, working by analogy from an existing infobox, maps query concepts to the appropriate infobox element.

BESt/SWiPE thus pushes much of the language and conceptual processing (tasks at which humans excel) onto the human user: the human chooses an existing Wikipedia entry on an appropriate analogous topic, pulls out relevant entities and relationships from the text of the query, and maps them to appropriate infobox components. These tasks can be non-trivial. Answering a question like “Who has Tom Cruise been married to?”, for example, requires mapping “Tom Cruise” to the relevant category of, say, actor, finding another actor to use as a template, and mapping the “married to” relationship in the query to the “Spouse(s)” element of the infobox.

Contrast this with Xser,[3] which uses natural language processing to automatically parse a given query and convert it into a structured format, which is then automatically mapped to a structured query (e.g. SPARQL) against a knowledge base (e.g., DBpedia)—all independent of any human posing or reading a given query or mapping to KB elements. The comparison is thus more properly between BESt/SWiPE + a human and Xser, in which case it is less surprising that BESt/SWiPE comes out on top.

The comparison to the “plain keyword search provided by the Wikipedia Search Engine” is similarly disingenuous. The authors extracted search terms from the set of queries they investigated, apparently manually, but without the level of insight into natural language (or Wikipedia!) that is required in the BESt/SWiPE workflow, given the mappings of infobox elements to conceptual categories and the parsing of queries to map them to infobox elements.

The authors’ translation of questions into keywords for Wikipedia queries is sophisticated from a language processing point of view, but naive from a search point of view. “How tall is Claudia Schiffer?” became search terms (Claudia Schiffer, tall), though any sophisticated searcher should know that height is usually listed under “height”, not “tall”. (The query still works because it gets to the Claudia Schiffer wiki page, despite the distractor term “tall”.) They drop the word “produce” from a question about where beer is produced, but leave it in for a producer (but don’t use “producer”, which is the expected specific title to be found on that person’s wiki page).

More generally, when searching Wikipedia, the authors fail to note when a question is fundamentally about the basic properties of a given entity, so that any search terms other than the name of that entity are a distraction in that search. (E.g., “How tall is Claudia Schiffer?” is about Claudia Schiffer, “Which river does the Brooklyn Bridge cross?” is about the Brooklyn Bridge, “In which U.S. state is Mount McKinley located?” is about Mount McKinley.) No human user familiar with Wikipedia (or even a dead-tree encyclopedia) would search for “Claudia Schiffer, tall” when asked to find out how tall she is.

The authors also fail to take advantage of any knowledge about the typical structure and content of Wikipedia, and so don’t search for the obvious “list of X” articles that often answer the questions with sortable tables that any frequent user of Wikipedia (much less an editor and contributor) would be very familiar with. As an example, mapping the question “Which U.S. state has the highest population density?” to the search “list of U.S. states by population density” is natural, and it happens to be an exact match to a page I’d never seen before, but surmised was likely to exist.

The authors do afford considerable sophistication to their hypothetical BESt/SWiPE user, who knows, for example, to model the query to answer “Which books by Kerouac were published by Viking Press?” on a book, rather than on an author. It makes sense in retrospect, considering the information available in a book infobox, but my first inclination was that this was a question about an author, and an author infobox is insufficient for this question.

Again, the results attained by Wikipedia + naively extracted queries and BESt/SWiPE + a sophisticated human are incommensurate. A sophisticated Wikipedian + Wikipedia would fare much better than the poor 18% F-measure reported by Atzori & Zaniolo. And it seems likely that a relatively sophisticated Wikipedian is easier to come by than someone who can map queries to example entities and their infobox components after having mapped infobox components to RDF entities and relationships.

To be fair, keyword searches on Wikipedia can’t readily answer questions that do not appear on a single page in Wikipedia. Some answers would be very tedious indeed to determine, such as “Give me all people that were born in Vienna and died in Berlin”, because they require collating information across many pages. But that’s exactly the kind of information about relationships between entities—and even chains of relationships among different kinds of entities—that one expects to be extracted via SPARQL from RDF triples in a data store such as DBpedia or Wikidata.
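To make concrete why such a question suits a triple store, here is a minimal sketch of the Vienna/Berlin question as a join over subject–predicate–object triples. The in-memory list stands in for a store such as DBpedia, the person identifiers are hypothetical placeholders rather than real data, and the predicate names are chosen for illustration only.

```python
# Hypothetical placeholder triples; the person names are illustrative, not real data.
TRIPLES = [
    ("person_a", "birthPlace", "Vienna"),
    ("person_a", "deathPlace", "Berlin"),
    ("person_b", "birthPlace", "Vienna"),
    ("person_b", "deathPlace", "Paris"),
    ("person_c", "birthPlace", "Berlin"),
    ("person_c", "deathPlace", "Berlin"),
]


def subjects(triples, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the store."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def born_and_died(triples, birth, death):
    # Intersecting the two result sets joins the patterns on their shared
    # subject, much as reusing one variable in two SPARQL triple patterns would.
    return subjects(triples, "birthPlace", birth) & subjects(triples, "deathPlace", death)


print(born_and_died(TRIPLES, "Vienna", "Berlin"))
```

A keyword search would have to visit every biography to collate the same facts, whereas here the answer falls out of two lookups and a set intersection.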

Finding ways for users to productively access such structured information—be it through natural language processing as with Xser, through structured by-example queries as with BESt/SWiPE, or other approaches—is a worthy goal; but it is only fair to compare approaches that operate in the same general realm in terms of available automation and necessary user sophistication.

More newbies mean more conflict, but extreme tolerance can still achieve eternal peace

An article titled “Modeling social dynamics in a collaborative environment”,[4] published last year in the Data Science section of the European Physical Journal, describes a simplified numerical model for how Wikipedia’s coverage of contentious topics may develop over time. It presents evidence that this model matches some aspects of real-life edit wars and debates.

The opinions of editors on a particular issue are modeled as a one-dimensional variable: “In the Liancourt Rocks territorial dispute between South Korea and Japan, for example, the values x=0,1 represent the extreme position of favoring sovereignty of the islets for a particular country”. Somewhat contrary to Wikipedia’s neutral point of view (NPOV) policy, which is never mentioned in the paper, the authors assert that an article’s coverage of such a topic always expresses a particular opinion too, likewise modeled as a point on this scale.

The paper first considers pairwise encounters between editors (“agents”) where “people with very different opinions simply do not pay attention to each other, but similar agents debate and converge their views” by a certain amount that is governed by a parameter describing how “stubborn” opinions are. This is a well-studied model of opinion dynamics, known as “bounded confidence” (for the “confidence” or “tolerance” parameter that describes the limit until which agents are similar enough to still influence each other). It also matches the description of inelastic collisions of two particles in certain kinds of gases in statistical physics.
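A minimal sketch of one such pairwise encounter, written in the common textbook form of a bounded confidence update. The function and parameter names are hypothetical, and the paper’s exact parametrization of “stubbornness” is not reproduced in this review, so `mu` (how far each agent moves toward the other) is an assumption.

```python
def pairwise_update(x_i, x_j, tolerance, mu):
    """One encounter between two agents holding opinions x_i and x_j in [0, 1].

    tolerance: agents further apart than this ignore each other
    mu:        convergence rate in (0, 0.5]; a smaller mu means more stubborn agents
    """
    if abs(x_i - x_j) >= tolerance:
        return x_i, x_j  # too different: no interaction
    shift = mu * (x_j - x_i)
    # Both agents move toward each other by the same amount, so the pair's
    # mean opinion is conserved (the inelastic-collision analogy).
    return x_i + shift, x_j - shift


print(pairwise_update(0.1, 0.9, tolerance=0.2, mu=0.5))  # far apart: unchanged
print(pairwise_update(0.4, 0.6, tolerance=0.5, mu=0.5))  # close enough: they converge
```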

To describe the interaction of an editor (“agent”) with an article (“medium”), a second kind of dynamic is introduced in the model. Here, the equations state that editors will change an article if it differs too much from their own opinion (as defined by a second tolerance parameter), but will change their opinion towards the article’s if they already have a similar opinion.
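The editor–article rule described in this paragraph can be sketched as follows. This is an illustration under stated assumptions rather than the paper’s equations: the function name, the single shared rate `mu`, and the linear form of both moves are hypothetical choices.

```python
def agent_medium_update(opinion, article, tolerance, mu):
    """One interaction between an editor and the article; returns (opinion, article).

    tolerance: how far the article may sit from the editor's view before
               the editor rewrites it
    mu:        assumed rate for both kinds of move (the paper may use
               separate rates)
    """
    if abs(opinion - article) > tolerance:
        # The article is too far from the editor's view: the editor edits it,
        # pulling the article toward their own opinion.
        return opinion, article + mu * (opinion - article)
    # The article is close enough: the editor is persuaded instead and
    # drifts toward the article's position.
    return opinion + mu * (article - opinion), article


print(agent_medium_update(0.9, 0.1, tolerance=0.2, mu=0.5))  # editor changes the article
print(agent_medium_update(0.2, 0.1, tolerance=0.2, mu=0.5))  # article changes the editor
```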

Temporal evolution of the opinions of editors (green) and the article (red) for different values of the tolerance and stubbornness parameters, assuming a fixed community of editors

The numerical simulation of an article’s history consists of discrete steps combining both dynamics: interactions between editors (for example on talk pages) and an edit made by an editor to the article.

For a “fixed agent pool” where no editors join or leave, it can be shown that “the dynamics always reaches a peaceful state where all agents’ opinions lie within the tolerance of the medium”. The authors note that this “contrasts drastically with the behavior of the bounded confidence mechanism alone, where consensus is never attained” (unless the tolerance parameter is large). In other words, the interaction on the article as the shared medium sets Wikipedia apart from systems that only support discussion (Usenet flamewars come to mind).

However, depending on the values of the tolerance and stubbornness parameters, this eventual “peaceful state” can take a long time to reach, with various possible dynamics – see the figure from the authors’ simulations on the right. They note that “Quite surprisingly, the final consensual opinion [does not need to lie in the middle, or match] that of the initial mainstream group, but [can sometimes be] some intermediate value closer to the extremist groups at the boundaries.”

More newbies mean more conflict (right hand side), but eternal peace can still be achieved if editors’ tolerance for leaving differing opinions in the article is high enough (top).

Going beyond the simplified “fixed agent pool” assumption, the authors note that “in real WP articles the pool of editors tends to change frequently … Such feature of agent renewal during the process or writing an article may destroy consensus and lead to a steady state of alternating conflict and consensus phases, which we take into account by introducing thermal noise in the model.” Whether permanent consensus is still eventually reached, or how long it lasts before it is interrupted by periods of conflict, depends on the parameters, including the rate at which newbies enter the community.

Distribution of the length of “peace periods” in the history of three articles (square dots) and in the paper’s theoretical simulations (lines)

In the last part of the paper, the authors compare their theoretical model with actual revision histories of articles on the English Wikipedia. They use a numerical measure of an article’s “controversiality”, introduced by some of the group in an earlier paper (see review: “Dynamics of edit wars”). It basically counts reverts, but gives more weight to reverts between experienced editors. The development of this number over time describes periods of conflict and peace in the article. The authors state that using this metric, almost all controversial articles can be classified by three scenarios:

(i) Single war to consensus: In most cases controversial articles can be included in this category. A single edit war emerges and reaches consensus after a while, stabilizing quickly. If the topic of the article is not particularly dynamic, the reached consensus holds for a long period of time [… Example: Jyllands-Posten Muhammad cartoons controversy]
(ii) Multiple war-peace cycles: In cases where the topic of the article is dynamic but the rate of new events (or production of new information) is not higher than the pace to reach consensus, multiple cycles of war and peace may appear [… Example: Iran].
(iii) Never-ending wars: Finally, when the topic of the article is greatly contested in the real world and there is a constant stream of new events associated with the subject, the article tends not to reach a consensus [… Example: Barack Obama]

For their theoretical agent/medium model, the authors define an equivalent of this controversiality measure, and find that it “closely reproduce[s its] qualitative behavior […] for different war scenarios” in numerical simulations.
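As a rough illustration of a revert count that weights experienced editors more heavily, here is a minimal sketch. Weighting each revert by the smaller of the two editors’ edit counts is an assumption made for this example; the exact definition from the earlier paper is not given in this review, and the editor names and counts are hypothetical.

```python
def controversy_score(reverts, edit_counts):
    """Sum of revert weights for one article.

    reverts:     iterable of (reverter, reverted) editor pairs
    edit_counts: dict mapping each editor to their number of edits
    """
    score = 0
    for reverter, reverted in reverts:
        # Assumed weighting: a revert counts for as much as the less
        # experienced of the two editors, so clashes between two veterans
        # dominate and reverts of drive-by edits barely register.
        score += min(edit_counts[reverter], edit_counts[reverted])
    return score


# Hypothetical editors and edit counts.
counts = {"veteran_1": 100, "veteran_2": 50, "newcomer": 2}
print(controversy_score([("veteran_1", "veteran_2"), ("newcomer", "veteran_1")], counts))
```

Tracking this score revision by revision would give the time series whose flat stretches correspond to the “peace periods” discussed above.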

A preprint letter with the same title involving the same authors, announcing some of the paper’s results, was covered in our July 2012 issue.

Predicting Wikimedia pageviews to within 2%

A graph of Wikimedia pageviews since April 2015 (mobile vs. all, using the new page view definition data that was not yet available to the authors)

A 2014 conference paper[5], recently republished as part of a dissertation in computer science, analyzed more than five years of hourly traffic data published by the Wikimedia Foundation, as part of an effort to develop methods for better predicting workloads of web servers. The authors call it “the longest server workload study we are aware of”. From the abstract:

“With descriptive statistics, time-series analysis, and polynomial splines, we study the trend and seasonality of [Wikimedia traffic], its evolution over the years, and also investigate patterns in page popularity.
Our results indicate that the workload is highly predictable with a strong seasonality. Our short term prediction algorithm [one week ahead] is able to predict the workload with a Mean Absolute Percentage Error of around 2%.”
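The error metric named in the abstract, Mean Absolute Percentage Error, can be sketched in a few lines. The function name and the sample numbers are hypothetical and are not taken from the paper.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent.

    actual and predicted are equal-length sequences; actual values must be non-zero.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical hourly request counts and forecasts.
print(mape([100, 200], [98, 204]))
```

Because each error is divided by the actual value, an hour with low traffic counts as much as a peak hour, which is one reason the metric is popular for comparing workloads of different sizes.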

The study decomposed the time series of pageview numbers into several components:

  • a seasonality component with daily and weekly periods (without yearly parts in the presented example, as it covered only a little over a month), estimated by fitting cubic splines
  • a trend line approximated with a piecewise linear function
  • and a remainder modeled with an ARIMA (Autoregressive Integrated Moving Average) model using the R forecast software package.
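The three-part decomposition listed above can be sketched with the standard library alone. Note the substitutions, which are assumptions of this example and not the authors’ method: a single least-squares line stands in for the piecewise linear trend, per-phase averages stand in for the cubic splines, and the remainder is simply returned rather than fitted with an ARIMA model. All names are hypothetical.

```python
def fit_line(values):
    """Least-squares slope and intercept of values against their index."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    return slope, y_mean - slope * t_mean


def decompose(values, period):
    """Split a series into trend + seasonal + remainder (additive model).

    Requires len(values) >= 2 and len(values) >= period.
    """
    slope, intercept = fit_line(values)
    trend = [intercept + slope * t for t in range(len(values))]
    detrended = [y - tr for y, tr in zip(values, trend)]
    # Seasonal profile: the mean detrended value at each phase of the period
    # (e.g. each hour of the week when period = 168 for hourly data).
    profile = []
    for phase in range(period):
        slot = detrended[phase::period]
        profile.append(sum(slot) / len(slot))
    seasonal = [profile[t % period] for t in range(len(values))]
    remainder = [y - tr - s for y, tr, s in zip(values, trend, seasonal)]
    return trend, seasonal, remainder


# Hypothetical series: steady growth plus a repeating four-step pattern.
series = [10 + 2 * t + [0, 3, 0, -3][t % 4] for t in range(12)]
trend, seasonal, remainder = decompose(series, period=4)
print(seasonal[:4])
```

Whatever is left in `remainder` is what a model such as ARIMA would then be asked to predict.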

(Readers might also be interested in a recently announced online traffic forecast application by the WMF Research and Data team, which likewise uses an ARIMA model, and allows predicting traffic for individual projects, but is based on coarser monthly time series.)

The study acknowledges that both the website’s content and its server setup changed a lot over the examined timespan (May 2008 – October 2013), with the number of Wikipedia articles roughly tripling, and e.g. the main hosting site moving from Florida to Virginia and a separate server site in Korea closing. The authors also observe that traffic “dynamics changed tremendously during the period studied with visible steps, e.g., at the end of 2012 and early 2013” (where their diagram – Figure 2(a) on p III.4 – shows large upwards spikes), which “suggests a change in the underlying process of the workload. In this case trying to build a single global model can be deceiving and inaccurate. Instead of building a single global model, we modeled smaller periods of the workload where there was no significant step.” This makes the work somewhat less interesting for those who are interested in longer-term strategic predictions rather than short-term allocation of server resources. On the other hand, such deviations from a prediction model could potentially be used in reverse to identify such “a change in the underlying process” (e.g. software changes affecting reader experience or a web censorship effort), or provide evidence for its impact on traffic. The authors’ own use case requires such detection for the case of short-term upward outliers (those that increase server load), enabling a quick change of the prediction model.

The paper discusses two examples of such unexpected spikes: the 2009 death of Michael Jackson, which overloaded WMF servers, and the Super Bowl XLV in 2010.

Another chapter concerns the list of the 500 most popular pages. The authors found that it is highly volatile, with “41.58% of the top 500 pages joining and leaving the top 500 list every hour, 87.7% of them staying in the top 500 list for 24 hours or less and 95.24% of the top-pages staying in the top 500 list for a week or less.”
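One simple way to quantify this kind of volatility is to compare consecutive hourly lists. The sketch below is an assumption about how such churn could be measured, not the paper’s definition, and the page names are hypothetical placeholders.

```python
def hourly_churn(previous_top, current_top):
    """Fraction of the current top list that was absent from the previous one."""
    prev, cur = set(previous_top), set(current_top)
    if not cur:
        return 0.0
    return len(cur - prev) / len(cur)


# Hypothetical top-4 lists for two consecutive hours.
print(hourly_churn(["page_a", "page_b", "page_c", "page_d"],
                   ["page_a", "page_b", "page_e", "page_f"]))
```

Averaging this value over every pair of consecutive hours would give a single turnover figure for the whole observation period.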

The freely available traffic data from the Wikimedia Foundation also features prominently in a draft publication included in the same dissertation as “Paper VI”[6] which examines the performance of algorithms that automatically scale server resources to changing traffic, “using 796 distinct real workload traces from projects hosted on the Wikimedia foundations’ servers”. Having found in that paper that “it is not possible to design an autoscaler with good performance for all workloads and all scenarios”, another draft publication (included as “Paper VII” in the dissertation)[7] “proposes WAC, a Workload Analysis and Classification tool for automatic selection of a number of implemented cloud auto-scaling methods.” Using machine learning methods, this classifier is trained on several datasets including again “798 workloads to different Wikimedia foundation projects” (mentioning the French “Wikitionary” [sic] as one example). The authors remark that “we have performed a correlation analysis on the selected [Wikimedia] workloads, and we found that they are practically not correlated”.

The dissertation was defended this week at Umeå University in Sweden. A startup has been founded based on the research results, which has also patented some of them.

Wiktionary special


While most of the research featured in this newsletter examines Wikipedia, other Wikimedia Foundation projects have attracted researcher attention too. Below we present a roundup of recent research about Wiktionary, plus one older paper. See also our earlier coverage of Wiktionary-related research.

“Online dictionaries in Web 2.0 platform – Wikiszótár and Wiktionary”

This Hungarian-language paper[8] (with an English abstract) was published in December 2010 in the “Review” section of the Journal of Hungarian Terminology. Its aim is to provide an introduction to online collaborative dictionaries, Web 2.0, and the wiki platform using the Hungarian and English Wiktionary as an example. In the section that discusses dictionary criticism, the author notes that a systematic, generally agreed upon set of criteria for evaluating online dictionaries has yet to be developed, so he conducts the evaluation based on methods originally designed for printed dictionaries. The article describes the elements of Wiktionary’s structure and content in detail, and compares the two Wiktionaries to each other and to printed dictionaries. This can be useful information for someone who is not familiar with online collaborative dictionaries and specifically with Wiktionary. Some of the menu items have changed in the past five years, and a huge amount of content was added, but the overall structure – while more refined – is still the same.

The detailed analysis includes: the megastructure (the navigation menu items, each listed and briefly explained); the macrostructure (the arrangement of words, finding an entry by search or by browsing categories or by clicking hyperlinks); the microstructure (the composition of a lemma entry, the sections within the entry, the quality and content of each section); and the mesostructure (the system of hyperlinks, internal and external references, perhaps the most important advantage of an online dictionary). Two screenshots are provided: one for the Hungarian word “ablak” (“window”) from the Hungarian Wiktionary, and one for the English word “window” from the English Wiktionary. The examples chosen are similar in their level of detail to make the comparison valid.

The paper states that the biggest challenge of online collaborative dictionaries is the reliability of information. The content of printed dictionaries is created and reviewed by professionals. Online collaborative dictionaries can be edited by anyone. It is added, however, that even printed dictionaries contain inaccuracies, not to mention that the addition of new terminology can take years.

The conclusion is that the innovative nature of online dictionaries compared to traditional dictionaries is epoch-making, and their practical value is indisputable, not necessarily in content (although the quantity of processed information is enormous), but more in the hyperlinks (including audio files and images), ease of use, wide availability, and free access. Compared to printed dictionaries, they are dynamic: their content can in theory grow without limit and the information can be updated at any time.

“GLAWI, a free XML-encoded Machine-Readable Dictionary built from the French Wiktionary”

The paper[9] recaps some previous publications of the same authors and reports on the publication of yet another dataset extracted from Wiktionary, but one of unusual size. The authors, across six years, mapped six thousand templates of the French Wiktionary (Wiktionnaire) and implemented various mechanisms to standardize its content, which, together with some manual correction, allowed them to produce a machine-readable dictionary of over 1.3 million entries under free license.

According to the authors, the dataset can be used to easily produce specialised lexicons and thesauri superior not only to the rather neglected French WordNet but even to a monstre sacré like the digital Trésor de la langue française. In fact, they report that Wiktionnaire contains only sixty-five entries with contradictory, irreconcilable information. According to the authors, Wiktionary editors may want to adopt some of their standardizations and corrections, but need not be pushed to do so, because Wiktionary serves its purpose well by having few constraints and maximising participation, while standardization can be performed downstream.

Sadly, it’s hard to assess the added value provided by this effort, as the paper features no comparison to other efforts and proposals, such as DBpedia Wiktionary or Wikidata’s own proposal for a Wiktionary data mapping. However, it’s useful as confirmation of (the French) Wiktionary’s quality and as promotion/redistribution of its content.

“IWNLP: Inverse Wiktionary for Natural Language Processing”

This conference paper[10] reports on the more engineering-oriented IWNLP free software project. It is an XML dump parser which is at an earlier stage of development than GLAWI, and specifically focused on the German Wiktionary (unlike a predecessor, the “Java Wiktionary Library” known as JWKTL). From 400k entries of the 2015-04 dump, 74k words and 281k word forms were extracted, reaching higher accuracy than previous resources for the lemmatization of nouns but low accuracy for adjectives and verbs; a thesaurus has not been created yet. Interestingly, the authors made 200 edits to German Wiktionary entries in the process.
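The “inverse” idea behind such a lemmatizer, turning a lemma-to-forms listing into a form-to-lemma lookup, can be sketched as below. This is an illustration only: IWNLP’s actual data format and parsing are not described in this review, the function names are hypothetical, and the two tiny entries are hand-written examples rather than extracted data.

```python
from collections import defaultdict


def build_inverse_index(entries):
    """Invert {lemma: [inflected forms]} into {lowercased form: {lemmas}}."""
    index = defaultdict(set)
    for lemma, forms in entries.items():
        index[lemma.lower()].add(lemma)  # a lemma is also a form of itself
        for form in forms:
            index[form.lower()].add(lemma)
    return index


def lemmatize(index, word):
    """Return all candidate lemmas for a word form, or [] if it is unknown."""
    return sorted(index.get(word.lower(), ()))


# Hand-written sample entries for illustration.
entries = {
    "Haus": ["Hauses", "Häuser", "Häusern"],
    "gehen": ["gehe", "gehst", "geht", "ging", "gegangen"],
}
index = build_inverse_index(entries)
print(lemmatize(index, "Häuser"), lemmatize(index, "ging"))
```

Lowercasing the keys is a simplification; since German capitalizes nouns, a real lemmatizer would likely keep case (or part of speech) to separate otherwise identical forms.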

“knoWitiary: A Machine Readable Incarnation of Wiktionary”

This paper (pre-print?)[11] presents another attempt at producing an XML dump parser for Wiktionary superior to JWKTL. This effort focuses on a 2014 dump of the English Wiktionary, from which about 530k words and 550k meanings are extracted for Italian, about 580k and 700k respectively for English. However, there is no mention of code or dataset release, nor of whether the parser was an improvement on previous ones; DBpedia Wiktionary is not mentioned at all. The English WordNet is shown to cover only half of said terms, with lower comprehensiveness and small overlap. Wiktionary offers some unique strengths which allow novel applications: in particular information on etymology, compounding and word derivation.

In short, unclear reusability but one more point in the long list of papers showing that Wiktionary is a mature or superior competitor for most expert-built dictionaries, lexicons, thesauri etc.

“Zmorge: A German Morphological Lexicon Extracted from Wiktionary”

This conference paper[12] again features an extraction from the German Wiktionary. This time the objective is a German lexicon/finite state morphology analyser to replace Morphisto, an unfree German resource built on SMOR. Building upon an existing module (SLES), a fully automated extraction produces a SMOR grammar lexicon with about 70 thousand entries; quality is higher than in past work, which was based on raw text, because Wiktionary features information like part-of-speech, stem, gender or case. The lexicon’s results are assessed against a manually annotated resource and outperform the Morphisto lexicon, while the Stuttgart lexicon is still better by a few percentage points.

The precision achieved is 1.3 percentage points higher than it would be with a dump 15 months older, and most errors are simply caused by the lack of a certain word in the German Wiktionary; this suggests such a Wiktionary-based approach will soon overtake its unfree competitors in yet another field of linguistic resources. Datasets and code are published.

“Dbnary: Wiktionary as Linked Data for 12 Language Editions with Enhanced Translation Relations”

This conference paper[13] presents a free software (LGPL) tool[supp 2] to extract a lexical ontology from Wiktionary.[note 1] (See also our 2012 coverage of an earlier paper about the project: “Generating a lexical network from Wiktionary”)

Non-inflected terms in twelve languages are extracted from the respective Wiktionaries and linked by their relation (being translations of one another, being synonyms etc.). The authors claim their parsing is general enough to work in those twelve languages and to be robust to changes in markup, but it is not clearly explained how, and quality was not assessed at all.[note 2] The work can be considered a conversion from wikitext format to RDF of the most basic linguistic information in Wiktionary, interesting insofar as it is extensible to all languages, but the resulting dataset is not usable as is without further research.

“Observing Online Dictionary Users: Studies Using Wiktionary Log Files”

This paper[14] is based on the familiar pageviews data used by (see FAQ) for the German Wiktionary, possibly the only general-purpose dictionary for which such data is publicly available. Out of 350k entries, of which 200k were classified as German words, the authors work on a set of 56k which satisfy several criteria: being lemmas of the German corpus DeReKo, having more than 11 monthly visits and a sufficient definition. The excluded entries were checked and found to be mostly inflected forms, geographic and proper names and terminological nouns. High frequency in the corpus is confirmed to be associated with high pageviews/look-ups; a set of entries selected by corpus frequency is found to have more pageviews than a set of random entries (quite a weak finding).

This method tells us little about Wiktionary, as we’re not even told what portion of pageviews is covered by the set of entries in question. However, it’s useful to confirm some assumptions used in compiling traditional dictionaries. Conversely, Wiktionary covers nearly all the words a traditional dictionary would. The only useful finding for Wiktionary is that dozens of words of the basic German vocabulary (as compiled by the Goethe-Institut for B1[supp 3]) are still missing from this Wiktionary set. The list of red links should be placed on wiki.

The authors then attempt to prove that entries with more than one definition (“polysemic”) are more visited than entries with a single definition (“monosemic”), by noting that in groups of words with similar corpus frequency the “polysemic” entries are on average more visited than the “monosemic” entries. This reviewer’s statistical knowledge is insufficient to determine whether normalizing pageviews by corpus frequency would have been more reliable than this “parallelization strategy”. However, it’s natural for entries to grow in size and number of definitions proportionally to their number of visits, whatever the merit of such a growth, so this “result” is dubious.

Finally, the authors unsurprisingly show that some entries have bursts of visits far beyond their trend, linked to events and news.

In brief

“Multilingual Open Relation Extraction Using Cross-lingual Projection”

A short paper[15] on the extraction of semantic statements à la Wikidata (like “Ottawa”, “is capital of”, “Canada”) from free text, or open domain relation extraction. Text is translated from the source language to English, then existing English parsers for the purpose are used; Wikipedia in French, Hindi and Russian was used as an example source and the results were manually annotated to verify accuracy: 82%, 64% and 64% respectively. It’s not reported how wikitext was transformed into plain text. A dataset of samples in 60 languages was released under a free license, but accuracy is still far from that of Wikidata’s AI ingester, Kian (arguably a closed domain extractor, hence “easier”).

Other recent publications

A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.

  • “Methods in collaborative dictionaries”[16] Based on an examination of English and German Wiktionary. From the abstract: “We are particularly interested in the question to what extent they differ from the methods of expert lexicographers and how editorial dictionaries can leverage the user-generated data. …. For collaborative dictionaries, it is […] essential to encourage discussion, define transparent decision workflows, and continually motivate the authors. The large user communities provide a high coverage of language varieties, translations, neologisms, as well as personal and spoken language, which often lack corpus evidence. […] we see great potential in the cooperation between expert lexicographers and collaborative user communities.”
  • “How news media trigger searches and edits in Wikipedia”[17]
  • “Adding High-Precision Links to Wikipedia”[18] From the abstract: “… we study how to augment Wikipedia with additional high-precision links. We present 3W, a system that identifies concept mentions in Wikipedia text, and links each mention to its referent page. … Our experiments demonstrate that 3W can add an average of seven new links to each Wikipedia article, at a precision of 0.98.”
  • “Editorial Bias in Crowd-Sourced Political Information”[19] From the abstract: “By randomly assigning factually true but either positive or negative and cited or uncited information to the Wikipedia pages of U.S. senators, we uncover substantial evidence of an editorial bias toward positivity on Wikipedia: Negative facts are 36% more likely to be removed by Wikipedia editors than positive facts within 12 hours and 29% more likely within 3 days. Although citations substantially increase an edit’s survival time, the editorial bias toward positivity is not eliminated by inclusion of a citation. We replicate this study on the Wikipedia pages of deceased as well as recently retired but living senators and find no evidence of an editorial bias in either. Our results demonstrate that crowd-sourced information is subject to an editorial bias that favors the politically active.” (See also comments on the Wiki-research-l mailing list)


  1. Graham Hubbs: “Teaching Philosophy by Designing a Wikipedia Page” book chapter of Experiential Learning in Philosophy, edited by Julinna Oxley, Ramona Ile, Routledge 2015, ISBN 9781138927391, 222-227
  2. Maurizio Atzori and Carlo Zaniolo (2015). “Expressivity and Accuracy of By-Example Structured Queries on Wikipedia”. Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), 2015 IEEE 24th International Conference on: 239-244.
  3. Kun Xu, Yansong Feng, and Dongyan Zhao (2014). “Xser@QALD-4: Answering Natural Language Questions via Phrasal Semantic Parsing”. CLEF2014 Working Notes: 1260-1274.
  4. Iñiguez, Gerardo; János Török, Taha Yasseri, Kimmo Kaski, János Kertész (2014-09-24). “Modeling social dynamics in a collaborative environment”. EPJ Data Science 3 (1): 1-20. doi:10.1140/epjds/s13688-014-0007-z. ISSN 2193-1127. Open access
  5. A. Ali-Eldin, A. Rezaie, A. Mehta, S. Razroevy, S. Sjöstedt-de Luna, O. Seleznjev, J. Tordsson, and E. Elmroth. How will your workload look like in 6 years? analyzing wikimedia’s workload. In: Proceedings of the 2014 IEEE International Conference on Cloud Engineering (IC2E), pages 349-354, IEEE Computer Society, 2014. Reproduced in: Ahmed Ali-Eldin Hassan. Workload Characterization, Controller Design and Performance Evaluation for Cloud Capacity Autoscaling. PhD thesis, 2015, Department of Computing Science, Umeå University. PDF, p.77
  6. A. Papadopoulos, A. Ali-Eldin, J. Tordsson, K.-E. Årzén, and E. Elmroth. PEAS: A Performance Evaluation framework for Auto-Scaling strategies in cloud applications. “Submitted for Journal Publication.” Reproduced in: Ahmed Ali-Eldin Hassan. Workload Characterization, Controller Design and Performance Evaluation for Cloud Capacity Autoscaling. PhD thesis, 2015, Department of Computing Science, Umeå University. PDF
  7. A. Ali-Eldin, J. Tordsson, E. Elmroth, and M. Kihl. WAC: A Workload analysis and classification tool for automatic selection of cloud auto-scaling methods. “To be submitted”. Reproduced in: Ahmed Ali-Eldin Hassan. Workload Characterization, Controller Design and Performance Evaluation for Cloud Capacity Autoscaling. PhD thesis, 2015, Department of Computing Science, Umeå University. PDF
  8. Gaál, Péter (2010-12-21). “Online szótárak a Web 2.0 platformon – A Wikiszótár és a Wiktionary”. Magyar Terminológia (Journal of Hungarian Terminology) 3 (2): 251–268. doi:10.1556/MaTerm.3.2010.2.7. ISSN 2060-2774. 
  9. Franck Sajous and Nabil Hathout: GLAWI, a free XML-encoded Machine-Readable Dictionary built from the French Wiktionary
  10. Matthias Liebeck and Stefan Conrad: IWNLP: Inverse Wiktionary for Natural Language Processing. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), pages 414–418, Beijing, China, July 26-31, 2015. PDF
  11. Vivi Nastase and Carlo Strapparava. “knoWitiary: A Machine Readable Incarnation of Wiktionary”. FBK-irst, Trento, Italy.
  12. Rico Sennrich, Beat Kunz. “Zmorge: A German Morphological Lexicon Extracted from Wiktionary”. Dataset and code
  13. Gilles Serasset, Andon Tchechmedjiev. Dbnary: Wiktionary as Linked Data for 12 Language Editions with Enhanced Translation Relations. 3rd Workshop on Linked Data in Linguistics: Multilingual Knowledge Resources and Natural Language Processing, May 2014, Reykjavik, Iceland.
  14. Müller-Spitzer, Carolin; Sascha Wolfer, Alexander Koplenig (2015-02-10). “Observing Online Dictionary Users: Studies Using Wiktionary Log Files”. International Journal of Lexicography: 029. doi:10.1093/ijl/ecu029. ISSN 0950-3846.
  15. Manaal Faruqui and Shankar Kumar (2015). “Multilingual Open Relation Extraction Using Cross-lingual Projection”. Proceedings of NAACL. Also blog.
  16. Meyer, Christian M.; Iryna Gurevych (2014). “Methoden bei kollaborativen Wörterbüchern [Methods in collaborative dictionaries / Méthodes dans le domaine des dictionnaires collaboratifs]”. Lexicographica 30 (1): 187-212. doi:10.1515/lexi-2014-0007. ISSN 1865-9403.  Closed access (in German, with English abstract)
  17. Stefan Geiß, Melanie Leidecker, and Thomas Roessing: The interplay between media-for-monitoring and media-for-searching: How news media trigger searches and edits in Wikipedia. New Media & Society 1461444815600281, first published on August 21, 2015 DOI:10.1177/1461444815600281 Closed access
  18. Thanapon Noraset, Chandra Bhagavatula, Doug Downey: Adding High-Precision Links to Wikipedia. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 651–656, October 25-29, 2014, Doha, Qatar. PDF
  19. Kalla JL, Aronow PM (2015) Editorial Bias in Crowd-Sourced Political Information. PLoS ONE 10(9): e0136327. doi:10.1371/journal.pone.0136327
Supplementary references and notes:
  1. Maurizio Atzori and Carlo Zaniolo (2012). “SWiPE: Searching Wikipedia by Example”. Proceedings of the 21st World Wide Web Conference, WWW 2012, Lyon, France, April 16-20, 2012 (Companion Volume): 309–312.
Remarks and annotations:
  1. The wikitext in the XML dumps is accessed with the Bliki engine and parsed by dbnary to produce a LMF structure stored in RDF.
  2. Wiktionary interwikis, used by the authors, don’t give any information on words: they merely link entries with identical titles i.e. homographs.

This newsletter is brought to you by the Wikimedia Research Committee and The Signpost