Photo by Patrick Tomasso via Unsplash, published prior to 5 June 2017, CC0.

On the Spanish Wikipedia, at the top of the list of most-cited sources in articles you’ll find: a “Catalog of Fishes”, a dictionary of minor planets, an encyclopedia of Argentinian films, a field guide to the songbirds of South America, and an atlas of Spanish popular culture.

Citations are the foundation of Wikipedia’s reliability: they trace the connection between content added by our community of volunteer contributors and its sources. For readers, citations provide a mechanism to validate and check for themselves that what Wikipedia says is sound and trustworthy: they act as a gateway towards a broader ecosystem of reliable knowledge. In an effort to spearhead more research on where Wikipedia gets its facts from, and to celebrate Open Citations Month, we asked ourselves: what are the most cited sources across all of Wikipedia’s language editions?

To answer this question, we published a dataset of every citation referencing an identifier across all 297 Wikipedia language editions. The dataset breaks down the sources cited in each language by identifier: a PMID or PMC ID (for articles in the biomedical literature), a DOI (for scholarly papers), an ISBN (for book editions), or an arXiv ID (for preprints).

What’s in the data?

The full dataset, extracted from the March 1, 2018 Wikipedia content dumps, includes a total of 15,693,732 records and shows important variations across languages in the kinds of sources volunteer contributors cite. The dataset only includes citations that carry an identifier, so it does not reflect every citation on Wikipedia: many more publications are cited without an identifier and do not appear among these records (our next analysis will be able to tell you what percentage of total citations this dataset represents).

What types of sources are cited the most by language?

On average, the majority of publications cited by identifier across Wikipedia language editions are books. German Wikipedia, one of the top 5 language editions by number of articles, relies primarily on information sourced to book editions, with 87% of citations in the ISBN category. Conversely, English Wikipedia draws equally on scholarly publications and books, while Arabic Wikipedia uses more scholarly publications than books.

Preprint repositories such as arXiv represent a minority of publications, with less than 2% of citations in each language, and they are most prominently cited in Arabic Wikipedia. At least 5% of publications in Arabic and English Wikipedia are open-access biomedical publications from PubMed Central.
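To make the breakdown above concrete, here is a minimal Python sketch of how per-language shares by identifier type could be computed. The sample records, the field order (language, identifier type, identifier value) and the `type_shares` helper are all hypothetical, made up for illustration. They are not the dataset's actual schema, nor the method used to produce the figures in this post.

```python
from collections import Counter, defaultdict

# Hypothetical sample records: (language, identifier type, identifier value).
# Both the layout and the placeholder values are assumptions for illustration.
records = [
    ("de", "isbn", "isbn-1"),
    ("de", "isbn", "isbn-2"),
    ("de", "doi", "doi-1"),
    ("en", "isbn", "isbn-3"),
    ("en", "doi", "doi-2"),
    ("ar", "doi", "doi-3"),
    ("ar", "pmc", "pmc-1"),
    ("ar", "arxiv", "arxiv-1"),
]


def type_shares(rows):
    """Return {language: {identifier type: share of that language's citations}}."""
    counts = defaultdict(Counter)
    for lang, id_type, _ in rows:
        counts[lang][id_type] += 1

    shares = {}
    for lang, by_type in counts.items():
        total = sum(by_type.values())
        shares[lang] = {t: n / total for t, n in by_type.items()}
    return shares


shares = type_shares(records)
for lang in sorted(shares):
    # Print each language's shares as rounded fractions, sorted by type name.
    print(lang, {t: round(s, 2) for t, s in sorted(shares[lang].items())})
```

With real data, the same grouping would tell you, for example, what fraction of a language edition's identifier-based citations fall in the ISBN category.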

How fast are citations growing by language?

If we look at the percentage of total citations added over time, we note that some languages such as Arabic and Spanish are on a steady growth trajectory as of early 2018, while the general trend (black line) is flattening. Since the number of articles across all languages continues to grow, this suggests that in some languages the rate of citation is slowing down.
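As a rough illustration of the measure described above, the Python sketch below computes, for one language, the cumulative percentage of its citations added by the end of each year. The `added` list of (language, year) pairs and the `cumulative_share` helper are hypothetical. We assume each record can be tied to the year the citation was added, and the toy values are not taken from the dataset.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical (language, year the citation was added) pairs.
added = [
    ("es", 2015), ("es", 2016), ("es", 2017), ("es", 2017),
    ("en", 2015), ("en", 2015), ("en", 2015), ("en", 2016),
]


def cumulative_share(rows, lang):
    """Return {year: % of the language's citations added up to and including that year}."""
    per_year = Counter(year for row_lang, year in rows if row_lang == lang)
    if not per_year:
        return {}  # no citations recorded for this language
    years = sorted(per_year)
    running = list(accumulate(per_year[y] for y in years))
    total = running[-1]
    return {y: 100 * count / total for y, count in zip(years, running)}


print(cumulative_share(added, "es"))
print(cumulative_share(added, "en"))
```

A curve that is still rising steeply in its last years corresponds to a language on a growth trajectory, while one that reaches most of its total early corresponds to a flattening trend.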

How often are sources cited and reused across articles and languages?

There are 4.5 million unique sources in the dataset. While every source is cited 3.5 times on average, the vast majority of sources in this dataset are used fewer than 500 times across wikis. Only nine “super publications” are used more than 10,000 times.
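The average quoted above follows from the two totals given in this post, and the same kind of counting can be sketched in a few lines of Python. The toy `citations` list and the `sources_used_more_than` helper are hypothetical and for illustration only. Only the two totals come from the text.

```python
from collections import Counter

# Totals quoted in this post.
TOTAL_RECORDS = 15_693_732
UNIQUE_SOURCES = 4_500_000  # "4.5 million", as rounded in the post

# Average number of citations per unique source.
print(round(TOTAL_RECORDS / UNIQUE_SOURCES, 1))  # about 3.5

# Hypothetical toy data: one entry per citation, keyed by a source identifier.
citations = ["src-a"] * 5 + ["src-b"] * 2 + ["src-c"]

# Count how many times each source is cited.
uses = Counter(citations)
average_uses = sum(uses.values()) / len(uses)


def sources_used_more_than(use_counts, threshold):
    """Return the sources cited strictly more than `threshold` times, sorted by name."""
    return sorted(s for s, n in use_counts.items() if n > threshold)


print(uses.most_common(2))
print(sources_used_more_than(uses, 2))
```

On the full dataset, a threshold of 10,000 would pick out the handful of “super publications”, and `most_common` would give a ranking of the most cited sources.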

What are the most cited sources?

Unsurprisingly, Wikipedians love reference works. The top 10 sources by citation across every Wikipedia language are all reference books or scientific articles describing large collections. Many of these publications have been cited by Wikipedians across large series of articles using powerful bots and automated tools.

  1. Updated world map of the Köppen-Geiger climate classification: 2,830,341 citations []
  2. Prediction of Hydrophobic (Lipophilic) Properties of Small Organic Molecules Using Fragment Methods: An Analysis of AlogP and CLogP Methods: 21,350 citations []
  3. The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC): 20,247 citations []
  4. The de Vaucouleurs Atlas of Galaxies: 19,068 citations [ISBN: 9780521820486]
  5. The Complete New General Catalogue and Index Catalogues of Nebulae and Star Clusters by J. L. E. Dreyer: 19,060 citations [ISBN: 9780933346512]
  6. Galaxies and How to Observe Them: 19,058 citations [ISBN: 9781852337520]
  7. A Concise History of Romania: 15,597 citations [ISBN: 9780521872386]
  8. Catalog of Fishes, California Academy of Sciences: 11,980 citations [ISBN: 0940228475]
  9. Dictionary of Minor Planet Names: 10,651 citations [ISBN: 9783540002383]
  10. National and religious composition of the population of Croatia, 1880-1991: By settlements: 8,230 citations [ISBN: 9789536667079]

Why does this data matter?

First off, it allows us to analyze, at scale, where Wikipedia gets its information from. Understanding the provenance of the information Wikipedians use also lets us lift the veil on its gaps, that is, the types of sources, languages, and perspectives that are not represented, which in turn can inform community efforts to improve coverage in underserved content areas. The data can also be reused by partners such as publishers, scholarly societies, and research projects to better understand how their works are used and found by the public.

Because the dataset is freely licensed under CC0, we hope researchers and partners will reuse and analyze this corpus for trends of interest to their fields and research projects. Critically, a list of the most cited sources also lets partners try to make more, or most, of them accessible to readers. We can help drive digitization and open access efforts geared towards making the most commonly cited sources free to access online.

Finally, with citations as an indicator of factual currency, knowing what works are supporting our shared knowledge gives us a glimpse into popular understanding–both how we know what we know, and what we know most about.

Reactions from friends and partners

Since this dataset should empower others to do their own analysis and incorporate insights from Wikipedia’s citations, we asked some of our friends and partners what they thought.

“Wikipedia plays a crucial role in democratizing access to knowledge and enriching our understanding of the world,” said Heather Joseph, Executive Director of SPARC (the Scholarly Publishing and Academic Resources Coalition). “This new citation dataset provides a deeper level of transparency and trustworthiness to its content, and opens exciting new paths for people to learn, innovate and follow their curiosity.”

Geoffrey Bilder, Director of Strategic Initiatives at Crossref, said: “We are delighted to see the Wikimedia Foundation release this dataset that shows which research is most often cited in Wikipedia articles. Over the past ten years, we’ve been monitoring the rapid growth of links between Wikipedia and research outputs. It appears that Wikipedia is increasingly taking on the role of the ‘review article’ and is set to become the de facto starting place for the researchers exploring subjects they are unfamiliar with. This means Wikipedia has become a vital gateway that drives users to published research articles and, as such, it has become one of the top referrers of DOIs in the world.”

Brewster Kahle, digital librarian, looked ahead: “At the Internet Archive, we believe in the value of verifiable information. We plan to use the citation data that Wikimedia Foundation has released to inform our digitization priorities, making the most important books available to researchers worldwide. We envision a future where every citation and reference in Wikipedia is a live link into a trusted repository like the Internet Archive, empowering every Wikipedia user to fact-check and verify the information they encounter online.”

“This dataset is a powerful new way to track how knowledge moves from the leading edge of scientific research into the broader collective minds of humankind as a whole,” said Jason Priem, co-founder of ImpactStory. “We’ll use it to help us fine-tune our efforts with Unpaywall in our goal to make scholarly papers open and accessible to everyone.”

What’s next?

This work extends and complements data first released in 2015, created with a Python library designed by Aaron Halfaker and extended by Bahodir Mansurov. If you are planning to use this dataset, we encourage you to cite it using its canonical reference, “Citations with identifiers in Wikipedia” (hosted on FigShare).

This data release is only a first step among many to come in understanding how citations are used on Wikipedia. In the next few months, we’ll focus on additional analyses of citations in Wikimedia projects, to understand how they are accessed by readers, since we care about the public being able to verify information that Wikipedia cites. We’ll also continue to work with partners to promote the use of this data and deepen our research into citation practices on Wikipedia.

As Wikipedia becomes ever more ingrained in the fabric of the world’s knowledge–as a resource that aims to provide fact-based, neutral information that people can trust–we need to understand and cultivate our citation culture and make sure we can constantly vet it for biases, gaps and omissions.

Miriam Redi, Research Scientist
Dario Taraborelli, Director, Head of Research
Jake Orlowitz, The Wikipedia Library
Ben Vershbow, Lead Programs Manager (libraries, education, cultural heritage)
Wikimedia Foundation

You can also read this post on our Medium publication. The graphs in this post are by Miriam Redi/Wikimedia Foundation, and freely licensed under CC BY-SA 4.0.