Rates of confirmed plagiarism among articles written by different groups of users, including both blatant plagiarism and subtler close paraphrasing

Adding plagiarism to an article is one of the quickest ways to make a Wikipedian angry. It undermines the integrity of Wikipedia — contributors only have the right to release their own work under our free license — and it takes a lot of work to clean up. And as a community of writers, we take original authorship very seriously.

The Wikipedia Education Program helps professors run Wikipedia assignments, where students improve Wikipedia as part of their class. (Want to get involved as an instructor, or as a volunteer to help classes get started? Get in touch.) And when student editors plagiarize in their Wikipedia contributions, no one is happy.

To better understand the problem of plagiarism, across Wikipedia and among student editors in particular, my team at the Wikimedia Foundation recently did a small research project. We identified English Wikipedia articles by student editors in the U.S. and Canada editions of the Wikipedia Education Program, as well as articles by a set of other editors who were statistically similar to the students, new editors from different years, and veteran Wikipedians. Then we worked with a company called TaskUs to put each of the articles through a commercial plagiarism checker. The first results we got showed shockingly high rates of plagiarism for every group. But the majority of these were cases of other sites copying Wikipedia, so we went through manually to confirm which ones were actually plagiarism. (It’s amazing where you’ll find the work of Wikipedia editors, across the web and even in print sources.)
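For readers who like to see the arithmetic, here is a rough sketch of why the raw checker results and the confirmed results can differ so much. It is not the code we used, and the records, group names, and verdict labels are made-up toy values, not data from the study. It only illustrates the two-step idea: the checker flags an article, and a manual review decides whether the flag is real plagiarism or just another site copying Wikipedia.

```python
from collections import defaultdict

# Hypothetical toy records, NOT data from the study.
# Each record: (editor group, flagged by checker?, manual review verdict).
# The verdict is None when the checker did not flag the article.
records = [
    ("student", True, "plagiarism"),
    ("student", True, "mirror_of_wikipedia"),
    ("student", False, None),
    ("student", False, None),
    ("control", True, "plagiarism"),
    ("control", True, "mirror_of_wikipedia"),
    ("control", True, "plagiarism"),
    ("control", False, None),
]


def rates(records, count_if):
    """Share of articles per group for which count_if(flagged, verdict) is true."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, flagged, verdict in records:
        totals[group] += 1
        if count_if(flagged, verdict):
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Raw rate: everything the checker flagged, including sites that copied Wikipedia.
flagged_rates = rates(records, lambda flagged, verdict: flagged)

# Confirmed rate: only flags that manual review judged to be plagiarism.
confirmed_rates = rates(
    records, lambda flagged, verdict: flagged and verdict == "plagiarism"
)

print("flagged:", flagged_rates)
print("confirmed:", confirmed_rates)
```

In this toy example the confirmed rate for each group is lower than the flagged rate, because flags caused by other sites copying Wikipedia are dropped after review.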

We found in the end that for new articles by new users during the years 2006, 2009, and 2012, the rate of confirmed plagiarism was 10–12 percent. For new articles by student editors, it was 5 percent, while for the control group of non-student editors who had similar editing patterns, the confirmed plagiarism rate in their new articles topped 13 percent. We found higher rates of plagiarism in articles expanded by student editors: around 8.5 percent. We have no control group to compare against for the expanded articles, but we would expect higher rates of plagiarism in expanded articles in general, since there are fewer barriers to expanding an article on English Wikipedia than to creating a new one. We also looked at plagiarism rates (for new and expanded articles) among the early contributions by admins, as well as by the most prolific editors who are not admins. For both of those groups, we found rates around 3 percent, some of which was actually added originally by others and then built upon by the now-experienced editors.

These numbers aren’t perfect, and there’s still much we don’t know about plagiarism on Wikipedia. (On the research page, you can also check out the details of the project, see the caveats of the methodology, and download the raw data.) For this study, we’re not sure just how much plagiarism slipped through without being detected, nor whether the types of sources plagiarized by student editors were more likely to slip through. But it gives us a basic idea of the prevalence of plagiarism among new editors.

For student editors in particular, because we get the chance to provide more structured guidance and training than with the typical newcomer, we think we can do better. Based on what we found in this plagiarism research, we’ve created a new video for the student training modules that explains what plagiarism is, why it’s bad for Wikipedia, and what happens when editors get caught plagiarizing.

The new plagiarism tutorial video

Sage Ross
Online Communications, Wikipedia Education Program