robots.txt

Robot Icon, courtesy of Wikimedia Commons

This is not a very exciting title for a post, granted, but this little file contains quite a bit of power, especially on the Wikimedia websites. The short lines of instructions found in this file tell search engines like Google or Yahoo! which pages they should leave out when they spider Wikimedia content.

Many of the rules in robots.txt are there for technical reasons. For example, we do not want search engines to index dynamically generated pages, such as the Search page, because crawling them would put too much of a load on our servers.
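For the curious, here is a rough sketch of how a rule like this behaves, using the robots.txt parser that ships with Python. The paths and the crawler name "ExampleBot" are assumptions chosen for illustration; this is not a copy of the live Wikimedia file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only. These paths are assumptions made for this
# example, not the contents of the real Wikimedia robots.txt.
RULES = """\
User-agent: *
Disallow: /w/
Disallow: /wiki/Special:Search
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str) -> bool:
    """Return True if a crawler that obeys robots.txt may fetch the path."""
    # "ExampleBot" is a made-up crawler name; it falls under "User-agent: *".
    return parser.can_fetch("ExampleBot", path)


print(allowed("/wiki/Robot"))                # an ordinary article page
print(allowed("/wiki/Special:Search"))       # the dynamically generated Search page
print(allowed("/w/index.php?search=robot"))  # a dynamic script URL
```

A Disallow line matches by path prefix, so a single line can cover every URL that begins with the same path. Note that robots.txt is only a request: well-behaved crawlers honor it, but nothing forces them to.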

However, we have also included some discussion pages in robots.txt. The issue here is not so much article content but rather all the bickering, flamewars, and name-calling that we often find on discussion pages.

Consider this one aspect: Search engines are used constantly by employers hunting for information about prospective employees. Imagine a candidate being rejected because of an unanswered late entry to a year-and-a-half-old conversation telling Joe Q. Lastnamehere that he is a liar and con man and his authority is fraudulent. You may believe that such an employer would be legally wrong to base a hiring decision on such a frail source, but people make these sorts of decisions all the time by using search engines.

Robots.txt already keeps search engines from spidering several types of discussion, including page deletion discussions on several wikis. By excluding those pages from search engines, we can keep the discussion on-wiki without broadcasting “non-notable” or “spammer” on every search. This has dramatically reduced the number of complaints our OTRS volunteers have received about these discussions.
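As a loose illustration of this kind of exclusion, the sketch below builds rules from a list of path prefixes and checks that a discussion page is hidden while the article itself stays visible. The prefix, the page names, and the crawler name are all hypothetical; each wiki uses its own namespace names.

```python
from urllib.robotparser import RobotFileParser


def render_rules(prefixes):
    """Build robots.txt lines asking all crawlers to skip each path prefix."""
    return ["User-agent: *"] + [f"Disallow: {prefix}" for prefix in prefixes]


# Hypothetical prefix for deletion discussions, assumed for this example.
hidden_prefixes = ["/wiki/Wikipedia:Articles_for_deletion/"]

parser = RobotFileParser()
parser.parse(render_rules(hidden_prefixes))


def is_crawlable(path):
    """Return True if a crawler that obeys the rules may fetch the path."""
    # "ExampleBot" is a made-up crawler name.
    return parser.can_fetch("ExampleBot", path)


# The discussion page is excluded; the article is not.
print(is_crawlable("/wiki/Wikipedia:Articles_for_deletion/Joe_Example"))
print(is_crawlable("/wiki/Joe_Example"))
```

The discussion remains readable to anyone browsing the wiki; the rule only asks search engines not to crawl it.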

As some of our users have discovered, though, there is another hazard of search engines: user discussion pages. These pages often contain users’ real names, and often call those people “vandals” or “plagiarists” or “biased”. These can be as bad as deletion discussions, if not worse.

All projects should be aware of the potential hazards of leaving these pages open to spidering. It may be time to coordinate your language namespaces so that you can head off problems arising from non-mainspace discussions about people. You can request that the developers add items to the robots.txt file by filing a bug at http://bugzilla.wikimedia.org.

Very truly yours,

Cary Bass, Volunteer Coordinator

Categories: Technology, Wikipedia

4 Comments on robots.txt

Dcoetzee 6 years

There have been recent proposals to noindex all talk pages on the English Wikipedia – I’m opposed to this on the general principle that it’s best to let the search engines index as much as possible and sort out relevance on their own (determining relevance of webpages is what search engines do, after all). As long as talk pages aren’t misleadingly portrayed as sources of factual information, rather than opinions of individuals, I don’t think there’s a problem – and a forum interface for talk pages will go a long way towards cementing that impression.

David Gerard 7 years

@thewub: If you follow wikitech-l lately, you’ll see there’s a lot of active ongoing work on the internal search – it’s noticeably better of late.

the wub 7 years

Perhaps the Foundation could concentrate some developer time and funding on improving our own search system. This would reduce reliance on external sites, and allow broader exclusion of “working” pages from Google etc.

NonvocalScream 7 years

Very informative. Thank you.
