This is not a very exciting title for a post, granted, but this little file, robots.txt, contains quite a bit of power, especially on the Wikimedia websites. The short directives in this file tell search engines like Google or Yahoo! which pages they should leave out when they spider Wikimedia content.
Many of the commands in robots.txt are there for technical reasons. For example, we do not want search engines to index dynamically-generated pages, such as the Search page, because this would put too much of a load on our servers.
However, we have also included some discussion pages in robots.txt. The issue here is not so much article content but rather all the bickering, flamewars, and name-calling that we often find on discussion pages.
Consider this one aspect: search engines are used constantly by employers hunting for information about prospective employees. Imagine a candidate being rejected because of an unanswered late entry in a year-and-a-half-old conversation telling Joe Q. Lastnamehere that he is a liar and con man and that his authority is fraudulent. You may believe that such an employer would be legally wrong to base a hiring decision on such a frail source, but people make these sorts of decisions all the time by using search engines.
Robots.txt already keeps search engines from spidering several types of discussion, including page deletion discussions on several wikis. By excluding those pages from search engines, we can keep the discussion on-wiki without broadcasting “non-notable” or “spammer” on every search. This has dramatically reduced the number of complaints our OTRS volunteers have received about these discussions.
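To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might read exclusion rules like these. It uses Python's standard urllib.robotparser module. The rules, the example.org host, the "ExampleBot" user agent, and the may_crawl helper are all made up for illustration; they are assumptions, not a copy of the live Wikimedia robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; these are NOT copied from the
# live Wikimedia robots.txt. One rule blocks a dynamically generated page,
# the other blocks a made-up family of deletion discussion pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki/Special:Search
Disallow: /wiki/Project:Deletion_discussions/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the rules above allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.org is a placeholder host, not a Wikimedia site.
    for url in (
        "https://example.org/wiki/Main_Page",
        "https://example.org/wiki/Special:Search",
        "https://example.org/wiki/Project:Deletion_discussions/Some_page",
    ):
        print(url, "allowed" if may_crawl(url) else "blocked")
```

Each Disallow line is matched as a path prefix, so a single line can cover a whole family of discussion subpages. Keep in mind that robots.txt is a request to crawlers, not an access control: the pages remain readable on-wiki.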
As some of our users have discovered, though, there is another hazard of search engines: user discussion pages. These pages often contain users’ real names, and often call those people “vandals” or “plagiarists” or “biased”. These can be as bad as deletion discussions, if not worse.
All projects should be aware of the potential hazards of leaving these pages open to spidering. It may be time to coordinate your language's namespaces so that you can prevent hazardous issues resulting from non-mainspace discussions about people. You can request that the developers add items to the robots.txt file by filing a bug at http://bugzilla.wikimedia.org.
Very truly yours,
Cary Bass, Volunteer Coordinator