Something Evil About Robots.txt I Didn’t Know
August 28, 2013
Quick background: A robots.txt file on your website tells search engines and other bots that obey the robots exclusion standard which files and folders they can and can’t index, or whether they can access the website at all.
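If you haven’t seen one, here’s a rough sketch of what a small robots.txt looks like and how a well-behaved bot would read it. The bot name BadBot and the /private/ folder are made up for illustration, and I’m using the parser that ships with Python’s standard library, which may not match every search engine’s behavior exactly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "BadBot" and "/private/" are made-up names.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text with Python's standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)

# BadBot is shut out of the whole site.
badbot_home = rules.can_fetch("BadBot", "/index.html")
# Every other bot falls through to the "*" group.
otherbot_home = rules.can_fetch("SomeOtherBot", "/index.html")
otherbot_private = rules.can_fetch("SomeOtherBot", "/private/notes.html")

print("BadBot may fetch /index.html:", badbot_home)
print("SomeOtherBot may fetch /index.html:", otherbot_home)
print("SomeOtherBot may fetch /private/notes.html:", otherbot_private)
```

Keep in mind this only describes bots that choose to obey the file. Nothing in robots.txt actually stops a bot that ignores it.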
I’ve been working on the robots.txt file at work for the last few days.* Once the file listed the bots I wanted to exclude, I decided to run it through a robots.txt validator.
Boy, did I learn a few things. It turns out that you should put the exclusions for specific robots at the top and the directory and file exclusions below them. There were also a few minor formatting issues that I’m not sure really mattered.
There was one, however, that was a shock. Let’s say you’ve got a folder called “video”. There’s a huge difference between these two disallow statements:
Disallow: /video/
Disallow: /video
The first example with a trailing slash tells robots not to index anything in the video directory. So far so good. The second example without a trailing slash tells robots not to index anything in the video directory, or any file at the root level with video at the beginning of the filename.
Without the trailing slash, you would exclude /video.html, /videoplayer.aspx, you name it. Anything at the same level of the directory structure whose name begins with video. You can get into trouble in a hurry if you leave the trailing slash off of the disallow directive.
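Here’s a quick way to see the difference for yourself. This is a minimal sketch using the robots.txt parser in Python’s standard library, which does simple prefix matching on the path. The helper name is_blocked and the file /video/clip.mp4 are my own inventions for the example, and real crawlers may add their own extensions on top of this behavior.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(disallow_path, url_path):
    """Return True if a single 'Disallow: <disallow_path>' rule blocks url_path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return not parser.can_fetch("*", url_path)


# Hypothetical paths on a site that has a "video" folder.
paths = ["/video/clip.mp4", "/video.html", "/videoplayer.aspx", "/about.html"]

print(f"{'path':<20} {'/video/':<10} {'/video':<10}")
for path in paths:
    with_slash = is_blocked("/video/", path)
    without_slash = is_blocked("/video", path)
    print(f"{path:<20} {str(with_slash):<10} {str(without_slash):<10}")
```

With this parser, both rules block /video/clip.mp4, but only the rule without the trailing slash also blocks /video.html and /videoplayer.aspx.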
* What prompted the work was all of the bots that kept showing up in our error files. One of the worst? The Internet Archive Bot that collects pages for the Internet Archive. It would generate hundreds of errors a day. When I looked around at bot ban lists the IA bot showed up over and over. You’d think Internet Archive would have worked the bugs out of their bot by now.