Best robots.txt for Drupal Hosting
Every day, millions of people use Google Image Search to find pictures, products, and people. If you're using Drupal, chances are you're not getting any of this traffic.
Drupal's robots.txt file contains a major mistake. Amazingly, the mistake has been there for years, and very few people seem to know about it.
Take a look at this line from the default Drupal robots.txt file. Can you spot the problem?
Disallow: /sites/
By default, every image you upload to your Drupal site gets stored somewhere inside the "sites" directory. And, by default, Drupal is blocking every search engine from looking inside your "sites" directory. In other words, your images aren't getting indexed!
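To see the effect for yourself, here is a minimal sketch using Python's standard urllib.robotparser module. The two-line robots.txt below is a cut-down stand-in, not the full Drupal file; only the "Disallow: /sites/" rule comes from the default file. The example.com URLs and the sites/default/files path are assumptions chosen for illustration, and is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Cut-down stand-in for Drupal's robots.txt. Only the "Disallow: /sites/"
# rule is taken from the article; the rest of the real file is omitted.
ROBOTS_TXT = """\
User-agent: *
Disallow: /sites/
"""


def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical URLs: an uploaded image under "sites" and an ordinary page.
image_url = "http://example.com/sites/default/files/photo.jpg"
page_url = "http://example.com/node/1"

print("image allowed:", is_allowed(ROBOTS_TXT, image_url))
print("page allowed:", is_allowed(ROBOTS_TXT, page_url))
```

With this rule in place, the image URL is reported as blocked while the ordinary page is not, which is the behavior described above.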
If you've got a Drupal site with images you want other people to find, this is a serious problem. (I discovered this by accident last week, when I noticed none of the images on my Photoshop Text Effects site were getting indexed by Google.)
To illustrate just how common this problem is, let's take a quick look at Dries Buytaert's blog. Dries is, of course, the creator of Drupal, but he's also a very good photographer. In fact, Dries has uploaded thousands of photos to his blog, including hundreds of pictures from DrupalCon and dozens of insightful graphs and charts. But how many of these images has Google actually indexed?
Only 13. Unfortunately, Dries's robots.txt file contains the standard "Disallow: /sites/" line.
If Dries is affected, you probably are, too. Running an e-commerce site? Your entire product line could be missing from Google Image Search. Have a photography blog? Yahoo and Bing are probably ignoring everything you post.
If no one can search for your images, you're turning away traffic. And not just image search traffic: high-quality, indexable images are a key feature of any high-ranking site. If your images aren't indexable, you're making a major SEO mistake.
Even worse, this problem doesn't just affect images. PDFs, Flash files, text documents, and other uploads all go into the same "sites" folder. Google knows how to index these files, but your robots.txt file is stopping Googlebot cold.
Fortunately, the solution is easy: Just remove "Disallow: /sites/" from your robots.txt file. The file is located in your main Drupal directory and can be edited with a standard text editor. Google should pick up the changes within a few days and start indexing your files shortly after.
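If you'd rather script the change than edit by hand, the removal can be sketched in a few lines of Python. This is only an illustration: remove_sites_rule is a hypothetical helper, and the sample text is a made-up fragment (the /modules/ and /themes/ rules are placeholders, not a claim about your file). Back up your real robots.txt before overwriting it.

```python
def remove_sites_rule(robots_txt: str) -> str:
    """Return robots.txt text with any "Disallow: /sites/" rule removed."""
    kept = []
    for line in robots_txt.splitlines():
        # Ignore trailing comments, then split the rule into field and value.
        rule = line.split("#", 1)[0]
        field, _, value = rule.partition(":")
        if field.strip().lower() == "disallow" and value.strip() == "/sites/":
            continue  # drop the rule that hides uploaded files
        kept.append(line)
    return "\n".join(kept) + "\n"


# Made-up sample; a real file contains more rules than this.
sample = (
    "User-agent: *\n"
    "Disallow: /modules/\n"
    "Disallow: /sites/\n"
    "Disallow: /themes/\n"
)

print(remove_sites_rule(sample))
```

Every other rule is left untouched, so crawlers are still kept out of whatever else the file lists.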
Fixing the robots.txt file should be a priority for the next Drupal point release. This is a major problem with a simple solution. Fortunately, someone has already created an issue on Drupal.org. Unfortunately, it's been unresolved for over a year. Let's change that.
Update: A fix for Drupal 6 was released on December 12th. If you're running Drupal 6.20 or later (including Drupal 7), this issue no longer affects you.
Posted by John on 2010-08-30