Many webmasters are unfamiliar with the robots.txt file. A robot in this sense is a “spider”: the program search engines use to crawl and index sites on the internet.
Basically, a spider will crawl a site and index every page of that site that it is allowed to. Once that’s complete, the robot moves on to external links and continues its indexing. This is how search engines find other sites and build such a large index. They depend on sites linking to other relevant websites, which link to others, and so on.
When a search engine’s robot (or spider) hits a site, the first thing it looks for is a robots.txt file. Keep this file in the site’s root directory (for example, http://www.example.com/robots.txt) so that the robot can find it and use it correctly.
This file tells a robot what it may crawl. This system is called “The Robots Exclusion Standard”.
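Pages that are disallowed in your robots.txt file won’t be crawled by robots that follow the standard. Keep in mind that a disallowed URL can still appear in search results if other sites link to it, because the file controls crawling rather than indexing.
The robots.txt format is very simple. Each record consists of a “User-agent:” line followed by one or more “Disallow:” lines.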
The “User-agent:” line refers to the robot. It can also be used to refer to all robots.
Example of how to disallow all robots:
To keep all robots out of a certain folder on a site, we’ll use this:
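A minimal sketch of such a file, using the wildcard and the /cgi-bin/ folder discussed in this section:

```text
User-agent: *
Disallow: /cgi-bin/
```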
For the User-agent line we used the wildcard “*”, which tells all robots to follow this rule. Once a spider reads this, it knows that /cgi-bin/ should not be crawled at all. This includes every file and folder inside it.
Naming specific bots is also allowed, and it is often useful for site owners who use doorway pages or other search engine optimization techniques. It lets a site owner tell each spider separately what it may and may not crawl.
Here is an example of blocking Google from the /cgi-bin/ folder:
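A minimal sketch of this rule, with Googlebot as the user-agent name in place of the wildcard:

```text
User-agent: Googlebot
Disallow: /cgi-bin/
```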
This time, in the User-agent line we used Googlebot instead of the wildcard “*”. This lets Google’s spider know we’re talking to it specifically, and that it should not crawl this folder.
White Space & Comments
Comments and leading white space are both allowed in a robots.txt file, but not every robot handles them the same way. When using a comment, it is safest to put it on its own line.
User-agent: googlebot #Google Robot
In the line above, the comment shares a line with the directive: a # followed by the comment text. This is valid and will be accepted in most cases, but some robots may not handle it. The safer form puts the comment on its own line above the directive.
If a robot does not support the same-line form, it may read the value as “googlebot#GoogleRobot” instead of “googlebot” as originally intended.
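To illustrate the two comment placements described above, here is a sketch showing both. These are fragments rather than a complete file, and the “Example 1” and “Example 2” labels are added only for illustration:

```text
# Example 1: comment on the same line as the directive
User-agent: googlebot #Google Robot

# Example 2: comment on its own line (the safer form)
#Google Robot
User-agent: googlebot
```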
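White space here means a blank space in front of a line. It does not comment the line out: parsers that follow the standard ignore leading white space, but simpler ones may not recognize an indented line. It is allowed but not recommended.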
Common Robot Names
Here are a few of the top robot names:
- Googlebot – Google.com
- YandexBot – Yandex.ru
- Bingbot – Bing.com
These are just a few common robots that will hit a site at any given time.
The following examples are commonly used commands for robots.txt files.
The following allows all robots to crawl an entire site. Notice the “Disallow:” line is blank; this tells robots that nothing is off limits.
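A file matching this description would look roughly like this:

```text
User-agent: *
Disallow:
```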
The following tells all robots not to crawl anything on a site. We used “/” in the “Disallow:” line to put the entire contents of the root folder off limits.
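As a sketch, the only change from a fully open file is the single “/” after “Disallow:”:

```text
User-agent: *
Disallow: /
```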
The following tells all robots (specified by the wildcard in the “User-agent:” line) not to crawl the cgi-bin, images, and downloads folders. It also blocks the admin.php file, which is located in the root directory. Files and folders in subdirectories can be listed in the same way.
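A sketch of such a file, assuming the three folders sit directly under the site root (adjust the paths if yours live elsewhere):

```text
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /downloads/
Disallow: /admin.php
```

Each path gets its own “Disallow:” line.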
This tells Googlebot not to crawl the wp-admin folder.
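A minimal sketch of that rule, assuming wp-admin sits directly under the site root:

```text
User-agent: Googlebot
Disallow: /wp-admin/
```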
More information on robots.txt files can be found on Robotstxt.org. Most major sites use a robots.txt file. To find out whether a site uses one, type in its domain and add /robots.txt to the end. If the file exists, it is displayed in plain text so anyone can read it.
Remember that the robots.txt file isn’t mandatory. It’s mainly used to tell spiders what to crawl and what not to crawl. If everything on a site may be crawled, a robots.txt file isn’t needed.