The sad reality is that most webmasters have no idea what a robots.txt file is. A robot in this sense is a “spider.” It’s what search engines use to crawl and index websites on the internet.
A spider will crawl a site and index all the pages (that are allowed) on that site. Once that’s complete, the robot will then move on to external links and continue its indexing. This is how search engines find other websites and build such an extensive index of sites. They depend on websites linking to relevant websites, which link to others and so on.
When a search engine (or robot, or spider) hits a site, the first thing it will look for is a robots.txt file. Remember to keep this file in your root directory.
Keeping it in the root directory will ensure that the robot will be able to find the file and use it correctly. The file will tell a robot what to crawl and what not to crawl. This system is called “The Robots Exclusion Standard.”
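Because the file always lives in the root directory, its address can be worked out from any page on the site. Here is a minimal Python sketch of that lookup; the helper name robots_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post.html?id=7"))
# https://example.com/robots.txt
```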
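Pages that you disallow in your robots.txt file won’t be crawled by spiders that honor the file. In most cases they won’t be indexed either, although a search engine can still list a disallowed URL, without its content, if other sites link to it.
The robots.txt format is specific to this file, but it’s very simple. At a minimum, it consists of a “User-agent:” line followed by one or more “Disallow:” lines.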
The “User-agent:” line refers to the robot. It can also be used to refer to all robots.
An Example of How to Disallow All Robots:
To disallow all robots from indexing a particular folder on a site, we’ll use this:
User-agent: *
Disallow: /cgi-bin/
For the User-agent line, we used a wildcard “*” which tells all robots to listen to this command. So once a spider reads this, it will then know that the /cgi-bin/ should not be indexed at all. This will include all folders contained in it.
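One way to see how a rule like this is read is to feed it to Python's standard urllib.robotparser module. This is only a sketch: the bot name AnyBot, the helper is_allowed, and the example.com URLs are made up, and a real search engine's parser may differ in the details.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# The folder and everything beneath it is off limits; other pages are not.
blocked = is_allowed(RULES, "AnyBot", "https://example.com/cgi-bin/form.cgi")
nested = is_allowed(RULES, "AnyBot", "https://example.com/cgi-bin/tools/run.pl")
open_page = is_allowed(RULES, "AnyBot", "https://example.com/index.html")
print(blocked, nested, open_page)  # False False True
```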
Specifying certain bots is also allowed and in most cases very useful to users that utilize doorway pages or other ways of search engine optimization. Listing individual bots will allow a site owner to tell specific spiders what to index and what not to index.
Here is an example of restricting access to the /cgi-bin/ from Google:
User-agent: Googlebot
Disallow: /cgi-bin/
This time with the User-agent command we used Googlebot instead of the wildcard command “*.” This line lets Google’s spider know we’re talking to it specifically and not to crawl this folder or file.
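To illustrate that a named rule applies only to the named robot, the Python sketch below parses the Googlebot rule with the standard urllib.robotparser module and asks about two different bots. The example.com URL is a placeholder, and real crawlers may match user-agent names somewhat differently.

```python
from urllib.robotparser import RobotFileParser

GOOGLE_ONLY_RULES = """\
User-agent: Googlebot
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(GOOGLE_ONLY_RULES.splitlines())

url = "https://example.com/cgi-bin/form.cgi"
googlebot_ok = parser.can_fetch("Googlebot", url)  # the rule names this bot
bingbot_ok = parser.can_fetch("Bingbot", url)      # no rule mentions this bot

print(googlebot_ok, bingbot_ok)  # False True
```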
White Space & Comments
White space and comments can be used, but not every robot handles them the same way. When using a comment, it is always best to put it on its own line.
Example 1:
User-agent: googlebot #Google Robot
Example 2:
#Google Robot
User-agent: googlebot
Notice that in the first example the comment sits on the same line as the directive, introduced by a # character. While this is OK and will be accepted in most cases, some robots may not handle it. So be sure to use Example 2 when using comments.
If Example 1 is used and a robot does not support it, the robot may interpret the line as “googlebot#GoogleRobot” instead of “googlebot” as we originally intended.
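White space refers to putting a blank space in front of a line. Some guides describe this as a way to comment a line out, but parsers that follow the current specification (RFC 9309) ignore leading white space and still read the line, so don’t rely on it.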
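As a quick check of how one real parser treats comments and white space, the Python sketch below runs three variants through the standard urllib.robotparser module. It shows only this parser's behavior; other robots may behave differently. The helper googlebot_blocked and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def googlebot_blocked(lines: list[str]) -> bool:
    """Return True if these lines stop 'googlebot' from fetching /cgi-bin/."""
    parser = RobotFileParser()
    parser.parse(lines)
    return not parser.can_fetch("googlebot", "https://example.com/cgi-bin/")


# Comment on the same line as the directive (Example 1 style).
inline_comment = ["User-agent: googlebot #Google Robot", "Disallow: /cgi-bin/"]
# Comment on its own line (Example 2 style).
own_line_comment = ["#Google Robot", "User-agent: googlebot", "Disallow: /cgi-bin/"]
# Directive indented with leading spaces.
leading_space = ["User-agent: googlebot", "  Disallow: /cgi-bin/"]

for variant in (inline_comment, own_line_comment, leading_space):
    print(googlebot_blocked(variant))  # True each time with this parser
```

This parser strips everything after the # and trims surrounding white space before reading a line, so all three variants produce the same rule.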
Common Robot Names
Here are a few of the top robot names:
- Googlebot – Google.com
- YandexBot – Yandex.ru
- Bingbot – Bing.com
These are just a few common robots that will hit a site at any given time.
The following examples are commonly used commands for robots.txt files.
The following allows all robots to index an entire site. Notice the “Disallow:” command is blank; this tells robots that nothing is off limits.
User-agent: *
Disallow:
The following tells all robots not to crawl or index anything on a site. We used “/” in the “Disallow:” line to specify that the entire contents of the root folder should not be indexed.
User-agent: *
Disallow: /
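The difference between an empty “Disallow:” and “Disallow: /” is easy to get backwards, so here is a small Python sketch that checks both with the standard urllib.robotparser module. The helper can_crawl, the bot name AnyBot, and the example.com URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

ALLOW_EVERYTHING = ["User-agent: *", "Disallow:"]
BLOCK_EVERYTHING = ["User-agent: *", "Disallow: /"]


def can_crawl(lines: list[str], url: str) -> bool:
    """Report whether a generic bot may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch("AnyBot", url)


page = "https://example.com/products/list.html"
print(can_crawl(ALLOW_EVERYTHING, page))  # True: a blank Disallow blocks nothing
print(can_crawl(BLOCK_EVERYTHING, page))  # False: "/" matches every path
```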
The following tells all robots (indicated by the wildcard in the “User-agent:” line) not to index the cgi-bin, images, and downloads folders. It also blocks the admin.php file, which is located in the root directory. Paths must start with “/” because they are matched from the site root. Files and folders in subdirectories can be listed the same way.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /downloads/
Disallow: /admin.php
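A rule set with several “Disallow:” lines can be checked path by path. The Python sketch below does this with the standard urllib.robotparser module. The admin.php rule is written with a leading slash, since paths are matched from the site root. The bot name AnyBot and the sample paths are made up.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /downloads/
Disallow: /admin.php
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

sample_paths = [
    "/images/logo.png",
    "/downloads/file.zip",
    "/admin.php",
    "/about.html",
]

# Map each sample path to whether a generic bot may fetch it.
results = {path: parser.can_fetch("AnyBot", path) for path in sample_paths}
for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Only /about.html falls outside every listed prefix, so it is the only one of these paths reported as allowed.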
This list tells the Google Bot not to index the wp-admin folder.
User-agent: googlebot
Disallow: /wp-admin/
Whether a specific spider respects the robots.txt file is out of our control. In other words, there is no guarantee that a spider won’t visit a page you disallowed. Some bots simply ignore the file; they are mostly marketing crawlers and spammers. In fact, bad actors (hackers) may look specifically at what you disallowed, so don’t rely on robots.txt to hide sensitive pages.
You can find more information on robots.txt files on Robotstxt.org. Almost all the major sites use a robots.txt file. Just enter a site’s root URL and add /robots.txt to the end to find out whether the site uses one. If it does, the file is displayed in plain text so anyone can read it.
Remember that the robots.txt file isn’t mandatory. It’s mainly used to tell spiders what to crawl and what not to crawl. If everything is to be indexed on a site, a robots.txt file isn’t needed.