Robots.txt is a file you can create to control the crawling of your website.
It’s the practical implementation of the Robots Exclusion Protocol, which was created to prevent web crawlers from overwhelming websites with too many requests.
Even though it’s not necessary for your website to use robots.txt, having one can positively affect your business by optimizing how search engine bots crawl your site.
According to the 2021 Web Almanac, ~16.5% of websites don’t have a robots.txt file at all. Additionally, not everyone implements it correctly.
Depending on the size of your website, improperly using robots.txt can be a minor mistake or a very costly one.
This article will show you how to create a robots.txt file and avoid potential mistakes.
What is robots.txt?
Robots.txt is a simple text file that you can place on your server to control how bots access your pages. It contains rules for crawlers, defining which pages should or shouldn’t be crawled.
The file should be located at the root directory of your website. So, for example, if your website is called domain.com, the robots.txt file should live at domain.com/robots.txt.
But how does the file work? How do bots discover it?
Crawlers are programs that crawl the web. They have various uses, but search engines use them to find web content to index. This process can be divided into a few steps:
- Crawlers have a queue of URLs containing both new and previously known websites they want to crawl.
- Before crawling a website, crawlers first look for a robots.txt file in the website’s root directory.
- If no robots.txt file exists, crawlers proceed to crawl the website freely. However, if a valid robots.txt file exists, crawlers look inside it for the directives and proceed to crawl the website accordingly.
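The steps above can be sketched in Python with the standard library's urllib.robotparser module. This is only a minimal illustration, not how any particular search engine works. The bot name ExampleBot, the example.com URLs, and the helper names load_rules and filter_queue are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_queue(queue, rules, user_agent):
    """Keep only the URLs that the rules allow this user agent to crawl."""
    return [url for url in queue if rules.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and crawl queue.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

queue = [
    "https://example.com/",
    "https://example.com/admin/login",
    "https://example.com/blog/post",
]

rules = load_rules(ROBOTS_TXT)
crawlable = filter_queue(queue, rules, "ExampleBot")

# An empty rule set stands in for a site without a robots.txt file:
# nothing is disallowed, so the whole queue stays crawlable.
no_rules = load_rules("")
crawlable_without_file = filter_queue(queue, no_rules, "ExampleBot")
```

In this sketch, the /admin/login URL is dropped from the queue, while the other two URLs remain crawlable.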
If a search engine can’t crawl the page, then that page can’t be indexed, and consequently, it won’t appear on search result pages.
However, there are two caveats:
1. A page that’s blocked from crawling might still get indexed
Disallowing crawling in a robots.txt file does not guarantee that search engines won’t index the page. They might still index it if they find information about the content in other sources and decide the page is important. For example, they can find links leading to the page from other sites, then use the anchor text and show it on the search results page.
Learn how to overcome this issue by reading our article on how to fix the “Indexed, though blocked by robots.txt” status.
2. You can’t force robots to obey the rules in robots.txt
Robots.txt is only a guideline, not an obligatory rule. You can’t force the bots to obey it. Most crawlers, especially those used by search engines, won’t crawl any pages blocked by robots.txt. However, search engines are not the only ones that use crawlers. Malicious bots may ignore the instructions and access the pages anyway. That’s why you shouldn’t use robots.txt as a way of protecting sensitive data on your website from being crawled. If you need to make sure bots won’t crawl some of your content, it’s better to protect it with a password.
Why do you need a robots.txt file?
Robots.txt is not an obligatory part of your website, but a well-optimized one can benefit your site in many ways.
Most importantly, it can help you with crawl budget optimization. Search engine bots have limited resources, restricting the number of URLs they can crawl on a given website. So if you waste your crawl budget on less important pages, there might not be enough for more valuable ones. If you have a small website, this might seem like a minor issue, but anyone who maintains a large website knows how vital it is to use the resources of search engine bots efficiently.
With the robots.txt file, you can prevent certain pages, e.g., low-quality ones, from being crawled. This is crucial because many indexable, low-quality pages might affect the whole site and discourage search engine bots from crawling even the high-quality pages.
Additionally, robots.txt allows you to specify the location of your XML sitemap. A sitemap is a file listing the URLs you want search engines to index. Adding its URL to the robots.txt file makes it easier for search engine bots to find it.
How to modify the robots.txt file
How you can modify your robots.txt file is highly dependent on the system you use.
If you’re using a CMS or an eCommerce platform, you might have access to dedicated tools or plugins that can help you access and modify the file easily. For example, Wix and Shopify allow you to edit robots.txt directly. For WordPress, you can use plugins like Yoast SEO.
If you don’t use a CMS or an eCommerce platform, you might need to download the file first, edit it, and then upload it back to your site.
You can download the file in various ways:
- Display the file in your browser by adding “/robots.txt” to your root domain URL and then simply copy the content.
- Use the tools provided by your hosting service. For example, it might be a dedicated panel for managing files or access through the FTP protocol.
- Use console tools like cURL to download the file by typing this command:
curl https://example.com/robots.txt -o robots.txt
Once you download robots.txt, you can simply edit it in your text editor of choice, like Notepad (Windows) or TextEdit (Mac). Make sure to encode the file in UTF-8, and remember that it must be named “robots.txt”.
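If you prefer scripting over a text editor, the download and save steps can be sketched in Python. This is a minimal example using only the standard library. The function names download_robots_txt and save_robots_txt are made up for illustration, and the download helper assumes the server returns UTF-8 text.

```python
from pathlib import Path
from urllib.request import urlopen


def download_robots_txt(site_root: str) -> str:
    """Fetch <site_root>/robots.txt and return its text (assumes UTF-8)."""
    with urlopen(site_root.rstrip("/") + "/robots.txt") as response:
        return response.read().decode("utf-8")


def save_robots_txt(content: str, directory: str) -> Path:
    """Write the content as UTF-8 to a file that is always named robots.txt."""
    target = Path(directory) / "robots.txt"
    target.write_text(content, encoding="utf-8")
    return target
```

For example, calling save_robots_txt(download_robots_txt("https://example.com"), ".") would store a local copy in the current directory, ready for editing.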
After modifying robots.txt, you can upload the file in much the same way you downloaded it. You can use dedicated tools provided by your hosting service, use built-in CMS tools, or send the file directly to the server over FTP.
Once your file is publicly available, search engines can find it automatically. If, for some reason, you want search engines to see the changes right away, you can use the Submit option in Google’s and Bing’s robots.txt testers.
Robots.txt consists of blocks of text. Each block starts with a User-agent string and groups directives (rules) for a specific bot.
Here’s an example of the robots.txt file:
User-agent: *
Disallow: /admin/
Disallow: /users/

#specific instructions for Googlebot
User-agent: Googlebot
Allow: /wp-admin/
Disallow: /users/

#specific instructions for Bingbot
User-agent: Bingbot
Disallow: /admin/
Disallow: /users/
Disallow: /not-for-Bingbot/
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml
There are hundreds of crawlers that may want to access your website. That’s why you might wish to define different boundaries for them based on their intentions. Here’s when User-agent may come in handy.
User-agent is a string of text that identifies a specific bot. So, for example, Google uses Googlebot, Bing uses Bingbot, DuckDuckGo uses DuckDuckBot and Yahoo uses Slurp. Search engines can also have more than one User-agent. Google and Bing each publish a complete list of the User-agents they use.
User-agent is a required line in every group of directives. You can think about it as calling bots by their names and giving each of them a specific instruction. All the directives that follow a User-agent will be aimed at the defined bot until the new User-agent is specified.
You can also use a wildcard and give instructions to all the bots at once. I will cover the wildcards later.
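To make the grouping concrete, here is a rough Python sketch that splits a robots.txt file into groups keyed by User-agent. It's a simplified illustration rather than a complete parser, and the function name parse_groups is an assumption for this example.

```python
def parse_groups(robots_txt: str) -> dict:
    """Map each User-agent name to the (directive, value) rules in its group."""
    groups = {}
    current_agents = []
    last_was_agent = False
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "sitemap":
            continue  # sitemap lines apply to the whole file, not to a group
        if field == "user-agent":
            # Consecutive User-agent lines share a group; otherwise start a new one.
            if not last_was_agent:
                current_agents = []
            current_agents.append(value)
            groups.setdefault(value, [])
            last_was_agent = True
        else:
            for agent in current_agents:
                groups[agent].append((field, value))
            last_was_agent = False
    return groups


EXAMPLE = """\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /users/
"""

groups = parse_groups(EXAMPLE)
```

Each directive ends up attached to the most recently named bot, which mirrors how the rules are read from top to bottom.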
Directives are the rules you define for search engine bots. Each block of text can have one or more directives. Each directive needs to start on a separate line.
The directives include:
- Disallow,
- Allow,
- Sitemap,
- Crawl-delay.
Note: There is also an unofficial noindex directive that’s supposed to indicate that a page should not be indexed. However, most search engines, including Google and Bing, do not support it. If you don’t want some pages to be indexed, use the noindex Meta Robots Tag or X-Robots-Tag header (I will explain them later in the article).
User-agent: Googlebot
Disallow: /users/
The disallow directive specifies which pages shouldn’t be crawled. By default, search engine bots can crawl every page not blocked by the disallow directive.
To block access to a particular page, you need to define its path in relation to the root directory.
Let’s imagine you have these two pages on your website:
- /products/shoes/item1.html
- /products/shirts/item2.html
Now let’s look at some examples of blocking these paths:
| Directive | Result |
|---|---|
| Disallow: /products/shoes/item1.html | Only /products/shoes/item1.html is disallowed |
| Disallow: /products/ | Both /products/shoes/item1.html and /products/shirts/item2.html are disallowed |
You can disallow the crawling of the whole site by adding the “/” symbol in the following way:
User-agent: Googlebot
Disallow: /
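Because a disallow value is compared against the beginning of the URL path, the behavior can be approximated with a simple prefix check. The Python sketch below uses the two product pages mentioned above. The helper name is_disallowed is an assumption, and wildcards are deliberately left out here.

```python
def is_disallowed(path: str, disallow_rules) -> bool:
    """Return True if the path starts with any non-empty Disallow value."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)


pages = ["/products/shoes/item1.html", "/products/shirts/item2.html"]

# A rule naming one page, a rule naming the directory, and the whole-site rule.
blocked_by_page_rule = [p for p in pages if is_disallowed(p, ["/products/shoes/item1.html"])]
blocked_by_directory_rule = [p for p in pages if is_disallowed(p, ["/products/"])]
blocked_by_root_rule = [p for p in pages if is_disallowed(p, ["/"])]
```

Since every path starts with “/”, the last rule blocks both pages, which is why a single slash disallows the whole site.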
User-agent: Googlebot
Disallow: /users/
Allow: /users/very-important-user.html
You can use the allow directive to allow crawling of a page in an otherwise disallowed directory.
In the example above, all pages inside the /users/ directory are disallowed except for /users/very-important-user.html.
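When an allow and a disallow rule both match a URL, the Robots Exclusion Protocol standard (RFC 9309) and Google's documentation say the most specific rule, meaning the longest matching path, takes precedence. Other crawlers may resolve conflicts differently. Here is a minimal Python sketch of that precedence, assuming plain path prefixes without wildcards. The function name is_allowed and the extra test paths are made up for illustration.

```python
def is_allowed(path: str, rules) -> bool:
    """Apply the most specific (longest) matching rule; Allow wins a tie."""
    best_length = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


rules = [
    ("Disallow", "/users/"),
    ("Allow", "/users/very-important-user.html"),
]

important = is_allowed("/users/very-important-user.html", rules)
other = is_allowed("/users/someone-else.html", rules)
unrelated = is_allowed("/blog/", rules)
```

The allow rule is longer than the disallow rule, so it wins for that one page, while every other path under /users/ stays blocked.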
The sitemap directive specifies the location of your sitemap. You can add it at the beginning or end of your file and define more than one sitemap.
Unlike the paths defined in other directives, the sitemap must always be given as a full URL, including the protocol (HTTP or HTTPS) and the www or non-www version of the domain.
The sitemap directive is not required, but it’s highly recommended. Even if you submitted your sitemap in Google Search Console or Bing Webmaster Tools, it’s always a good idea to add it to your robots.txt file to help all of the search engine bots find it quicker.
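To see how a program can pick up these lines, here is a short Python sketch that reads sitemap URLs with the standard library's urllib.robotparser module (its site_maps() method requires Python 3.8 or newer). The second sitemap URL is a made-up example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that lists two sitemaps.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-blog.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps() or []
```

The sitemap lines are read independently of the User-agent groups, so they can sit anywhere in the file.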
Search engine bots can crawl many of your pages in a short amount of time. Each crawl uses a part of your server’s resources.
If you have a big website with many pages, or opening each page requires a lot of server resources, your server might not be able to handle all requests. As a result, it will become overloaded, and both users and search engines might temporarily lose access to your site. That’s where the Crawl-delay directive may come in handy and slow down the crawling process.
The value of the Crawl-delay directive is defined in seconds. You can set it between 1 and 30 seconds.
It’s important to note that not every search engine follows this directive. For example, Google doesn’t support Crawl-delay at all.
Additionally, its interpretation may vary between search engines. For example, for Bing and Yahoo, Crawl-delay defines the length of a time window during which the bot can access the page only once.
For Yandex, Crawl-delay specifies the amount of time the bot needs to wait before requesting another page.
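The "wait between requests" interpretation can be sketched in Python as follows. This is a simplified illustration, not the behavior of any real search engine bot. The bot name ExampleBot, the URLs, and the function crawl_politely are assumptions. The fetch and sleep functions are passed in so the example runs without real requests or real waiting.

```python
import time
from urllib.robotparser import RobotFileParser


def crawl_politely(urls, rules, user_agent, fetch, sleep=time.sleep):
    """Fetch allowed URLs, pausing for the Crawl-delay between requests."""
    delay = rules.crawl_delay(user_agent) or 0
    allowed = [url for url in urls if rules.can_fetch(user_agent, url)]
    results = []
    for index, url in enumerate(allowed):
        if index > 0 and delay:
            sleep(delay)  # wait before requesting the next page
        results.append(fetch(url))
    return results


ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /users/
Crawl-delay: 10
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

pauses = []
fetched = crawl_politely(
    ["https://example.com/a", "https://example.com/users/1", "https://example.com/b"],
    rules,
    "ExampleBot",
    fetch=lambda url: url,  # stand-in for a real HTTP request
    sleep=pauses.append,    # record pauses instead of actually waiting
)
```

With a delay of 10 seconds, crawling 1,000 allowed pages this way would take close to three hours, which shows how strongly this directive can slow a crawler down.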
Comments in robots.txt
#Blocks access to the blog section
User-agent: Googlebot
Disallow: /blog/

User-agent: Bingbot
Disallow: /users/ #blocks access to users section
You can add comments in your robots.txt file by adding the hash # character at the beginning of a line or after a directive. Search engines ignore everything that follows the # in the same line.
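A parser can handle comments by cutting each line at the first hash character. Here is a small Python sketch of that idea. The helper name strip_comment is an assumption for this example.

```python
def strip_comment(line: str) -> str:
    """Return the line without its comment and surrounding whitespace."""
    return line.split("#", 1)[0].strip()


lines = [
    "#Blocks access to the blog section",
    "User-agent: Googlebot",
    "Disallow: /blog/",
    "User-agent: Bingbot",
    "Disallow: /users/ #blocks access to users section",
]

# Lines that contain only a comment become empty and are dropped.
directives = [cleaned for cleaned in map(strip_comment, lines) if cleaned]
```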
Comments are meant for humans to explain what a specific section means. It’s always a good idea to add them because they will allow you to understand quicker what’s going on the next time you open the file.
You can use comments to add easter eggs to the robots.txt file. If you want to learn more about it, you can check out our article on making your robots directives fun for humans or see an example in our robots.txt.
Wildcards are special characters that can work as placeholders for other symbols in the text, and therefore simplify the process of creating the robots.txt file. They include:
- Asterisk *, and
- Dollar sign $.
The asterisk can replace any string.
For example, in the line “User-agent: *”, the asterisk stands for all search engine bots. Therefore, every directive that follows it is aimed at all crawlers.
You can also use it to define a path. For example, “Disallow: /*?” blocks every URL that contains a “?”.
The dollar sign marks the end of a URL.
For example, “Disallow: /*.jpeg$” blocks every URL that ends with “.jpeg”.
You can use wildcards in every directive, except sitemap.
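One way to reason about wildcard rules is to translate them into regular expressions. The Python sketch below does that for the asterisk and the dollar sign. It's an approximation for experimenting, not the matcher any search engine actually uses, and the function names and sample paths are made up for illustration.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn each escaped asterisk back into "any string".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """Patterns always match from the start of the path."""
    return pattern_to_regex(pattern).match(path) is not None


query_rule_hits = matches("/*?", "/shop/list?page=2")
query_rule_misses = matches("/*?", "/shop/list")
jpeg_rule_hits = matches("/*.jpeg$", "/images/photo.jpeg")
jpeg_rule_misses = matches("/*.jpeg$", "/images/photo.jpeg?size=large")
```

Testing a few real URLs from your site against a pattern like this is a quick way to catch a wildcard that blocks more than you intended.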
Testing the robots.txt file
You can also edit the file directly in the robots.txt testers and retest the changes. Keep in mind that the changes are not saved on your website. You need to copy the file and upload it to your site on your own.
If you’re more tech-savvy, you can also use Google’s open-source robots.txt library to test the robots.txt file locally on your computer.
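Google's library is written in C++. For a lighter-weight local check, you could sketch something similar with Python's standard urllib.robotparser module. Keep in mind that this is a different parser: it only compares plain path prefixes, so it may disagree with Google on wildcards or on allow/disallow precedence. The function name find_mismatches and the URLs are assumptions for this example.

```python
from urllib.robotparser import RobotFileParser


def find_mismatches(robots_txt, user_agent, expectations):
    """Return the URLs whose crawl permission differs from the expected value."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url, should_be_allowed in expectations.items()
        if parser.can_fetch(user_agent, url) != should_be_allowed
    ]


ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /users/
"""

# Map each URL to True (should be crawlable) or False (should be blocked).
expectations = {
    "https://example.com/users/1": False,
    "https://example.com/blog/": True,
}

mismatches = find_mismatches(ROBOTS_TXT, "Googlebot", expectations)
```

An empty result means the file behaves as expected for the listed URLs. Any URL that shows up in the list deserves a closer look before you upload the file.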
Robots.txt vs. Meta Robots Tag vs. X-Robots-Tag
Robots.txt is not the only way to communicate with crawlers. You can also use the Meta Robots Tag and X-Robots-Tag.
The most important difference is the fact that robots.txt controls the crawling of a website, while Meta Robots Tag and X-Robots-Tag allow you to control its indexing.
Among other things, these methods also differ in ways of implementation.
| Method | Implementation |
|---|---|
| Robots.txt | Simple text file added at the root directory of your website. |
| Meta robots tag | HTML tag added in the <head> section of the code. |
| X-Robots-Tag | Part of an HTTP response header added on the server-side. |
When a search engine bot finds a page, it will first look inside the robots.txt file. If crawling is not disallowed, it can access the page, and only then can it find potential Meta Robots Tags or X-Robots-Tag headers. This is important to remember for two reasons:
- Combining the methods – search engine bots need to be allowed to crawl the page to see the Meta Robots Tag and X-Robots-Tag. If bots cannot access the page, these tags won’t work.
- Optimizing the crawl budget – among these three methods, only robots.txt can help you save the crawl budget.
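To illustrate why a bot has to fetch the page before it can see these signals, here is a Python sketch that collects indexing directives from a response that has already been downloaded. It reads a plain X-Robots-Tag header and any Meta Robots Tag in the HTML. The class and function names are assumptions, and the sketch ignores bot-specific header values such as “googlebot: noindex”.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def indexing_directives(headers: dict, html: str) -> set:
    """Combine directives from the X-Robots-Tag header and the Meta Robots Tag."""
    header_value = headers.get("X-Robots-Tag", "")
    found = {d.strip().lower() for d in header_value.split(",") if d.strip()}
    parser = MetaRobotsParser()
    parser.feed(html)
    return found | set(parser.directives)


# Hypothetical response data for a page the bot was allowed to crawl.
headers = {"X-Robots-Tag": "noindex"}
html = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'

directives = indexing_directives(headers, html)
```

Both inputs come from the page's HTTP response, so if robots.txt blocks the URL, a compliant bot never gets to read either of them.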
Here are some best practices and tips for creating a robots.txt file:
- Make sure to add the link to your sitemap to help all search engine bots find it easily.
- Interpretation of robots.txt syntax may differ depending on a search engine. Always double-check how a search engine bot treats a specific directive if you’re not sure.
- Be careful when using wildcards. If you misuse them, you might block access to the whole section of your site by mistake.
- Don’t use robots.txt to block your private content. If you want to secure your page, it’s better to protect it with a password. Additionally, the robots.txt file is publicly accessible, and you could potentially disclose the location of your private content to dangerous bots.
- Disallowing crawlers from accessing your site won’t remove it from the search results page. If there are many links with descriptive anchor text pointing to your page, it can still be indexed. If you want to prevent it, you should consider using Meta Robots Tag or X-Robots-Tag header instead.