In this article, you will find up-to-date information about the crawl budget. You will understand how the crawling process works and which factors can affect it. Then you will learn what actions you can take to improve the crawl efficiency of a website.
Unfortunately, Googlebot is not Scrooge McDuck. It has a limited budget…
How do crawling and indexing work?
To really understand the crawl budget we have to learn how Google determines which and how many pages to crawl. The process is based on the following three principles: 1) crawl rate limit, 2) scheduling, and 3) crawl demand.
Crawl rate limit
Crawl rate is the number of “parallel connections Googlebot may use to crawl the site, as well as the time it has to wait between the fetches.” And as described on the Webmaster Central Blog, “Googlebot is designed to be a good citizen of the web.” It has to consider your server’s capabilities and make sure that it doesn’t overload the server while crawling your website. Therefore, Google adjusts the crawl rate to the server’s response: the slower the server responds, the lower the crawl rate.
Scheduling
The complexity of the crawling process requires Googlebot to create a list of addresses that it intends to visit (as Gary Illyes explained). The requests to the listed URLs are then queued. This list isn’t random. The whole process is called scheduling, and to prioritize the URLs, Google uses a sophisticated mechanism called crawl demand.
Crawl demand
This factor determines which and how many pages should be visited during a single crawl. If Googlebot considers a certain URL important enough, it will place it higher in the schedule. According to Google, the importance of a URL is evaluated by:
- Popularity – URLs that often get shared and linked on the Internet will be considered more important, and thus will have a bigger chance of being crawled by Googlebot. As you can see, popularity is closely related to the authority of a page and PageRank.
- Staleness – Generally speaking, fresh content has a higher priority than pages that haven’t changed much over the years.
In fact, we have seen firsthand how important new content is for Google, and how it can directly influence the crawl budget. The website of one of our clients suffered from a bug that caused a massive increase in the number of URLs. It went up from about 250K to more than 4.5 million in just one second. Soon after, this massive influx of new pages led to a sharp increase in crawl demand. The situation could be observed in Google Search Console:
However, it’s worth noting that the impact of the new content on the crawl budget was only temporary. Right after all of the new URLs had been visited, the number of pages crawled per day returned to its previous level, even more rapidly than it had increased.
What probably happened is that Google realized these newly discovered pages weren’t of high quality and decided not to visit them too often.
Why is the crawl budget so important?
There was a heated discussion under John Mueller’s tweet in which he stated: “IMO crawl-budget is over-rated. Most sites never need to worry about this.” And if you read the Webmaster Central Blog post that I mentioned before, you probably came across the following statement:
Of course, as an SEO specialist, I can agree that crawl rate optimization will benefit mostly huge websites (such as big e-commerce shops). From our experience at Onely, if a website contains more than 500K URLs, it’s almost certain to suffer serious crawling issues. If you own such a website, you certainly shouldn’t overlook this aspect.
This man didn’t pay attention to the budget and now the whole country is screwed. Don’t let this happen to your website.
But what about smaller domains?
Well, I’d say in many cases you can get away with not caring about the crawl budget if there aren’t any serious issues affecting it. The problem is, you won’t be aware of the situation unless you actually start investigating the matter.
Furthermore, even if a website seems rather small at first glance, it might in fact contain tens of thousands of URLs. Just imagine that faceted navigation can easily turn 100 pages into 10,000 unique URLs. Not to mention the bugs and misimplementations in a CMS that in some circumstances can produce interesting results. Just recently I came across a website that consisted mostly of homepage duplicates and copies of the offer pages, all because a custom-built CMS didn’t have any way of handling non-existing URLs.
In this case you can guess where the majority of the crawl budget was spent.
Considering all the above, at the very least it’s worth checking the website’s crawl budget and seeing if everything is working fine.
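To put a number on the faceted navigation example, here is a minimal Python sketch. The facet names, their values and the example.com URLs are made up for illustration; the only point is how quickly the combinations multiply.

```python
from itertools import product

# Hypothetical facets for a small shop; the values are illustrative only.
FACETS = {
    "color": ["red", "green", "blue", "black", "white"],
    "size": ["s", "m", "l", "xl"],
    "sort": ["price", "popularity", "newest", "rating", "name"],
}


def facet_urls(base_url: str, facets: dict[str, list[str]]) -> list[str]:
    """Return every URL produced by choosing one value per facet."""
    names = list(facets)
    urls = []
    for combo in product(*(facets[name] for name in names)):
        query = "&".join(f"{name}={value}" for name, value in zip(names, combo))
        urls.append(f"{base_url}?{query}")
    return urls


# 100 hypothetical category pages, each exposed to the same facets.
category_pages = [f"https://example.com/category-{i}" for i in range(100)]
all_urls = [url for page in category_pages for url in facet_urls(page, FACETS)]
print(len(category_pages), "pages ->", len(all_urls), "crawlable URLs")
```

With 5 colors, 4 sizes and 5 sort orders, every category page gets 100 variants, which is how 100 pages become 10,000 URLs.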
Get insights on how the bots are crawling your website
In order to optimize the crawl budget, first you will need to identify what issues are affecting it. There are several ways you can get some insights on what Googlebot is actually crawling within your website.
Google Search Console
GSC is an essential tool in the arsenal of every SEO expert. It gives you a lot of useful information about the status of the website within Google Search. And last year, the new version of GSC rolled out of beta. The updated tool offers a lot of useful new functionality, which was described at length in Tomek Rudzki’s article about the new GSC.
Below I will highlight a few of the features that can give you valuable information (and since not all of the old features of the tool made their way to the new version, I will still cover some of the old GSC reports):
- The Coverage section in Overview (in the new GSC) will show you the number of indexed pages in the form of a graph. See the huge rise in the screenshot below? Such a rapid increase in the number of indexed URLs should make you suspicious.
- The Index Coverage report in the new GSC can tell you which parts of the website were visited by Googlebot. You can see not only the indexed URLs, but also the pages excluded from the index (due to canonical, noindex meta tags or other causes), etc. It’s great for discovering issues, but unfortunately you are restricted to a limited number of examples. However, if for some reason you don’t have access to the server log files, the Index Coverage report is the closest you can get.
- Crawl > Crawl stats in the old GSC (unfortunately, the new version of the tool still doesn’t offer this feature, but you can click the “Go to the old version” button to see it) will show you how the number of pages crawled per day changed over time. As you have already learned, an abnormal increase in the number of crawled URLs can be caused by a sudden rise in crawl demand (e.g. thousands of new URLs suddenly appeared).
Server Log Analysis
Server log files contain entries regarding every visitor of your website, including Googlebot. Here you will find the exact data of what was actually crawled by Google (all JS, CSS, images and other resources included). If instead of crawling your valuable content Googlebot wanders astray, the log analysis will tell you about it.
I can see what you’re doing there, you lousy Googlebot!
What tool to use?
To get a good picture, you need to extract logs covering at least three weeks, or even a whole month. Such logs can reach some ridiculous file sizes, so you will have to use a tool capable of processing such a large amount of data.
Fortunately, such dedicated software exists:
Another option is to use Splunk. It’s super expensive, but you can download the trial version for free with no restriction regarding the file sizes or the number of entries. The trial should be enough for a single SEO project. If you decide to choose this tool, you should definitely check out our article on how to perform a server log analysis in Splunk, and learn to do it like a pro.
How to identify the correct user agent?
Since the log files contain entries of every visitor, you need to be able to extract only the data concerning the Googlebot. How should you do that? If your idea was to find its User-Agent string, I’m afraid that’s the wrong answer.
That was your million-dollar question.
Because anyone can pretend to be Googlebot (even you, by simply changing the UA string in Chrome Developer Tools), the best approach is to filter Googlebot by IP. We wrote an entire article about identifying different crawlers. But, to make a long story short, the IPs of Googlebot generally begin with “66.249”.
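If you want something stricter than the IP prefix, you can run a reverse DNS lookup on the IP and then a forward lookup on the returned hostname. Below is a rough Python sketch of that check, not a production implementation. The function names are my own, the list of accepted hostname suffixes is an assumption you should compare with Google’s current documentation, and the resolver arguments exist only so that the logic can be tried without network access. Note that socket.gethostbyname_ex handles IPv4 only.

```python
import socket

# Assumed hostname suffixes for Google crawlers; verify against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_hostname(hostname: str) -> bool:
    """True if the hostname sits under one of the accepted Google domains."""
    return hostname.rstrip(".").lower().endswith(GOOGLE_SUFFIXES)


def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-resolve it back."""
    try:
        hostname = reverse(ip)[0]
        if not is_google_hostname(hostname):
            return False
        # The forward lookup must lead back to the IP found in the logs.
        return ip in forward(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False


# Usage on a live machine (performs real DNS queries):
# is_verified_googlebot("66.249.66.1")
```

A check like this is slow compared with a simple prefix filter, so in practice you would probably cache the result for each IP address found in the logs.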
What should you check during the server log analysis?
There are a number of aspects you should always investigate while performing a server log analysis:
- Status codes. Healthy logs should consist mostly of status 200s and 301s (304s can also appear if you’re using a cache policy). If any other status codes appear in considerably large numbers, it’s time to worry. You should look for 404 pages, as well as 5xx errors. The latter might indicate serious performance-related issues of your server.
- Most frequently crawled parts of your website. You should check which directories and pages get the largest number of visits. In an ideal world, the bot should be mostly crawling the parts where your most valuable content is placed. For example, on an e-commerce store you want it to visit the product and category pages.
- URL parameters. By investigating the server logs, you can easily identify all the URL parameters that are being used on your website. Then you will be able to configure the bot behavior in GSC. Parameters that don’t change the content of a page (such as sorting by price, popularity etc.) can be blocked from crawling in your Google Search Console.
- Check if your website is affected by mobile-first indexing. To do that, compare the number of visits coming from the mobile and desktop Googlebot (the list of UA strings used by Google is here – it will help you distinguish the bots). If mobile visits account for more than 50% of all Googlebot visits, then you have a strong clue that mobile-first indexing has hit your website. In such a case, you should definitely read our Quick and Easy Guide to Mobile-first Indexing.
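The checks listed above can be sketched in a few lines of Python. This is only an illustration under several assumptions: the logs are in the Combined Log Format, Googlebot is recognized by the “66.249.” prefix mentioned earlier, a “Mobile” token in the UA string marks the smartphone crawler, and the sample entries (including the shortened UA strings) are invented. Real log files will need a pattern adjusted to your server configuration.

```python
import re
from collections import Counter

# Combined Log Format; adjust the pattern if your server logs differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def summarize(lines, ip_prefix="66.249."):
    """Count status codes, top-level sections and device types for Googlebot hits."""
    statuses, sections, devices = Counter(), Counter(), Counter()
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m is None or not m["ip"].startswith(ip_prefix):
            continue  # unparsable line or not a Googlebot IP
        statuses[m["status"]] += 1
        path = m["path"].split("?", 1)[0]  # drop URL parameters
        sections["/" + path.strip("/").split("/", 1)[0]] += 1
        devices["mobile" if "Mobile" in m["ua"] else "desktop"] += 1
    return statuses, sections, devices


# Invented sample data with shortened UA strings.
MOBILE_UA = "Mozilla/5.0 (Linux; Android 6.0.1) Mobile Safari/537.36 (compatible; Googlebot/2.1)"
DESKTOP_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def entry(ip, path, status, ua):
    return f'{ip} - - [10/Jan/2019:10:00:00 +0000] "GET {path} HTTP/1.1" {status} 512 "-" "{ua}"'


sample = [
    entry("66.249.66.1", "/products/shoe-1", 200, MOBILE_UA),
    entry("66.249.66.1", "/products/shoe-2?sort=price", 200, MOBILE_UA),
    entry("66.249.66.2", "/profile-calendar/2100-12", 200, DESKTOP_UA),
    entry("66.249.66.1", "/old-page", 404, MOBILE_UA),
    entry("203.0.113.7", "/products/shoe-1", 200, MOBILE_UA),  # spoofed UA, ignored
]

statuses, sections, devices = summarize(sample)
mobile_share = devices["mobile"] / sum(devices.values())
print(statuses.most_common(), sections.most_common(3), f"{mobile_share:.0%} mobile")
```

For a month of logs you would feed the function a file object instead of a list, so the lines are processed one at a time rather than loaded into memory.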
Learn how to optimize the Crawl Budget
You have already learned a lot about the crawl budget, how it works and the ways you can track the bot’s behavior on your pages. However, before I describe common bugs and how to solve them, I have to introduce you to two tools essential for crawl budget optimization.
Robots.txt, The Lord of The Bots
A tiny text file called robots.txt is one of the most powerful tools you can use while fixing crawling issues. It contains directives that Googlebot has to obey.
This is what I call an exaggeration.
I assume that most people reading this kind of article are already familiar with the functionality of the file, as well as the basic directives. Nevertheless, if you wish to get more info, I strongly recommend checking the official Google documentation about robots.txt. You can also read this ultimate guide on the subject.
The easiest way to optimize the bot’s budget is by simply excluding certain sections of your website from being crawled by Google. For example, during the log analysis of one of our clients, we found out that instead of crawling the service offer, the bot was eagerly spending its time visiting irrelevant calendar pages. Adding Disallow: /profile-calendar to robots.txt solved the issue.
Things to keep in mind:
- The Disallow: directive in robots.txt won’t stop a page from being indexed. It only stops Googlebot from crawling the page. If Google discovers the URL through links from other pages, the URL might still get indexed, just without its content. If you wish for a certain page not to appear in the Google index, you should use a meta robots noindex tag, and the page has to remain crawlable so that the bot can actually see the tag.
- You should never disallow paths of resources (such as CSS and JS) that are essential for proper page rendering. The bot has to be able to discover the full content of pages.
- After you create a robots file, remember to submit it to Google via Google Search Console.
- While disallowing and allowing certain directories and paths, it’s easy to mess things up and block a necessary URL by accident. Therefore, you should use a dedicated tool to check your set of directives.
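As a quick sanity check before deploying a new set of directives, you can also test them locally. The sketch below uses Python’s standard urllib.robotparser module; the example.com URLs and the helper name blocked_urls are hypothetical. Keep in mind that this parser is only an approximation of Googlebot: as far as I know it applies rules in file order and doesn’t understand the wildcard syntax Google supports, so treat the result as a first pass rather than a verdict.

```python
from urllib.robotparser import RobotFileParser

# The directive from the calendar example, applied to all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profile-calendar
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt content disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


urls = [
    "https://example.com/profile-calendar/2100-12",
    "https://example.com/services/",
    "https://example.com/assets/app.js",  # rendering resource, must stay crawlable
]
print(blocked_urls(ROBOTS_TXT, urls))
```

Here only the calendar URL should come back as blocked, while the service page and the JS file remain crawlable.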
Creating a sitemap.xml can help in efficient crawling
According to Gary Illyes, an XML sitemap is the second best way for Google to discover pages (number one being, obviously, links). It’s not a huge discovery, as we all know that a properly created sitemap.xml file serves as guidance for Googlebot. It can find all the important pages of your website there, as well as notice recent changes. Hence it’s essential to keep your sitemaps fresh and free of bugs.
A sitemap.xml file is like a map for Googlebot. Because it, um, maps things…on your website.
A single sitemap file shouldn’t contain more than 50,000 URLs. If the number of unique, indexable pages on your website is bigger, you should create a sitemap index that contains links to all of the sitemap files.
What should be included in the sitemap (all of the following conditions must be met)?
- URLs returning an HTTP status code 200;
- URLs having meta robots tags: index, follow;
- Canonical pages (in other words, NOT canonicalized to a different page).
You should also submit your sitemap to Google by using GSC. It’s also considered good practice to place a link to your sitemap in the robots.txt file by using the Sitemap directive, for example Sitemap: https://www.example.com/sitemap.xml (the URL here is only a placeholder).
More information about sitemap files can be found in the official Google documentation.
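To illustrate the 50,000 URL limit and the sitemap index described above, here is a minimal Python sketch that splits a list of URLs into several sitemap files and builds an index pointing to them. The file names (sitemap-1.xml, sitemap-index.xml) and the example.com domain are assumptions made for the example, and the function returns XML strings without the XML declaration you would normally add when writing the files to disk.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50000  # per-file limit quoted in the article


def build_sitemaps(urls, base_url, max_urls=MAX_URLS):
    """Return {filename: xml_string} for the sitemap files plus an index."""
    files = {}
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n, start in enumerate(range(0, len(urls), max_urls), start=1):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + max_urls]:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        name = f"sitemap-{n}.xml"  # hypothetical naming scheme
        files[name] = ET.tostring(urlset, encoding="unicode")
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = f"{base_url}/{name}"
    files["sitemap-index.xml"] = ET.tostring(index, encoding="unicode")
    return files


# 120,000 made-up URLs end up in three sitemap files and one index.
urls = [f"https://example.com/page-{i}" for i in range(120_000)]
files = build_sitemaps(urls, "https://example.com")
print(sorted(files))
```

The list passed to the function should already be filtered down to pages that meet all three conditions listed earlier (status 200, indexable, canonical).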
Common issues affecting the crawl budget and how to solve them
In the last part of our great crawling adventure I will focus on a number of factors that might have a negative impact on crawling. In order to optimize the crawl budget, you will have to be able to recognize and avoid common pitfalls.
A great crawling adventure.
This is directly caused by the way Google’s crawling and indexing works. While the process is fairly simple for HTML pages, rendering a JS-heavy website is like trying to settle official matters in a government-run office in Poland. It requires a lot of additional work that delays an otherwise simple process.
- The first wave happens right after the crawler visits the page. Then the indexer (called Caffeine) renders the initial HTML.
John Mueller’s tweet about the two waves of indexing.
Solving the JS rendering issue
We have confirmed information that Google plans to integrate crawling and rendering. Until that happens, we have to find a way to overcome the existing problem. And the best solution is to take the rendering work away from Google, by implementing dynamic rendering or hybrid rendering.
As the name implies, dynamic rendering means that the site will detect in real-time what kind of visitor sent a request to the server. As a result, normal users will still get client-side rendered, fully-interactive content, while search engine crawlers will be served prerendered, static HTML.
There are external service providers (such as Prerender.io and Prerender Cloud) that will crawl, render, and cache your content, serving it to the search engines. The big plus of this solution is that it’s easy to set up, and doesn’t require you to create and maintain your own infrastructure.
Alternatively, the whole process can be done on your own server. In such a case, you can use two open source solutions based on the headless Chrome browser that will enable appropriate functionality:
You can also install a prerender.io middleware that will perform the same task.
Of course, you will have to ensure that appropriate content will be served to the correct visitor. For this, your server will have to check the user-agent string (adding reverse DNS lookup to ensure the request actually comes from Google). Also, it is important to differentiate between mobile and desktop user-agents, to always serve content adjusted to the type of device.
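The decision described above boils down to two questions: is the visitor a verified crawler, and is it a mobile or a desktop client? Here is a simplified Python sketch of that logic. The token lists and the variant names are my own assumptions, not an official list, and ip_is_verified stands for the result of a reverse DNS check performed elsewhere.

```python
# Illustrative token lists; a real setup would maintain these more carefully.
BOT_TOKENS = ("googlebot", "bingbot")
MOBILE_TOKENS = ("mobile", "android", "iphone")


def choose_variant(user_agent: str, ip_is_verified: bool) -> str:
    """Pick which version of a page to serve for a given request."""
    ua = user_agent.lower()
    device = "mobile" if any(token in ua for token in MOBILE_TOKENS) else "desktop"
    claims_to_be_bot = any(token in ua for token in BOT_TOKENS)
    # Only crawlers that pass the IP verification get the prerendered HTML.
    if claims_to_be_bot and ip_is_verified:
        return f"prerendered-{device}"
    return f"client-side-{device}"


desktop_bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(choose_variant(desktop_bot, ip_is_verified=True))
print(choose_variant(desktop_bot, ip_is_verified=False))
```

A visitor that only claims to be Googlebot in its UA string, but fails the verification, is treated like any other user and gets the client-side rendered version.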
The whole feature might seem tricky to implement, therefore I advise you to take a look at the Google Guidelines for Dynamic Rendering. Furthermore, a recent article from Maria Cieślak (Onely’s Head of Technical SEO) highlights the pros and cons of different solutions. Additional information about different rendering options can also be found here.
Low performance has an impact on the crawl rate
See the difference?
Earlier in the article, I explained that the crawl rate will be adjusted according to your server capabilities. Poor website performance can result in the server being easily overloaded, and a decreasing number of visits received from Googlebot as a consequence. In fact, during our cooperation with a number of clients, we observed a direct correlation between the number of pages crawled per day and the time spent downloading a page (which can be observed in Google Search Console).
The performance can be affected by a number of factors, but I can give you a few tips that will help you in your quest to decrease page loading times:
- Decrease TTFB – Time To First Byte is the delay between sending the request to the server and receiving the first byte of data. The easiest way to improve TTFB is by implementing server-side caching.
- Audit the Website’s Performance – you can use a number of tools to investigate what issues are currently affecting your page loads, and find opportunities for improvement. Your best pick would be Google Lighthouse, which is already integrated with Google Chrome. Other tools created by Google, such as Mobile Speed Test and PageSpeed Insights, will also provide useful information. You can use GTMetrix and WebPageTest to get additional insights.
- Learn how to improve performance from The Ultimate Guide to Website Speed. And if your website is set up on WordPress, you should definitely check out our blog post that will help you decrease page loading time in a few simple steps.
- Make sure your mobile website is optimized to the edge. And the first step to success is picking the right technology. Check out the Mobile Technology Showdown and choose the best one for your website.
Internal redirects can kill the budget
Every time a bot encounters a redirected URL, it has to send one additional request just to get to the final URL. At first glance this might not seem like a big deal, but think about it this way: if you have 500 redirects, it’s actually 1000 pages to crawl.
And 5313623 redirects is actually 10627246 pages to crawl
And that’s only in the case of single redirects. But sometimes we can find long redirect chains, such as the one you can see below:
As you can see, there are six (!) redirects involved and the end result is a 404 error page. The funny thing is that Google probably won’t get to this 404 page, as it follows up to five redirects on a single URL.
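The arithmetic behind redirects and redirect chains can be sketched in Python. The redirect map below is a plain dictionary of made-up paths, and the limit of five hops is simply the figure quoted above, not something the code verifies about Googlebot.

```python
MAX_HOPS = 5  # the limit quoted in the article


def follow_chain(url, redirects, max_hops=MAX_HOPS):
    """Follow redirects from url; return (last_url, hops, gave_up)."""
    hops = 0
    seen = {url}
    while url in redirects:
        if hops == max_hops:
            return url, hops, True  # the crawler stops following here
        url = redirects[url]
        hops += 1
        if url in seen:
            return url, hops, True  # redirect loop
        seen.add(url)
    return url, hops, False


def total_requests(urls, redirects):
    """Each URL costs one request plus one per redirect hop followed."""
    return sum(follow_chain(url, redirects)[1] + 1 for url in urls)


# 500 single redirects cost 1000 requests.
singles = {f"/old-{i}": f"/new-{i}" for i in range(500)}
print(total_requests(singles, singles))

# A chain of six redirects: the final target is never reached.
chain = {f"/step-{i}": f"/step-{i + 1}" for i in range(6)}
print(follow_chain("/step-0", chain))
```

In the chain example, the function gives up at /step-5 after five hops, so the page at /step-6 (the 404 in the case described above) is never requested.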
You cannot avoid having redirected URLs pointing to your website from outside sources (in fact, you should use a 301 to make sure your content will still be accessible if the link isn’t up to date), but you have to make sure that after entering your website, the bot won’t encounter any redirected internal URLs.
How to take care of internal redirects
- Perform a full crawl of your website with one of the many tools available, such as Ryte, DeepCrawl, SiteBulb or Screaming Frog. Never used any of these? Then you should definitely visit our Beginners Guide To Crawling (and if you’re struggling with which crawler to choose, please read The Ultimate Guide to SEO Crawlers).
- After the crawl, identify the redirected URLs that the tool encountered, as well as the source page where the given link is placed. In Screaming Frog, you can use Bulk Export > Response Codes > Redirections (3XX) Inlinks for redirects, and the Redirect & Canonical Chains report to find redirect chains (and check out this link if you wish to learn how to export such data in Sitebulb):
- Update the links found on the source pages, so they all point to the destination URLs (HTTP status code 200) directly.
Bad Information Architecture vs Googlebot
A well-thought-out, logical structure of the website is hugely important for SEO. And I can list a number of fairly common IA issues that can have a huge impact on the crawl budget.
Having content copied over a number of pages not only results in a duplicate content issue, it can also impact crawling, as duplicate pages take space in the crawling schedule. There are several categories of duplicate content that are worth addressing:
- Unnecessary non-canonicals – using canonical links isn’t a bad thing by itself. In fact, it is recommended by Google as a way of dealing with duplicate content. However, you have to keep in mind that every canonicalized page is problematic (why crawl a duplicate when the bot might spend this time visiting a more valuable page?). Additionally, while visiting a duplicate, the bot has to check the canonical page to ensure it actually is duplicate content, resulting in yet another unnecessary set of requests sent to the server.
Which John Malkovich is canonical?
Therefore, instead of going crazy with canonicals, you should always ask yourself the following questions to see if the duplicate page is really necessary:
- Does it improve navigation?
- Does it serve any actual purpose?
- Would replacing it with the canonical page create any issues?
If all the answers are no, then maybe you should consider removing the duplicate page and replacing all the internal links with those pointing to the canonical one. In such a case, you should also remember to redirect (by using HTTP status code 301) the URL of the removed page to the original.
You should only leave duplicate, non-canonical pages in your website’s IA if it’s absolutely necessary. Similar principles apply to noindexed pages, as a large number of them can also affect the crawl budget.
Then there are random duplicates. Sometimes you might not even be aware that your website contains a number of duplicate pages. This might be the result of a bug, misimplementation, or it might simply be caused by the way a CMS handles URLs.
Such a problem can easily be identified by typing site:yourdomainname.com in Google search and digging in the index. Or just go to your GSC > Coverage > Excluded and look for duplicate content. If you are surprised by the number of pages you find, you first need to check their origin. Then, several actions should be taken:
- Deindex duplicates by placing noindex, follow meta tags in the code. Don’t block them in robots.txt yet, as this would prevent the bot from revisiting and deindexing the pages.
- Only after all the problematic pages have been deindexed should you block the appropriate path in robots.txt.
- After that, you should remove problematic pages (fix the issue that was producing duplicates) from the website, and redirect deleted URLs to the canonical version.
If you have duplicates that have never been indexed or didn’t get any links from outside sources, you can simply remove them and use status code 410, instead of redirects.
Remember that calendar issue I mentioned earlier? That calendar had a unique URL assigned to every month. And on every month page, there were just two links: one pointed to the previous month’s page and the other linked to the next. You could browse back to the birth of Christ (if you had the patience), or book a service for New Year’s Eve of 2100. As a result, Googlebot could be trapped in an infinite crawling process, simply by following the link to the next month. This is a perfect example of an infinite space.
Book our service for August 2682! Better do it now, while you still can.
If your website currently contains infinite spaces, you should first consider if you really need such pages. If not, after removing them, make them return HTTP status code 410. If those infinite spaces are necessary, you have to make sure that the bot won’t be able to crawl or index the pages:
- Place a meta robots noindex tag in the HTML code;
- If none of the pages were indexed, you can block the infinite spaces in robots.txt. If some pages were already indexed, first you have to wait until Google removes the pages from the index. Only then should you block the path in robots.txt.
Internal linking helps in efficient crawling
With a nice network of internal links, Googlebot will move through your website like Spider-Man!
Internal links create paths used by Googlebot to navigate your website, and a well-developed linking structure will ensure an efficient crawling of your valuable content. On the other hand, a lack of internal links can result in Google being reluctant to crawl certain sections of the website.
While designing an internal link structure, you should avoid common pitfalls:
- Linking to 404 error pages – you don’t want to send Googlebot to non-existing pages.
- Orphan pages – pages that are present in the sitemap, but haven’t been linked internally. Googlebot might decide not to visit them that often.
- Pages with a long click path – make sure your most important content is available no more than three clicks from your strongest page (which in most cases will be the homepage). We already know that for Google the place of a given page within a website’s architecture is far less important than the click path.
- Spammy links – often placed in the footer section or at the bottom of the page, dozens of links having keywords stuffed in the anchor text. Such links will be largely ignored by Googlebot and won’t add any value to your pages.
Visualizing the structure of internal links can help you identify the areas for improvement. You can learn how to do it using Gephi. However, some popular SEO crawlers such as Screaming Frog, SiteBulb and Website Auditor also offer such a feature.
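Two of the pitfalls above, long click paths and orphan pages, can also be checked with a few lines of Python once you have exported the internal links from a crawler. The sketch below runs a breadth-first search from the homepage. The link graph and the sitemap list are invented, and treating “not reachable from the homepage” as “orphaned” is a simplification of the definition given above.

```python
from collections import deque


def click_depths(links, start):
    """Breadth-first search: minimum number of clicks from start to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def audit(links, start, sitemap_urls, max_depth=3):
    """Return (pages deeper than max_depth, sitemap URLs never reached)."""
    depths = click_depths(links, start)
    too_deep = sorted(page for page, depth in depths.items() if depth > max_depth)
    orphans = sorted(set(sitemap_urls) - set(depths))
    return too_deep, orphans


# Invented link graph: page -> pages it links to.
links = {"/": ["/a"], "/a": ["/b"], "/b": ["/c"], "/c": ["/d"]}
sitemap_urls = ["/", "/a", "/d", "/lonely"]
print(audit(links, "/", sitemap_urls))
```

In this example /d sits four clicks away from the homepage and /lonely is listed in the sitemap without any internal link leading to it.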
While improving internal link structure, you should follow the best practices:
- Your most important pages should get the largest number of internal links;
- Link to related topics (in the case of articles) or products/categories (in an ecommerce store). Let your content be discovered;
- Contextual links inside the articles add value for both users and search engines (Googlebot will use anchor texts to better understand the website’s structure);
- Don’t over-optimize – the anchor text should be natural and informative, don’t stuff unnecessary keywords within;
- Don’t just stuff the page with links – make sure they add actual value for users.
You should also check our post about developing navigation on your website.
Bugs in the sitemap/lack of XML sitemap
If your website currently doesn’t have an XML sitemap, you should definitely build one according to the guidelines I described earlier in the article, and submit it to Google via GSC. Since it will help Googlebot in discovering new content and scheduling the crawl, you have to be sure that all your unique, indexable pages are listed in the file. Also, you should always keep your sitemap fresh and free of issues.
What SHOULDN’T be included in the sitemap file?
- URLs returning an HTTP status code other than 200;
- URLs of pages containing meta robots tags: noindex, follow or noindex, nofollow;
- URLs blocked by the robots.txt file;
- paginated pages.
The best way to investigate what issues are currently affecting your sitemaps is to use an SEO crawler. Most available tools (including Screaming Frog, SiteBulb, Ryte, Deepcrawl) will give you the option to analyze sitemaps while performing a full crawl.
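If you prefer to check the exported crawl data yourself, the rules above translate into a simple filter. The Python sketch below is only an illustration: the CrawledPage fields are hypothetical names for data you would take from your crawler’s export, and the sample pages are invented.

```python
from dataclasses import dataclass


@dataclass
class CrawledPage:
    """Hypothetical record built from a crawler export."""
    url: str
    status: int = 200
    noindex: bool = False
    canonical: str = ""  # empty means the page is self-canonical
    blocked_by_robots: bool = False
    paginated: bool = False


def sitemap_problems(page):
    """List the reasons why a URL shouldn't be in the sitemap (empty if fine)."""
    problems = []
    if page.status != 200:
        problems.append(f"returns status {page.status}")
    if page.noindex:
        problems.append("has a noindex meta robots tag")
    if page.canonical and page.canonical != page.url:
        problems.append(f"canonicalized to {page.canonical}")
    if page.blocked_by_robots:
        problems.append("blocked by robots.txt")
    if page.paginated:
        problems.append("paginated page")
    return problems


pages = [
    CrawledPage("https://example.com/offer"),
    CrawledPage("https://example.com/old", status=301),
    CrawledPage("https://example.com/offer?sort=price", canonical="https://example.com/offer"),
    CrawledPage("https://example.com/blog?page=2", paginated=True),
]
for page in pages:
    print(page.url, sitemap_problems(page) or "OK")
```

Any URL that comes back with a non-empty list is a candidate for removal from the sitemap file.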
After a long journey we have finally reached the end. I hope that at this point you have a good understanding of the crawling process. All the information you got from this article can be useful in real life, while actually working on a living website. If you keep to the best practices, you will ensure efficient crawling, no matter how many URLs your website has. And remember, the larger your website is, the more important the crawl budget becomes.