We Need to Redefine the Crawl Budget


Previously, we redefined JavaScript websites, and in the same spirit of helping the SEO community, it’s time to reevaluate how SEOs view the crawl budget.

Google has made huge leaps in rendering and indexing JavaScript content over the last year. 

First of all, Googlebot is now regularly updated to use the latest Chromium engine for rendering. Considering that before this update Googlebot was running a four-year-old version of the rendering engine, it’s truly a massive step forward.

What this means for webmasters is that they no longer need to be afraid of using the latest JavaScript syntax – most modern features that the user-facing Chrome browser can render can now also be rendered by Googlebot.

Before, developers had to provide fallback content for the Google crawler, as it couldn’t otherwise render pages that used web components and other new JavaScript features.

Secondly, it was announced during the Chrome Dev Summit in November 2019 that the median delay between Googlebot crawling a page and rendering it is now 5 seconds. A year earlier, Googlers had said it could take up to a week.

But some pages that are in Google’s index still don’t have their JavaScript content indexed weeks after the first crawl – simply because of the limited crawl budget.

So it seems that we need to reinterpret the definition of the crawl budget.

The Past

For years, optimizing the crawl budget meant, for the most part, dealing with index bloat. 

Particularly when it comes to large websites, Googlebot is often forced to crawl through tons of worthless content to find the most important pages. 

When an e-commerce website’s navigation is based on adding parameters to URLs, a couple of dozen product category pages can turn into thousands of duplicate URLs for Googlebot to crawl.
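To get a feel for the scale, here’s a minimal sketch in Python – every facet name and value is invented for illustration – showing how quickly parameter-based navigation multiplies the URLs Googlebot would have to crawl:

```python
from itertools import product

# Hypothetical facets – the names and values here are made up for illustration.
categories = [f"/category-{i}" for i in range(1, 25)]   # a couple of dozen category pages
sort_options = ["price_asc", "price_desc", "newest", "rating"]
colors = ["red", "blue", "green", "black", "white"]
page_sizes = ["24", "48", "96"]

urls = {
    f"https://example.com{path}?sort={sort}&color={color}&per_page={size}"
    for path, sort, color, size in product(categories, sort_options, colors, page_sizes)
}

# 24 categories x 4 sort orders x 5 colors x 3 page sizes = 1,440 crawlable URLs,
# all serving roughly the same couple of dozen product listings.
print(len(urls))  # 1440
```

Add one more filter and the count multiplies again – which is exactly how a modest category tree turns into thousands of near-duplicate URLs.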

To mitigate this, SEOs work with business owners to make sure that all the important content is indexed despite the duplicate pages issue. 


This work also involves the crawl rate limit and the crawl demand.

Crawl Rate Limit

This factor primarily depends on a given website’s server health. Since “Googlebot is designed to be a good citizen of the web,” it adjusts its crawling rate based on how the server reacts to continuous requests.

To avoid crashing the server and ruining the experience of users visiting the website, it will limit the crawling rate when the server responds poorly. Similarly, the crawling rate will go up if the server has no problem handling intense robot activity.

Alternatively, webmasters can manually decrease the crawling rate using Search Console.

The takeaway: Make your server and your page as fast and reliable as possible, and Googlebot will be able to crawl as much as it deems necessary.
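One practical way to keep an eye on this is to look at how your server actually responds to Googlebot. Below is a rough sketch, assuming an Nginx/Apache-style access log at a hypothetical path; a proper check would also verify the requests via reverse DNS rather than trusting the user-agent string:

```python
import re
from collections import Counter

# Rough sketch: count Googlebot requests and 5xx responses in an access log.
# Assumes the common/combined log format and a hypothetical log path.
LOG_PATH = "/var/log/nginx/access.log"
STATUS_RE = re.compile(r'" (\d{3}) ')  # status code right after the request line

statuses = Counter()
with open(LOG_PATH, encoding="utf-8", errors="ignore") as log:
    for line in log:
        if "Googlebot" not in line:  # naive filter – verify with reverse DNS in practice
            continue
        match = STATUS_RE.search(line)
        if match:
            statuses[match.group(1)] += 1

total = sum(statuses.values())
errors = sum(n for code, n in statuses.items() if code.startswith("5"))
if total:
    print(f"Googlebot requests: {total}, 5xx responses: {errors} ({errors / total:.1%})")
else:
    print("No Googlebot requests found in the log.")
```

A rising share of 5xx responses is exactly the signal that makes Googlebot back off, so it’s worth catching early.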

Crawl Demand

Crawl demand is the priority that Google assigns to crawling a given website. The two main factors influencing crawl demand are popularity and staleness.

Here’s how Gary Illyes from the Google Search team defined those:

A quote from Google's official documentation on the crawl budget

Straightforward, right? Well, the SEO community picked these two points apart and came up with several different interpretations of what Google could have actually meant. 

To give you an example, Gary Illyes stated that popular URLs tend to get crawled more often. Some people assumed that popularity reflects traffic volume. Others concluded (with some prehistoric evidence) that a page’s popularity is reflected by internal and external linking, as well as the number of keywords a given URL ranks for.

It’s also unclear what exactly Google meant by saying, “our systems attempt to prevent URLs from becoming stale in the index.” Do they factor in the time since the URL was last crawled, or do they have a way of predicting changes in the content? Maybe both are used, among many other variables…

The Present

Every content creator is led to believe that creating great content that offers users what they truly need is the main prerequisite for doing well in search. And it’s true, for the most part.

But in some cases, it may not be enough. Even if you did your best to provide content that’s optimized in every way and you expect it to gain traction, search engines may see it differently. Not because they aren’t fond of your writing, but because they may not even see your content. 

The only way to get organic traffic is to get your pages crawled and indexed, and as we’ve observed, you can never be completely sure when that will happen.

While some don’t really mind appearing in the search results only after a couple of hours, for others, being 30 minutes late means handing all the potential traffic to a faster competitor.

We have data showing that many websites, even some of the most popular ones, wait many hours on average to get their content indexed by Google. Worse still, it seems that some pages never get indexed at all. This problem is particularly common among large websites that depend on JavaScript to some degree.

But JavaScript itself might not be the root of the problem – it could be the crawl budget.

The Future

In light of this, it seems that the definition of the crawl budget should be expanded.

With the rise of JavaScript as the irreplaceable building block of the web, optimizing the crawl budget has to become much more technical. 

It often happens that seemingly well-optimized pages struggle with a low crawl budget, and webmasters can’t figure out what’s wrong. In that case, JavaScript SEO could be the missing piece of the puzzle.

Why is this an issue?

Take the time to browse through a couple of your favorite websites with JavaScript rendering turned off. Do all of them offer the same value in the plain HTML version?

A page that relies on JavaScript means additional requests that Googlebot needs to send to the server before it can index that page. Every JavaScript file that needs to be downloaded and rendered counts as a separate URL against the crawl budget. And when you merge JavaScript files to mitigate that problem, the bundles may become too heavy for Googlebot, so it takes a lot of experience to find the right balance.
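If you want a quick sense of how much a given page leans on JavaScript, a simple check is to look at the raw HTML the way a plain crawl would see it. The sketch below – with a placeholder URL and key phrase to swap for your own – counts the external scripts a page references and checks whether a critical piece of content is already present before any rendering:

```python
import re
import urllib.request

# Placeholders – replace with a real page on your site and a phrase from its main content.
URL = "https://example.com/some-product-page"
KEY_PHRASE = "Add to cart"

# Fetch the raw HTML without executing any JavaScript, roughly what a plain crawl sees.
request = urllib.request.Request(URL, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(request, timeout=10) as response:
    html = response.read().decode("utf-8", errors="ignore")

# Count external scripts the page would additionally need to load to be rendered.
scripts = re.findall(r'<script[^>]+src=["\']([^"\']+)["\']', html, flags=re.IGNORECASE)

print(f"External scripts referenced: {len(scripts)}")
print(f"Key phrase present without rendering: {KEY_PHRASE.lower() in html.lower()}")
```

A rendered-versus-raw comparison with a headless browser would be the fuller test, but even this quick check can reveal pages whose main content only exists after JavaScript runs.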

As a consequence, it is never guaranteed that Googlebot will render the JavaScript on every page it crawls.

A visual representation of the vicious cycle of the low crawl budget

And if it isn’t rendered, the website may become trapped in the vicious cycle of the low crawl budget:

  1. A web page depends on JavaScript to be rendered. 
  2. JavaScript doesn’t get rendered because of the low crawl budget.
  3. Googlebot only discovers parts of the page.
  4. Google assumes that the page is of low quality.
  5. The crawl budget gets further lowered.
  6. And back again…

Wrapping up

Google’s official definition of the crawl budget – “the number of URLs Googlebot can and wants to crawl” – is still perfectly valid. However, for SEOs, the crawl budget has always been a working definition rather than a concrete metric – the aggregate of issues that influence how often and how thoroughly a website is crawled by search engines.

Our data shows that JavaScript usage should be considered one of the primary factors influencing the crawl budget.

And if you care about optimizing your crawl budget, feel free to reach out for a JavaScript SEO audit. Still unsure about dropping us a line? Read how technical SEO services can help you improve your website.