There is no SEO without crawling. If a crawler never visits your content, that content simply doesn’t exist in the search engine’s index, which is why crawling is paramount.
You also need to be mindful of your crawl budget – the number of pages Googlebot will regularly crawl, which depends on the size and cleanliness of your site as well as the number of links pointing to it.
A Different Crawling Experience
To get started, let me explain a couple of basic concepts. Search engines process your content in three basic steps:

1. Crawling
2. Indexing
3. Ranking

You can’t really get to #3 without #1, and so on. If you want to understand the difference between crawling and indexing, please read this great article by Barry Adams.
[It’s indexing, by the way, and not indexation. Indexation means something completely different. If you’re unsure of the difference, use indexation when writing about this subject on Twitter and then tag Barry Adams. You’re welcome.]
Crawling and rendering plain HTML is as simple as it gets. All the content is visible in the HTML itself, and the only “extra step” needed before indexing is executing the CSS to apply the styles. The whole pipeline looks like this:
- Download HTML & CSS in parallel.
- Wait for CSS to finish downloading and execute it.
- Render, and continue rendering as HTML downloads.
Now compare that with crawling and rendering a JavaScript-powered website:

- Download HTML (it’s tiny).
- Download CSS and JS in parallel.
- Wait for CSS to finish downloading and execute it.
- Wait for JS to finish downloading and execute it.
- In many cases, Single-Page Apps (SPAs) wait until this point to even start downloading data.
- Update the DOM & render.
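To make the difference concrete, here is a minimal sketch of how a Single-Page App typically injects its links client-side. The `/api/next-page` endpoint and the response shape are assumptions for illustration, not anything from the experiment. The point is that a crawler which doesn’t execute JavaScript will never see the `<a>` element, because it exists only in the rendered DOM, not in the raw HTML:

```typescript
// Minimal sketch of client-side link injection (hypothetical endpoint).
// Until this script has downloaded and executed, the raw HTML contains
// no <a> element at all - the link exists only in the rendered DOM.
async function renderNextLink(): Promise<void> {
  const res = await fetch("/api/next-page"); // data fetched only after JS executes
  const { url, label } = await res.json();   // e.g. { url: "/page1", label: "page1" }

  const link = document.createElement("a");
  link.href = url;
  link.textContent = label;
  document.body.appendChild(link); // only now is the link discoverable
}

renderNextLink();
```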
A New Experiment
In previous experiments with JavaScript-based pages, I noticed that Googlebot often failed to visit every URL, and I frequently had to fetch URLs manually to “invite” Googlebot to visit them. I never had such a problem with plain HTML pages.
This led me to create a new experiment (multiple experiments, actually, but we’ll get there).
The structure of all the websites in the experiment was the key to measuring crawling efficiency.
Each page had only one link, which looked something like this:
Homepage -> page1 -> page2 -> page3 -> page4 -> page5
Thanks to this structure, I could be sure that if, for instance, page5 was crawled, Googlebot must also have gone through the homepage, page1, page2, page3, and page4. There was no other way for Googlebot to find those pages (no sitemaps, etc.).
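For illustration, here is a minimal sketch of how such a chained test site could be generated as static HTML files. This is a reconstruction, not the exact setup from the experiment; the file names are assumptions:

```typescript
// Sketch: generate a chain of pages where each page contains exactly one link
// to the next page. File names (index.html, page1.html, ...) are assumptions.
import { writeFileSync } from "node:fs";

const DEPTH = 5;

function page(title: string, nextHref: string | null): string {
  const link = nextHref ? `<a href="${nextHref}">${nextHref}</a>` : "";
  return `<!doctype html>
<html>
  <head><title>${title}</title></head>
  <body><h1>${title}</h1><p>Unique content for ${title}.</p>${link}</body>
</html>`;
}

writeFileSync("index.html", page("Homepage", "page1.html"));
for (let i = 1; i <= DEPTH; i++) {
  const next = i < DEPTH ? `page${i + 1}.html` : null; // page5 ends the chain
  writeFileSync(`page${i}.html`, page(`page${i}`, next));
}
```

In the JavaScript variants of the experiment, the same link would instead be injected client-side, as in the earlier sketch, so a crawler could only discover the next page by rendering the current one.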
So how did Googlebot do?
The HTML-based pages were crawled and indexed within a couple of hours. Google Search Console started showing six indexed pages within a week or so.
The JavaScript-powered website was a different story. It’s been a few months, and even though I shared this website’s URL at many conferences and let the cat out of the bag, Google has still only indexed the website’s homepage plus one content page.
We repeated this experiment across different domains and in different configurations (including an SPA, which I will write about soon), only to see identical results each time.
Feedback from Google
After seeing the results of my experiment, I reached out to Googler Ilya Grigorik in search of some answers.
Ilya’s response was quite broad. What’s worth pointing out is that he mentions “there are factors at play” without going into detail about what they may actually be. It’s almost as if he’s insinuating that he knows what they are, but he certainly isn’t going to share them with me.
I quickly got an interesting reply from John Mueller:
What you’re seeing is probably (I ran pretty much the same test last year :)) mostly:
– Rendering takes a bit of time, so there’s a delayed effect when it comes to discovering links that are added via rendering. This is primarily a temporary effect, though; once discovered, it’s the same as any other link.
– Test sites are inherently artificial in nature, and our algorithms tend to get bored with them quickly. That makes this kind of test a bit hit & miss or look worse than it would be in practice.
Anyway, it’s cool to see these kinds of tests. Thanks for sharing them & the results!
John’s reply is interesting; however, I don’t think it solves the problem I’m seeing across all the test domains. John’s point that crawl budget is based on host load is logical, but if you’ve ever worked with a large domain, you already know how difficult it can be to get Googlebot to crawl your new content. I wish crawl budget were based on host load alone. I also find it telling that John mentioned running a similar test himself, yet he isn’t sharing the results.
This is one of those articles that I’ve been putting off writing forever, as I’ve been constantly searching for more data. As I said at the beginning, each experiment provides an answer along with even more questions. Are there “other factors at play” here? Are we dealing with algorithms that “tend to get bored”? Or is there something else we’re missing?
I will continue to explore this topic and will, of course, share any new discoveries my research brings to light.
January 8, 2018 Update:
After publication, John Mueller reached out to us to clarify some of the points addressed in this article. So take it away, John:
“Hi! I just thought I’d send you a short note regarding [your article] – nice summary & graphics, btw! I think conflating crawl budget with what you’re seeing is wrong & kinda distracts from your message. Crawl budget (from our POV, & from what people generally view it as) is the # of URLS fetchable from a server/time, which includes JS, CSS, etc. It’s a technical limitation of how much can potentially be fetched from a server, and that doesn’t change depending on the type of content that it is. What you’re seeing is a change in rate of *indexing* from following links – there I agree that things are slower with sites that require JS in order to see the links. The difference is subtle, but if you know what kind of problem a site has (crawl rate vs. indexing rate), the actions you could take to fix the problem are very different.”
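John’s distinction suggests a practical diagnostic: crawl rate is visible in your server logs, while indexing rate is visible in Search Console. Here is a rough sketch of the first half of that check, assuming an access log named “access.log” in a standard combined format; note that matching on the user-agent string alone is naive, since it can be spoofed:

```typescript
// Rough sketch: count Googlebot fetches in an access log to gauge crawl rate.
// Assumes "access.log" in combined log format; matching the user-agent string
// alone is naive and spoofable - verify IPs via reverse DNS before trusting it.
import { readFileSync } from "node:fs";

const lines = readFileSync("access.log", "utf8").split("\n");
const googlebotHits = lines.filter((line) => line.includes("Googlebot"));

console.log(`Googlebot fetched ${googlebotHits.length} URLs in this log.`);
// A healthy fetch count here, combined with few indexed pages in Search
// Console, points to an indexing-rate problem (e.g. JS-dependent links)
// rather than a crawl-budget one.
```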