This is not an article about one of the hundreds of technical issues that can wait in your backlog for weeks before they get addressed.
This SEO issue is a critical business problem for your entire organization.
Indexing is a necessary step before your website can be shown on Google.
And on average, 16% of valuable pages on popular websites aren’t indexed.
Here are some examples:
- Walmart.com: 45% of product pages are not indexed.
- dictionary.cambridge.org: 99.5% of pages are not indexed.
These websites are large. But it doesn’t mean that smaller websites aren’t at risk:
- Even a small website can have indexing issues because of technical problems (e.g., sylvesterstallone.com).
- There are websites with unique content that have indexing issues (e.g., victoriassecret.com).
This is the Ultimate Guide to Indexing SEO. It compiles everything that I learned from 5+ years of studying this topic and running experiments.
I wrote this guide to help you understand why some of your website’s pages may not be indexed by Google and provide you with the solutions required to fix this serious problem.
The Google index
Organic traffic is the backbone of online business, but you won’t get any if Google doesn’t index your content.
To understand what indexing is and why Google doesn’t properly index some websites, we need to understand exactly what Google’s index is and how it works.
Google has a great analogy: the Google index is like a library's index, which lists information about all the books the library has available.
Essentially, the Google index is a database of web pages that Google knows about. Once these pages are indexed, Google can use the information it has about them and their content to decide to show them in search results.
The concept is fairly simple. But the road to getting indexed is complicated.
Google’s indexing pipeline
First, Google has to discover a URL. As it crawls the web, Google extracts links from newly discovered pages. New pages can be discovered in multiple ways: by following links found on other pages, through XML sitemaps, or via inbound links pointing at them.
Then, Google has to visit the page. Google has sophisticated algorithms that determine which URLs should be prioritized, and Googlebot visits the pages that meet the priority threshold.
Finally, Google extracts the content of a page. Google evaluates quality and checks if the content is unique. Also, this is the step when Google renders pages to see all of their content, evaluate their layout, and various other elements. If everything is fine, the page gets indexed.
This is a fairly simplified breakdown – each of these steps actually consists of additional stages – but these are the crucial steps.
After your page gets through these phases and is successfully indexed, only then can it be ranked for relevant queries and shown to users, bringing organic traffic to your website.
One exception is when you intentionally prevent Google from visiting your page using your robots.txt file, making it impossible for Google to crawl it. Google can then still index the page using a link found on a different page. That being said, you won’t likely get lots of traffic to that page from Google because it won’t know what the page contains and won’t know if it’s relevant to users.
Here’s an example of that happening in the wild with one of Google’s own products.
In this case, Google blocked its own robot, Googlebot, from crawling all pages on the Google Jamboard subdomain.
But Googlebot was still able to find links to Jamboard pages on other websites and used these links for indexing.
This case highlights something vital.
Notice that the indexed home page of Google Jamboard has no description displayed inside the snippet. That’s because Googlebot wasn’t able to access it and relay that information to the index.
As a website owner, you need to make sure that Googlebot can access as much content on your site as possible. Otherwise, Google will have limited information on what your page is about, and your search visibility will suffer.
Does Google index all pages?
The answer is clear: No.
In the past couple of years, I ran the numbers multiple times, using a database with thousands of different websites.
On average, 16% of valuable, indexable pages on popular websites aren’t indexed. Ever.
And it’s no secret. Google openly admits their goal is not to index every single page on the web. Google’s John Mueller had this to say on the topic:
You might say, “Okay, Google just doesn’t index everything, so I guess if some of my valuable pages aren’t indexed, it’s not a big deal.”
But I think this is the wrong approach. There are actually many large sites that Google can fully index.
You can do various things to help Google index more pages on your website, and you should.
Every other SEO effort you make on your website will have a diminished ROI if you still have unindexed content.
How long does it take for Google to index a page?
As I already showed you, many pages simply don’t get indexed by Google, and even more don’t ever get crawled.
To make things worse, it's common for indexing to happen with a significant delay.
We track the indexing of many popular websites. This allows us to observe how long it takes for Google to index new pages on average (and remember, we’re skipping the pages that never get indexed here).
These statistics show how common indexing delays are:
As you can see:
- Google indexes just 56% of indexable URLs within 1 day of publication.
- After 2 weeks, still only 87% of URLs are indexed.
Google has a sophisticated system of managing how it crawls websites.
Some websites are crawled more frequently, and some websites are visited less frequently. In the short term, you cannot influence it, but there are many things you can do to improve your standing in the long run. We’ll talk about them later.
There’s one more indexing issue that I’ve studied extensively, and this one is the most difficult to define and address. I call it partial indexing.
While I consider it an indexing issue, an argument can be made that it’s also a ranking issue.
Here’s what it’s all about:
Sometimes a page gets indexed by Google, but parts of the content of that page don’t. My research shows that these unindexed content fragments don’t contribute to the page’s rankings.
They can’t be found when you specifically search for them, and they seem to not contribute to the page’s overall rankings.
Sometimes, these content fragments are less important, for example, related items/products.
But quite often, it’s the main content of the page, like the main product description on a product page of an eCommerce site.
| Website | % of indexed pages with main content not indexed | Additional notes |
|---|---|---|
| aboutyou.de | 37% | On mobile, product details are hidden under tabs. |
| walmart.com | 45% | On mobile, product details are hidden under tabs. |
In my opinion, the most common cause for partial indexing is duplicate content.
The websites shown above commonly use the manufacturer's product description, and it seems Google is filtering it out in the indexing/ranking phase.
Why is indexing a challenge?
So, why doesn’t Google just index every page it knows about?
The web is growing
The basic reason is that the web is simply too big. And it’s still growing.
According to WorldWideWebSize, the indexed web alone contains over 5 billion pages as of March 2021.
And most of those pages aren’t exactly valuable to Google’s users. The web is full of spam, duplicate content, and harmful pages that contain malware and phishing content.
Google has learned to avoid crawling those pages, let alone indexing them.
Websites are getting heavier
An average website is getting heavier each year.
While this offers new possibilities to users, Google needs to render all that heavy code and process all that media to understand what a given page is about.
As all of these challenges only get more serious, we should expect Google to be even pickier when indexing content in the future.
Because the web is too big for Google to index fully, Google has to choose which pages it wants to index.
And, obviously, Google wants to focus on quality pages. So Google's engineers developed mechanisms for avoiding the crawling of low-quality pages.
This means that Google may skip crawling some of your pages because, having seen your other content, it assumes they are low-quality pages.
In this scenario, your pages drop out of the indexing pipeline right at the beginning.
We’re trying to recognize duplicate content in different stages of our pipeline. On the one hand, we try to do that when we look at content. That’s kind of like after indexing – we see that these two pages are the same, so we can fold them together.
But we also do that, essentially, before crawling, where we look at the URLs that we see and, based on the information that we have from the past, we think, “Well, probably these URLs could end up being the same, and then we fold them together.”
Source: John Mueller
The data available thanks to Google Search Console confirms this is happening very often. “Discovered – currently not indexed” is one of the most common indexing issues, and it’s usually caused by:
- Low quality (Google detected a common pattern and decided not to waste resources crawling low-quality or duplicate content).
- Insufficient crawl budget (Google has too many URLs to crawl and process them all).
I spoke more about my research on Google Search Console’s most common indexing issues in my article over at SearchEngineJournal.
Assigning priority to URLs
Google's crawl-scheduling patents suggest that Google assigns a crawling priority to every URL before it's crawled. But more importantly, they state that less important URLs are rejected and may never get crawled!
According to the same patents, the priority assigned to URLs can be determined by two factors:
- A URL’s popularity,
- Importance of crawling a given URL for maintaining the freshness of Google’s index.
Google’s “Minimizing visibility of stale content in web searching including revising web crawl intervals of documents” patent talks about the factors that define a given URL’s popularity: view rate and PageRank.
But there’s one more factor that may cause Google to give up crawling your URLs – your server. If it responds slowly to crawling, the priority threshold that a URL needs to meet is increased:
“This probability estimate is based on the estimated fraction of requested URL crawls that can be satisfied. The fraction of requested URL crawls that can be satisfied has as the numerator the average request interval or the difference in arrival time between URL crawl requests.”
So what can you do with all that information? How can you improve the chances that all your URLs are assigned a high priority and crawled by Googlebot without hesitation?
- You need to make the most out of internal linking to make sure new pages have enough PageRank.
- Just having an XML sitemap isn’t nearly enough if you’re hoping to get your new pages indexed quickly.
- Having tons of low-quality content may negatively impact other pages on your domain.
When indexing issues are not your fault: Google’s indexing bugs
Google Search is a truly complex mechanism, made of hundreds (and maybe even more) interconnected algorithms and systems. Some of the smartest programmers and mathematicians work there.
However, like every piece of software, it has some bugs.
To my knowledge, the most famous indexing bug happened on October 1st, 2020.
We are currently working to resolve two separate indexing issues that have impacted some URLs. One is with mobile-indexing. The other is with canonicalization, how we detect and handle duplicate content. In either case, pages might not be indexed….
— Google SearchLiaison (@searchliaison) October 1, 2020
It was really rough because Google had removed the Request Indexing feature from the Google Search Console just a day before.
After 2 weeks, it was announced that the canonical issue was effectively resolved, with about 99% of the URLs restored.
Let me point to another interesting example of Google’s indexing bug.
One of the most popular publishing websites in the SEO branch, Search Engine Land, once got totally deindexed by Google.
Search Engine Land got deindexed because… Google systems wrongly detected that the website had been hacked.
Normally, Google informs website owners about detecting such issues through Google Search Console. However, the team at SEL didn’t receive any notifications in GSC nor by email.
What I’m trying to say by talking about these cases is that indexing is a very complex system and that bugs will happen now and then.
If something goes wrong with most of the things that it’s supposed to do, that will show downstream in some way. If scheduling goes awry, crawling may slow down. If rendering goes wrong, we may misunderstand the pages. If index building goes bad, ranking & serving may be affected
— Gary 鯨理／경리 Illyes (@methode) August 11, 2020
Diagnosing your website’s index coverage
As the first step of your indexing journey, you should check your website’s indexing statistics.
You HAVE to know how many pages are not indexed and why.
Use Google Search Console
The best way is to use the Google Search Console because it has the most accurate data.
- Log in to GSC and select a property
- Click on Index->Coverage.
The report is divided into intuitive categories:
- Error (pages that could not be indexed)
- Valid with warnings (indexed pages that need your attention)
- Valid (indexed pages)
- Excluded (URLs that are not indexed)
You will quickly notice how many pages on your site are indexed. You can further narrow down the report to see a sample of indexed pages.
You can easily use this report to diagnose indexing issues. I wrote an article about that.
You will easily discover how many pages are not indexed because of duplicate content, quality issues, server errors, and more.
GSC is a treasure for everyone with a website.
Don’t use the “site:” command
I don’t recommend using the site: command to check your index coverage.
Some people use this command to find out how many pages Google indexed from their website.
However, this is not an accurate method. More importantly, it won’t tell you why some pages may not be indexed. Google Search Console will.
That doesn’t mean this command is not useful.
You can use it to get a rough estimate of how many pages your competitors have in Google’s index. Just remember, it’s not very accurate!
How to make sure your pages get indexed by Google
You now know that Google’s index is a complex system of interconnected algorithms.
Things can go wrong at each step of the indexing pipeline, and it may not even be your fault.
But there are things you can do to maximize your chances of getting indexed by Google.
1. Make sure the page is indexable
There are three things you need to look at to check if a page is indexable.
- The page can’t have the noindex tag
- The page can’t be blocked by robots.txt
- The page can’t have a canonical tag pointing to another page.
Let’s dig in.
Googlebot is a good citizen of the web.
If you tell Google: “Hey, don’t index this page,” the page won’t be indexed. And there are many ways to do that.
The most commonly known is the “noindex” directive.
It's a directive telling Google that it can visit a page, but that the page shouldn't be included in the Google index.
There are two ways of using the noindex directive:
- You can place it in the X-Robots-Tag HTTP header
- You can place it in the source code with the classic <meta name="robots" content="noindex" /> tag
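For reference, the two variants look like this (the surrounding markup is illustrative):

```http
HTTP/1.1 200 OK
X-Robots-Tag: noindex
```

```html
<head>
  <meta name="robots" content="noindex" />
</head>
```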
The robots.txt file can be used to give instructions to various web crawlers, telling them whether or not they should access your website or its parts.
You can use robots.txt to tell Google not to crawl a page or multiple pages on your site using the disallow directive.
This blocks Google from visiting the page and seeing its content (though, as shown earlier with Google Jamboard, a blocked URL can still end up indexed through external links).
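For example, a robots.txt blocking Googlebot from a couple of sections might look like this (the paths are hypothetical):

```
User-agent: Googlebot
Disallow: /internal-search/
Disallow: /cart/
```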
Finally, you shouldn’t expect Google to index your page if it has a canonical tag in its source code pointing to a different page.
Canonical tags are a way to let Google know about your preferred version of a page when there are many duplicate or near-duplicate versions of the same page on your website.
They come in handy when, for whatever reason, you have duplicate content on your site but want to consolidate ranking signals and let Google index and rank the one master version of the page.
It follows that if a page on your website has a canonical tag pointing to a different page, Google won’t index it.
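For example, a duplicate page declares its master version with a single tag in its head (the URL is illustrative):

```html
<link rel="canonical" href="https://example.com/master-version/" />
```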
How to check noindex, robots.txt directive, and canonical tag all at once
Manually inspecting a page for the three factors mentioned above is time-consuming. Moreover, it’s error-prone!
So when you quickly want to check if a page is indexable, use the SEO Minion plugin. It’s available for Chrome and Firefox.
SEO Minion will inform you about the reasons why a given page is not indexable.
If you want to check a larger number of URLs, the best way is to use an SEO crawler like Screaming Frog.
First – set the Mode to “List.”
Second, paste the list of URLs to the tool.
Then click “Start.”
Once the check is done, check the indexability column. You will see two self-explanatory results: Indexable / Non-Indexable.
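If you'd rather script a rough version of this check yourself, the three factors can be approximated with Python's standard library. This is a simplified sketch with naive regex parsing (the function name and markup handling are my own), and a real check would also have to consider the rendered page and HTTP status codes:

```python
import re
from urllib.robotparser import RobotFileParser

def is_indexable(url, html, response_headers, robots_txt):
    """Rough indexability check: noindex, robots.txt, canonical.

    Sketch only -- the regexes are order-sensitive and the header
    lookup is case-sensitive, unlike a real crawler.
    """
    # 1. noindex in the X-Robots-Tag header or in a robots meta tag
    if "noindex" in response_headers.get("X-Robots-Tag", "").lower():
        return False, "noindex (X-Robots-Tag header)"
    if re.search(r'<meta[^>]+name=["\']robots["\'][^>]*noindex', html, re.I):
        return False, "noindex (meta tag)"

    # 2. blocked by robots.txt for Googlebot
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch("Googlebot", url):
        return False, "blocked by robots.txt"

    # 3. canonical tag pointing at a different URL
    m = re.search(
        r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)',
        html, re.I)
    if m and m.group(1).rstrip("/") != url.rstrip("/"):
        return False, f"canonicalized to {m.group(1)}"

    return True, "indexable"
```

Feed it the page's URL, HTML, response headers, and robots.txt content, and it returns an Indexable / Non-Indexable verdict with the reason.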
Now you should know if your pages are indexable. Congrats!
But this is only the beginning.
2. Help Google crawl your website more efficiently
Google should be able to find links to your important pages just by crawling your website.
However, it gets more complicated when you have a huge website with thousands of pages. There are a couple of ways in which you can help Google discover your URLs and crawl them faster.
The XML Sitemap is a file that should contain links to all the indexable pages of your website.
Here’s what Google has to say about sitemaps:
So you can use sitemaps to inform Google about the pages that you definitely want to be indexed.
Furthermore, you can use it to let Google know when your pages were changed using the <lastmod> parameter, and if there are alternate versions (e.g., when you have multiple language versions, you can use the hreflang tag in the sitemap to point Google to variants of the same page).
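Here's what a minimal sitemap entry using both <lastmod> and hreflang annotations might look like (the URLs are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/us/product</loc>
    <lastmod>2021-03-01</lastmod>
    <xhtml:link rel="alternate" hreflang="en-us"
                href="https://example.com/us/product"/>
    <xhtml:link rel="alternate" hreflang="en-gb"
                href="https://example.com/uk/product"/>
  </url>
</urlset>
```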
| Sitemap attribute | Is it supported in Google? |
|---|---|
| <loc> | Yes |
| <lastmod> | Yes, if it's consistently accurate |
| <changefreq> | No, it's ignored |
| <priority> | No, it's ignored |
Note that if you overuse the <lastmod> parameter, Google may end up ignoring it.
Only put valuable URLs in the sitemap!
As I mentioned earlier, sitemaps help Google to crawl your website more intelligently.
But if you misuse them, they may actually hurt your site.
Let me show it to you with an example: GoodReads, a very popular brand.
I checked their index coverage, looking at a sample of their URLs from a sitemap.
It turned out that just 35% of their product pages are indexed. I was shocked, as I know that it’s a very high-quality website. I use it myself, and I love it.
Then I noticed that the sample I checked didn’t include any books. So I decided – let’s download all their sitemaps.
The result: there were no book pages in their sitemaps.
Why is it a bad sign?
Google may prioritize URLs found in sitemaps and skip visiting book pages that are actually the most valuable.
You should ensure that sitemaps only list canonical, valuable pages.
Create and submit your sitemap
After you create a sitemap, you should submit it to Search Console’s Sitemaps Tool.
Google might find it on its own, but that can take time.
When it comes to creating a sitemap, it’s effortless.
You don’t have to create the sitemap file on your own. There are many dedicated tools for that.
For instance, YoastSEO generates it automatically for you if you’re using WordPress. Most SEO Crawlers also offer that feature.
Of course, you can also create a sitemap file on your own, but remember to update it regularly, or you’ll run into trouble.
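If you do roll your own, the generation itself is a few lines with Python's standard library. A sketch (the function name is mine; real sitemaps must also respect the 50,000-URL / 50 MB limit per file, or be split and referenced from a sitemap index):

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def build_sitemap(pages):
    """Build a minimal XML sitemap from (url, lastmod) pairs."""
    urlset = Element("urlset",
                     xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in pages:
        url = SubElement(urlset, "url")
        SubElement(url, "loc").text = loc
        if lastmod:  # only include <lastmod> when you actually know it
            SubElement(url, "lastmod").text = lastmod
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            + tostring(urlset, encoding="unicode"))

sitemap_xml = build_sitemap([
    ("https://example.com/", "2021-03-01"),
    ("https://example.com/about", None),
])
```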
URL Submission tool
If you want Google to index your page quickly, you can use the URL Inspection tool in Google Search Console.
To do so, inspect the page in the URL Inspection Tool and click “Request Indexing.”
In the past, this tool was reliable and quick – it worked like a charm.
Once you requested indexing, Google would index the page within 5 minutes. It would even index some low-quality content that you’d otherwise have a hard time getting indexed.
But things changed. Now, indexing takes time, even when you use the URL Submission feature.
So, if you want Google to index your website really fast, you shouldn’t rely on it.
And this feature is just not good enough if you have hundreds of pages that you want to be indexed because there’s a daily limit of URLs you can submit per GSC property.
Rather, you should follow our Indexing Framework.
As a side note, if you want a new page or just a piece of information to get indexed really fast, publish it on social media. Tweets usually get indexed blazingly fast.
Just like Bing, Google has an Indexing API. You can use it to ping Google about URLs added, removed, or changed and “force” Google to discover your content more quickly.
Google documentation suggests that it’s quicker than if you used other ways of submitting URLs.
Sounds too good to be true, right?
Yeah, there’s a catch.
For now, you can submit only two types of pages:
- Pages with JobPosting structured data.
- Pages with BroadcastEvent embedded in a VideoObject.
For websites with many short-lived pages like job postings or livestream videos, the Indexing API keeps content fresh in search results because it allows updates to be pushed individually.
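If your pages qualify, a call to the API boils down to POSTing a small JSON notification. Here's a minimal Python sketch of building that request (the endpoint and body shape come from Google's Indexing API documentation; OAuth authentication with a service account is deliberately omitted):

```python
import json

# Publish endpoint from Google's Indexing API documentation.
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

def build_notification(url, deleted=False):
    """Build the JSON body for a single URL notification.

    Actually sending it requires an OAuth 2.0 token for a service
    account with the https://www.googleapis.com/auth/indexing scope.
    """
    return json.dumps({
        "url": url,
        "type": "URL_DELETED" if deleted else "URL_UPDATED",
    })

body = build_notification("https://example.com/jobs/1234")
```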
The future of indexing?
Google’s Indexing API is limited to 2 types of pages.
However, Google has been flirting with the idea of letting the Indexing API work for all pages. Wix and YoastSEO were the companies that helped Google run these tests.
The future of the tool is unknown. However, I know that Bing’s Indexing API lets website owners submit URLs without any restrictions, and it seems that it works for them.
Here's what Christi Olson, who is currently Head of Search Advertising at Microsoft Bing, had to say about Indexing APIs. She (and her team) believe that URL submission helps improve crawling efficiency.
An essential aspect of SEO that has a direct effect on indexing is internal linking.
It should be clearly stated that having a URL in the sitemap is not enough to ensure that Google can crawl and index it.
I go by two rules when it comes to internal linking:
- Avoid infinite scroll.
- Don’t have canonical tags pointing to the first page of pagination.
Of course, there are exceptions to these rules. But if you aren’t sure if what you’re doing will work, stick to my rules!
Get a grip on your internal linking
Based on my experience, the following situation is widespread: a page is in the sitemap but cannot be found in your website’s structure. We call pages like that orphan pages.
One of the tools you can use to find orphan pages on your site is Sitebulb. It does a great job, using your XML sitemap as a reference along with data from Google Analytics and Google Search Console.
It will provide you with a list of orphaned pages (ones that it found in the sitemap or elsewhere but couldn’t reach by clicking around your site).
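Conceptually, orphan detection is just a set difference between what your sitemap lists and what a crawler can actually reach by following links. A tiny Python sketch (with made-up URLs):

```python
def find_orphans(sitemap_urls, crawled_urls):
    """Return sitemap URLs that were never reached by following links."""
    return sorted(set(sitemap_urls) - set(crawled_urls))

# Hypothetical example: /old-landing-page is in the sitemap,
# but no internal link points to it.
orphans = find_orphans(
    sitemap_urls=["/", "/products", "/old-landing-page"],
    crawled_urls=["/", "/products"],
)
```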
Ideas to boost your internal linking
You might be looking for ways to improve your internal linking and help Google crawl and index your site more thoroughly.
Here are some ideas to look into:
- Related products tab
- Most popular items
- Blog posts.
Writing quality content aligns perfectly with your goal of improving internal linking while also giving you a chance to earn some external links. It’s a win-win!
Google's rendering has improved over the years, but for a long time it used an extremely outdated browser (Chrome 41) for rendering; Googlebot now runs an evergreen version of Chromium.
Unfortunately, Googlebot doesn’t scroll or click the buttons. The only way to let Google see the second page of pagination is to use proper <a href> links.
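For example, a paginated category needs a plain link rather than a script-driven button (the markup and function name here are illustrative):

```html
<!-- Googlebot won't click this button, so page 2 stays undiscovered: -->
<button onclick="loadMoreProducts()">Load more</button>

<!-- Googlebot can follow a plain link: -->
<a href="/category?page=2">Next page</a>
```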
Bad internal linking can hurt your site
Back in 2019, we took a look at Verizon’s website.
55% of their product pages were not indexed in Google.
I mentioned related items as one of the strategies you can use to boost your internal linking. But there’s a catch.
We commonly see that when your related items aren’t really related, Google might not index them.
I spoke about this very issue last year with Martin Splitt, a web developer advocate at Google. We talked openly about the sample I used for my tests and the methodology of our experiments.
Martin was surprised by the stats and offered his own theory (he didn't have any data to share at that time): in most cases, the rendering phase is perfectly fine, but then something in the background prevents the content from being indexed.
He used an example of a shop selling accessories for cats, and some of the “related items” aren’t for cats but dogs.
With this hypothesis in mind, if Google notices the related items are unrelated, they may be skipped from indexing, meaning that Google won’t see those links.
If that’s the case, it has strong implications. If an online store has a poor suggestion system for related items, it loses on two levels:
- First, you lose the opportunity to advertise relevant products to your customers.
- Second, Google may not index your internal links, which weakens your PageRank flow and your website's structure.
Some people get fixated on acquiring external links in unnatural ways, a practice at the heart of black hat SEO.
Even if you think it works short-term, I promise: eventually, you’ll realize you were wasting time.
As Google gets “smarter,” these links are becoming increasingly worthless.
Our site is an example of how you can gain external links pointing to your website in a fully natural way.
From day one, our focus was on writing high-quality content that would help others.
That’s it. We write and publish, and once it’s up, we promote it on our social media.
Many other websites in our industry use the same strategy, and some probably have even better results.
If you do want to spend time building links besides just writing good content, focus on the following:
- PR: reach out to people who might be interested in your content and ask them to include it on their sites.
- Guest blogging: share your expertise on other websites. You’ll gain links and traffic, but more importantly, you’ll build your brand in the long term.
Not all content should be indexed
It may sound surprising inside a guide on getting indexed, but you shouldn’t aim to have Google index all of your content.
You should know that having low-quality content indexed may actually damage your website.
A while ago, I wrote an article analyzing why popular websites such as Instagram, Giphy, or Pinterest suddenly lost 40-50% of their SEO visibility.
I accidentally discovered that these sites suffered massive visibility losses around the same time while going through one of the SEO tools.
This looked interesting, so I tried to find common patterns. And I found one.
Many tag/search pages from these websites used to be ranking high. And then they got deindexed, just like that.
Why? I would call it “collective responsibility.” I think Google decided there are many low-quality pages of this category that occupy the index and… deindexed ALL of them.
But when this problem happens, it doesn’t just end there.
It’s a vicious circle:
- Google crawls low-quality pages.
- Google stops visiting the website as often.
- Many pages aren’t ever crawled by Google, even if they are high-quality pages.
- There are valuable pages that aren’t indexed.
This shows how ranking, crawling, and indexing are interconnected.
Can crawlers find your content?
Is your content hidden behind login forms?
If you require users to log in, fill out forms, or answer surveys before accessing content, search engines won’t see it. A crawler is definitely not going to log in.
Are you relying on search forms?
Robots cannot use search forms. Some individuals believe that if they place a search box on their site, search engines will find everything that their visitors search for. I’m sorry, that won’t happen.
Is text hidden within non-text content?
Non-text media formats (images, video, GIFs, etc.) should not be used to display text that you wish to be indexed. While search engines are getting better at recognizing images, there’s no guarantee they will read and understand the text on images. It’s always best to have any text you want to be indexed within your web page’s HTML markup.
Can search engines follow your site navigation?
Just as the crawler needs to discover your site via links from other sites, it needs a path of links on your own site to guide it from page to page.
If you’ve got a page you want search engines to find, but it isn’t linked to from any other pages, it’s as good as invisible. Many sites make the critical mistake of structuring their navigation in ways that are inaccessible to search engines, hindering their ability to get indexed.
Do you have clean information architecture?
Information architecture is the practice of organizing and labeling content on a website to improve efficiency and findability for users. Good information architecture is intuitive, meaning that users shouldn’t have to think very hard to flow through your website to find something.
Common navigation mistakes that can keep crawlers from finding your content:
- Having a mobile navigation that shows different results than your desktop navigation.
- Personalization, or showing unique navigation to a specific visitor type, could be considered cloaking by Google.
Other things you should know about indexing
These were the basics that pretty much every website owner should know.
But since this is an Ultimate Guide, this chapter will cover some of the most advanced aspects of indexing.
Below you can find a couple of examples of international websites that have issues with indexing.
| Website | Number of language versions | % of pages indexed |
|---|---|---|
What happens when you have an online store in multiple languages?
For instance, you offer your products to people from:
- United States: example.com/us
- United Kingdom: example.com/uk
- Australia: example.com/au
What Google sees is duplicate content available under different URLs. Normally, it would decide on the canonical version and only index that.
This is where the hreflang attribute comes in: you can use it to inform Google about the different language and regional versions of your site so that it treats them as alternates, not duplicates.
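Using the store example above, the HTML head of each version would cross-reference the others (a sketch with the same placeholder URLs):

```html
<link rel="alternate" hreflang="en-us" href="https://example.com/us/" />
<link rel="alternate" hreflang="en-gb" href="https://example.com/uk/" />
<link rel="alternate" hreflang="en-au" href="https://example.com/au/" />
```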
If this sounds confusing, you can read more about it in my Ultimate Guide to International SEO.
As of March 2021, all websites fall under Mobile-First Indexing.
If MFI is a new concept to you, let me briefly explain it:
Google now crawls the mobile version of your page and uses the information it finds there for ranking.
So your mobile version is the one being crawled, indexed, and ranked.
Don’t let Google index sensitive data
So far, I mostly discussed the cases where Google doesn’t want to index content. But it can also happen that Google will index more than you wish for.
Be careful when you are publishing things like this:
- Phone number
- Any other confidential information
Remember that PDFs, Trello boards, and open FTP servers can get indexed by Google too.
Trello, a popular project management tool, lets you set a board as either private or public.
And because many boards are set to public, plenty of them have been indexed by Google.
After all, Trello makes it easy for Google to find them by putting them in sitemaps.
Be careful whenever you publish sensitive data on the web because removing content from Google’s index also takes time.
This brings me to my next point.
How to delete content from Google?
You can request content to be removed from Google for legal reasons.
All you need to do is to fill out a form as described in this video.
This feature might come in handy when someone copies your content and publishes it on their own website.
Web Performance is a ranking factor for Google. But this is outside the scope of this article.
What I want to talk about here is that there’s evidence that Google crawls slow pages less frequently. And less crawling means less indexing. Simple.
So if you notice that Google crawls your site less frequently or extensively than it used to, your server might be to blame. Reducing your server’s response time should allow Google to crawl faster.
And now is the time for your questions 😉
I hope I covered most of them, but if there’s still something on your mind, do let me know!
What is indexing in SEO?
Indexing is the process of storing web pages in the index – a search engine’s database. Indexing is the final step of a pipeline that every web page needs to go through in order to be retrieved and displayed to search engine users when their queries are relevant to the given page’s content.
In order to be indexed on Google, every page (with rare exceptions) must first be found by Googlebot, crawled, and rendered so that Google can analyze its content.
Can I place “noindex” in robots.txt?
Placing “noindex” in robots.txt used to work as an undocumented feature of Googlebot, but Google officially dropped support for it in September 2019. As of now, it doesn’t work.
How can I use GSC to find indexing issues?
- Check the number of indexed pages.
- Check if a given page is indexed.
- Check exactly why a page is not indexed.
- Find interesting crawl stats.
Will the site: command show me all indexed pages?
I found the following fragment in the documentation of Wix, a popular website-building platform:
“To see if your site has been indexed by search engines (Bing, Google, Yahoo, etc.), enter the URL of your domain with “site:” before it, i.e. “site:mystunningwebsite.com.” The results show all of your site’s pages that have been indexed, and the current Meta Tags saved in the search engine’s index.”
That’s not true. Site:website.com won’t show you every indexed page, and I have gigabytes of data to confirm it.
It shows you just a sample of pages with varying accuracy.
Can I use Google Cache to check how Google indexed my page?
That’s one of my favorite myths.
I don’t want to discuss it at length because we have an excellent article on the topic.
TL;DR: While Google Cache is very useful, don’t rely on it in this context.
Are some pages more prone to not getting indexed?
I noticed that there are types of websites that are most prone to indexing issues:
- Large, rapidly changing websites.
- International websites.
- eCommerce stores that copy content from a manufacturer.
- New websites (!!!).
However, as my statistics show, even small websites with up to 10k URLs can often have indexing issues.
Is having a sitemap enough to get crawled and indexed?
Often, especially in the case of large websites, a sitemap alone is not enough. Google may not crawl a page if the only place it can find the link is the sitemap. To help your pages reach the crawling priority threshold, use internal linking.
Can pages that are blocked in robots.txt be indexed on Google?
Yes. Google can find links to those pages on other pages. Just google “Google Jamboard.”
Can a page get removed from Google’s index?
It may occasionally happen that a page gets indexed by Google, ranks for prominent keywords, and then suddenly gets deindexed. There could be many reasons for that:
- A page is returning 4xx or 5xx errors.
- URLs have a noindex meta tag.
- Googlebot can’t access the page (blocked by robots.txt file or through password authentication).
- Google decided it’s duplicate content.
- A page no longer satisfies Google quality standards (especially after core updates).
- Google decided that there isn’t enough storage to keep it and made room for more important pages.
How can I know if Google deindexed my page?
You should visit the Crawled – currently not indexed report in Google Search Console.
However, this report will show you two types of URLs:
- URLs that got deindexed
- URLs NOT YET indexed (may be indexed in the future).
What’s the difference between Crawled – currently not indexed and Discovered – currently not indexed?
I see many people asking this question. It’s very easy; I explained it in the table below:
| |Google discovered it|Google visited it|Google indexed it|
|---|---|---|---|
|Crawled – currently not indexed|Yes|Yes|At the moment – no|
|Discovered – currently not indexed|Yes|No|No|
How often does Google crawl my website?
Google Search Console offers some data that will help you answer that question.
- Log on to Google Search Console.
- Navigate to “Crawl” -> “Crawl Stats.”
You can also find out how often Google crawls your website by analyzing your website’s log files, but it requires some expertise.
It’s worth noting that Google determines how often to crawl your website based on its crawl budget.
How to check if a sample of pages is indexed?
In the previous part of the article, I explained how to check how many pages of your website aren’t indexed and why.
But how do you check whether a specific sample of pages is indexed?
The easiest and most accurate way is to use the URL Inspection Tool.
It lets you inspect pages of your website one by one. However, after checking around 100 URLs, you will exceed the daily quota.
To check more URLs, you need to use Google Search Console’s Index Coverage report.
Keep in mind that there are up to 1000 URLs available in this report. So if you have a large website, this method won’t fully solve your problem either.
In one of my articles, Diagnosing Indexing Issues using GSC, I wrote about a workaround to get around the 1,000 URL limit.
Another way is to use Google Analytics or Google Search Console.
You can export a list of pages that get more than 0 visits from Google.
If a page gets traffic from Google, then it’s indexed. You should be careful, though – the fact that a page doesn’t get any traffic doesn’t necessarily mean a page is not indexed.
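To make the export approach concrete, here's a minimal Python sketch that filters a Search Console performance export down to pages with at least one click. The column names ("Top pages", "Clicks") and the sample data are assumptions; adjust them to match your actual export file:

```python
import csv
import io

def pages_with_traffic(csv_text: str) -> list[str]:
    """Return URLs that received at least one click from Google Search.

    Column names are a hypothetical example of a GSC performance export;
    rename them to match your real file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Top pages"] for row in reader if int(row["Clicks"]) > 0]

# Hypothetical export data for illustration
sample = """Top pages,Clicks,Impressions
https://example.com/a,12,300
https://example.com/b,0,45
https://example.com/c,3,80
"""

# Pages with clicks are definitely indexed; pages with 0 clicks may or
# may not be indexed and should be verified separately (e.g., with the
# URL Inspection Tool).
indexed = pages_with_traffic(sample)
print(indexed)  # → ['https://example.com/a', 'https://example.com/c']
```

In practice, you would read the CSV from disk instead of a string; the filtering logic stays the same.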
What does Mobile-First Indexing mean?
From now on, all websites are primarily crawled, indexed, and ranked based on their mobile versions.
My website is not indexed. What are the possible reasons?
- Your website is new and Google hasn’t had an opportunity to visit it yet.
- There are no external links from other websites – Google may not be sure if your website is good enough.
- You have technical issues or code that blocks Googlebot from accessing your content.
- Your website got penalized by Google.
- Your internal linking needs some work.
- You have a lot of low-quality, thin content.
Indexing ≠ ranking
As a final note, I need to emphasize that indexing is very important, but it’s not ranking. A page can be indexed and not rank for any keywords.
If you have a large website, you probably have some pages that get next to zero clicks and impressions – just look for them in your Google Search Console account!
Ranking and getting traffic is the final, most rewarding step of the SEO journey. But remember that crawling, indexing, and ranking all belong in the same pipeline and are fully interconnected.
Key takeaways
- Google doesn’t index everything. The statistics are staggering: on average, 16% of valuable, indexable pages aren’t indexed.
- At the same time, many large websites are fully indexed; an optimized website is easier for Google to index.
- Indexing is much more complicated than ensuring that a page doesn’t have a “noindex” tag or that it’s not blocked by robots.txt.
- eCommerce websites are particularly prone to indexing issues.
- Unique content helps with indexing, while duplicate content makes it more of a challenge.
- Google Search Console is a crucial tool for diagnosing indexing issues.
- Because the web is growing, we should expect Google to be even pickier when indexing content in the future.
- Having a URL in the sitemap is not enough for a page to be indexed by Google.
- You shouldn’t aim to have every page indexed by Google. Getting low-quality pages indexed can harm your traffic.
- Ranking and indexing are tightly related to crawling and discovering new pages.
- Google can “judge” a page without crawling it by looking at other pages on your site.