How Much Content is Not Indexed in Google in 2019?

Quick Summary

Here is an edited transcript of Bartosz Goralewicz’s session at SIXT SEO Wiesn, as presented on Friday, October 4, 2019. The topic was “How Much Content is Not Indexed in Google in 2019?” The talk covers Google’s indexing issues, how SEOs need to change their understanding of what a JavaScript website is, Onely’s new toolset, and lots more. You can watch the session in the video below and/or browse the entire deck at the bottom of this page.


Transcript

Today I want to talk about an issue we found that is related to JavaScript and SEO, but it's not really about JavaScript itself. We found out that quite a lot of content is not indexed in Google, even though the websites hosting that content are not what we would call JavaScript-powered websites.

When I was preparing this presentation, quite a lot of people reached out to me with questions. I actually talked to Bastian Grimm – who you probably know – and he was like, oh no, you're going to be talking about JavaScript SEO – isn't that dying? Isn't that ending? Isn't that something that's a little bit obsolete by now?

[Slide: Not JavaScript SEO Again]

I'm going to make it really interesting for you guys, because we found some interesting issues to talk about. A while ago I recorded a video arguing that JavaScript SEO is dying: once Google gets very, very good at rendering JavaScript, the whole need for JavaScript SEO will disappear.

And then a few months back I got invited to Zurich by Martin Splitt . . . to Google Hangouts, and we had a lengthy conversation with both Martin Splitt and John Mueller . . . I shared my opinion with both of them: JavaScript SEO is dying, because once you guys have more computing power to render it, it's not going to be needed anymore. And they both – John and Martin – went on a seven- to ten-minute monologue, or we could call it a sales pitch for our services, saying that JavaScript SEO is going to be needed more and more and more.

That was all before we had all the interesting research. After this conversation, we started to look into some of the interesting problems we found. And as we did, we realized that JavaScript SEO is a little bit more complex than we thought. It's not only websites like Hulu.com or Google Flights that are still having massive issues with indexing their own content because of JavaScript.

There are a lot of people in the community saying that JavaScript is evil – for a good reason, because it makes our work extremely complex. As SEOs, we need to do a lot of extra steps because of it. So I wouldn't say that JavaScript is evil – I'd say it is very, very complex.

And there's a very good reason for that. But the main problem with calling JavaScript evil is that most websites have by now moved from plain HTML to HTML with a lot of JavaScript. It's going to be very difficult to find any website you work with that doesn't have JavaScript.

[Slide: Everything is JavaScript]

One of the things the Googlers said in Zurich is that even HTML pages can get rendered, mostly because sometimes it's just cheaper for them – or not much more expensive – than not rendering them. But we found that JavaScript is a problem even for non-JavaScript websites, and . . . this is something I want to expand on today.

As an SEO community, we are used to what I call post factum learning: basically, wait until something breaks completely, there's a lot of drama in the community, and then everyone fixes it. The problem we're seeing now with JavaScript is that the change is happening, but it's happening very slowly.

It's not something that's going to make any website drop massively overnight. And, as I said, JavaScript SEO has now expanded from the JavaScript-powered websites we would see before to all the websites we work with. The irony behind all that – one of my favorite examples – is an article I used for a year as an example of a very interesting idea. The content of the article is not important; it's written by Googlers, about the cost of JavaScript. What matters is that the article is published on Medium.

How many of you would say that Medium is a JavaScript website? Seriously – if you look at the comments underneath this article, they are not indexed in Google. For me, that was one of those extremely geeky jokes – I had a lot of laughs after I found this one. And whoever of you laughed: you're a geek.

Anyway, if we look at this very page, it has 500 referring domains, 2,000 backlinks, and a tremendous amount of reads, claps, and whatever. So it's a very popular piece of content that we would assume Google is going to crawl, render, and index pretty often.

Now, the problem is that this post was published more than a year ago and Google still hasn't indexed those comments, because the whole comments section on Medium is powered by JavaScript. Now we'll get into why exactly this happened.

There is the timeframe of JavaScript indexing, where we wait for the two waves of indexing to happen, and all of that. But it actually got a little bit more complex with some of our research, as I'll explain in a second.

[Slide: Hundreds of thousands of domains not fully indexed]

Long story short, we found that thousands of domains are not fully indexed even months after the content was published. So even if you publish an amazing article today, it may happen that the URL gets indexed but the content is not indexed for a few months. Or someone else outranks you for your own content.

Before, most of us would have blamed JavaScript for that. I think it's more accurate now to blame rendering, because the problem has evolved towards rendering recently, and JavaScript itself is not as big a factor.

Let's talk about what the Googlers told me in Zurich, which was actually very, very interesting.

So I asked them how rendering works at Google. They look at the difference between the initial HTML and the rendered version: they render the content and see if there's any change. So Google takes the HTML version and compares it with the rendered version.

Now, if we think about Medium: let's say you publish an amazing article there. Google will compare the HTML version of the article with the rendered version, and there will be no difference – because there are no comments yet. This is where we start to see a massive issue with the heuristics.
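To make that concrete, here is a minimal sketch of that comparison – roughly what the heuristic is described as doing – assuming Node.js 18+ and the puppeteer package; the URL is just a placeholder:

```javascript
// Compare the initial HTML (what a non-rendering crawler sees) with the
// rendered DOM (what a headless browser sees after executing JavaScript).
const puppeteer = require('puppeteer');

async function renderDelta(url) {
  // 1. Initial HTML, no JavaScript executed.
  const rawHtml = await (await fetch(url)).text();

  // 2. Rendered DOM, after scripts have run.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const renderedHtml = await page.content();
  await browser.close();

  // 3. A crude "did rendering change anything?" signal: the size difference.
  const delta = renderedHtml.length - rawHtml.length;
  console.log(`raw: ${rawHtml.length} B, rendered: ${renderedHtml.length} B, delta: ${delta} B`);
  return delta;
}

renderDelta('https://medium.com/some-article'); // placeholder URL
```

If the delta is essentially zero – as it would be for a fresh Medium post with no comments yet – a heuristic like the one described has no reason to keep rendering the page.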

Martin said that he hasn't fully grasped what triggers the heuristics – and it's not because Martin is not good at this, he's an amazing guy. I'm guessing those heuristics somehow rely on machine learning and on signals that are just not human-readable. So there are certain heuristics that, after a while, look at the difference between the rendered page and the non-rendered page.

Again, the Medium example. I would say those heuristics are still in their infancy. They're still pretty new, and Google is still tuning and optimizing them – like the Google algorithm in 2006. You probably remember how easy those times were.

And those heuristics are far from perfect. What Martin actually said is that all new websites get rendered, which is extremely interesting, because from my point of view the question is: okay, what's a new website?

And the second problem I had: all of the experiments we did at Onely were based on new domains, new IPs, and so on. So most of our experiments were kind of useless from this point of view.

[Slide: What is a new website?]

What's a new website? If you relaunch your CMS, if you publish a new [version of your website], is that a new website, or does it have to be a new domain? And what about a new website that doesn't have any user-generated content?

We started playing with that. With a lot of clients, you would normally advise running an experiment on staging before launching a new CMS. Looking at how this is structured, that advice doesn't really hold up: you can't properly test a new CMS that way, because staging is most likely a new domain, and you can't really index and play around within your actual domain.

We decided to test how good Google's heuristics really are, so we started playing with this a little bit more. We began by rerunning all our experiments from 2017, because we found out that Google had gotten much better at indexing JavaScript.

Long story short, we had three domains – obviously new IPs, new domains, content generated with Articoolo – and we compared the results to our experiments from 2017. In 2017, we had a page where the homepage would link down with HTML links five or six levels deep. With JavaScript-generated links, Google would only index the homepage, go one link deep, and then give up, because the links were generated by JavaScript.
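For context, the JavaScript variant of that test boils down to a link that only exists after scripts run. A minimal sketch of what such a test page's script could look like (the file names are illustrative, not the actual experiment code):

```javascript
// On each level of the chain, the link to the next level is created by
// JavaScript, so a crawler that doesn't render never discovers it.
document.addEventListener('DOMContentLoaded', () => {
  const link = document.createElement('a');
  link.href = '/level-2.html'; // next level down the chain
  link.textContent = 'Go one level deeper';
  document.body.appendChild(link);
});
// The HTML control pages simply ship <a href="/level-2.html"> in the
// source, visible without any rendering.
```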

That 2017 page, by the way, is still not indexed after two years. So we repeated the experiment, just with a lot of different domains like jscrawling.party or htmlcrawling.wine. We went all crazy on the new TLDs – jscrawling.pizza is one of my favorites.

We played with that for a bit, and all of those were indexed within literally minutes. For one of them we had to wait one day, but then it was indexed within minutes. So we saw a massive improvement in how Google is dealing with JavaScript content. But now we know it's all because these were new domains. Our 2019 experiment turned out to be wildly successful, but not really useful for any of us, because it ran on new domains.

Google apparently switches on crawling and rendering for a set period of time – we don't really know how long – just to see if JavaScript is changing the content. So the new Google actually wins.

And, yeah, great job – that's a massive improvement, across every single experiment. We couldn't create an experiment, even with a massive JavaScript load – really heavy scripts – that would force Google to fail to index it. We also played with the HTML-to-JavaScript ratio, because we figured maybe it depends on how much of the content is injected.

[Slide: JS to HTML ratio]

We had three types of test pages: everything injected by JavaScript, just one paragraph, and just one word. All of them were indexed almost in minutes; most of the content was indexed within half an hour. For five URLs we had to wait a little bit just for Google to crawl them – not index, just crawl – but then they were indexed almost instantly.

After four hours, 29 out of 30 pages were indexed, and after eight hours all the test domains were indexed completely. So this turned out to be a massive win for Google as well.
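For the ratio test, the idea is the same page shell with different amounts of client-side-injected content. A minimal sketch of what such an injection script might look like (the element ID, the variants, and the `TEST_VARIANT` global are made up for illustration):

```javascript
// Inject one word, one paragraph, or the entire body via JavaScript,
// depending on which variant this test page represents.
const variants = {
  word: 'JavaScript',
  paragraph: '<p>A single paragraph of test content injected after load.</p>',
  everything: '<h1>Test page</h1><p>All of the visible content lives here.</p>',
};

// The shell HTML contains <div id="content"></div> and sets TEST_VARIANT.
const variant = window.TEST_VARIANT || 'word';
document.getElementById('content').innerHTML = variants[variant];
```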

We couldn't force Google to fail at indexing – something that wasn't possible two years ago at all. As I said, this is quite a change, and it's somehow not visible in the industry.

But we don't give up easily. We figured we'd create one more experiment: relaunch the experiment from 2017 that was massively popular. We did, and long story short – I'm guessing you know where this goes – again, Google didn't choke on any of the scripts, any of the frameworks, any of the setups. Inline, external, doesn't matter.

Again, Google won. So Martin Splitt was completely right about new websites. This is something they designed well, and it works as designed – again, for new websites.

But what about popular websites?

Since we can't really play with experiments here, we looked at existing domains. And this is where it got really interesting. Can Google deal with real websites that have a little bit of content generated by JavaScript?

This was one of our most complex experiments, and it dragged on for a few weeks. It got a little bit out of control, because we spent way too much time on it after seeing some of the changes.

National Geographic is a website you would call almost completely JavaScript-powered, because when you switch off JavaScript, everything disappears. The content is invisible, almost completely – there is just a headline. Fun fact: this website has no issues with JavaScript indexing.

[Slide: National Geographic with no JavaScript]

We used a lot of random samples, and we could never get National Geographic to choke on indexing. This was a first for us: a massive brand not having any JavaScript issues.

Same with ASOS: comparing ASOS with JavaScript and without, most of the content is gone. And, as you can already guess, 100% of the JavaScript content gets indexed. This is a massive change as well – something we couldn't see in the wild two years ago.

But not every website is so lucky. That's enough of the positive examples – let's go into the most interesting zone, the things that don't work, which is usually what SEOs love the most.

We took a random sample of URLs, and this is where things got really interesting. With a random sample of a few hundred URLs: Urban Outfitters had no JavaScript-generated content indexed, then came J. Crew and Topshop, Sephora at 40%, H&M at 73%, and T-Mobile – which is obviously run by German SEOs – did the best. It doesn't surprise me at all.

So that was a random sample. Now let's look into the two waves of indexing: how does that actually work, and what's the timeframe? We knew that something was not indexed, so how does time affect it? We looked at some interesting domains and measured the percentage of JavaScript content not indexed after 14 days. We would fetch sitemaps, see that a page had just been published, wait until the URL itself was indexed, and from the moment the URL was indexed, measure the time until the JavaScript content was indexed. This is the project that got out of hand, because it grew way more than we expected.
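If you want to run a similar check yourself, here is a rough sketch of the footprint approach: fetch a sitemap, then build exact-match `site:` queries for a phrase that only exists in the rendered version of the page. It assumes Node.js 18+, and the queries are meant to be checked by hand – scraping Google's results programmatically is against its terms of service:

```javascript
// Build manual indexation checks: one query to confirm the URL is indexed,
// one to check whether a JS-only fragment of the page is indexed with it.
async function buildIndexationChecks(sitemapUrl, jsFootprint) {
  const xml = await (await fetch(sitemapUrl)).text();
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);

  return urls.slice(0, 20).map((url) => ({
    url,
    urlIndexed: `https://www.google.com/search?q=${encodeURIComponent(`site:${url}`)}`,
    jsIndexed: `https://www.google.com/search?q=${encodeURIComponent(`site:${url} "${jsFootprint}"`)}`,
  }));
}

// Placeholder sitemap URL and footprint phrase:
buildIndexationChecks(
  'https://www.example.com/sitemap.xml',
  'You might also be interested in'
).then(console.log);
```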

[Slide: Percentage of JS content not indexed after 14 days]

The Guardian has 66% of its JavaScript content not indexed after two weeks. And this is not a tiny bit of the page – it's things like the "You might also be interested in" section: a good bunch of the links they use for internal linking to new content are not indexed.

Target – 30%. The New York Post is actually very good at this. But CNBC – I won't even comment; CNBC had a massive JavaScript issue. And none of these websites is something we would have called a JavaScript-powered website a year or two ago.

This is where we as a community kind of failed recently, because we would blame Google for this for quite a while – sometimes, as a community, that's what we do best. I was guilty of that as well. For the last year or two, most of what SEOs did for JavaScript SEO was recommend pre-rendering, and that's it.

In this case, I feel it gets a little bit more complicated, because every single JavaScript SEO issue we saw was 100% self-induced. We need to look at ourselves, developers, and webmasters to fix those, because every single one we worked with wasn't Google's fault. It was basically a design flaw in how we use JavaScript.

The JavaScript community is growing very fast as well, so this is one of the side effects. Moving forward and talking about the timeline: we're waiting for crawling and indexing to come together. This is something Google is telling us they are going to announce pretty soon-ish, because, as we can see, they are getting better and better at indexing the content. I wouldn't say it's happening this year or next, but we're waiting for it.

When the two waves of indexing come together, in theory, the problem of JavaScript is going to fade away. But it's going to be a while. That's the first thing.

Secondly, I wonder how Google is going to deal with pages like Medium or The Guardian, where they would have to render everything. I wonder how granular it's going to be, and at which point we get to "we're rendering every single page online." So, what to do?

We had some big news a few days ago and I want to explain that a little bit. 

We created OMFG (Onely Made For Geeks), a free toolset that helps you see some of these problems, because we said, "Okay, there is no way to see if your website has a JavaScript problem or not."

TGIF – the Google Indexing Forecast – is part of the toolset. Every single day, we look at the percentage of pages with JavaScript content that's not indexed. This is a manually created dataset of large brands and footprints, and we watch how it changes – just for us SEOs to see how close Google is getting to indexing everything properly.

You can also see whether anything changes after one week or after two weeks.

We are including any brand. If you send us a page – The Guardian, for example – we find a part of it that's rendered with JavaScript, like a "You may also be interested in…" section, we footprint that part, and we fetch the sitemap.

It's quite a lot of work to do that manually, but I think the database is now growing to around 100 to 200 websites – big websites – and for each one we take quite a lot of URLs. It's constantly growing. Last time I talked to our research and development team, they were adding tens of pages per day. And we footprint everything manually – there is actually no way to automate it.

You can also compare – if you want to get geeky – HTML delay versus JavaScript delay. Because when JavaScript is not indexed, that's often not a JavaScript issue – it's a crawl budget problem.

The Googlers said: if your JavaScript is not getting indexed, look into your crawl budget. So I figured maybe there is some kind of correlation – that's an experiment we're working on. But we saw that quite a lot of pages struggle to get even their HTML content indexed within days, so this was a very good lead to look into.

Anyhow, you can have a look: for example, after two weeks you can see some pages with 90-something percent of their HTML indexed, and out of that, 70% of the JavaScript content indexed – so the delay is still really big. And after two weeks, HTML indexing for some of the dates would be around eighty percent for very big brands. This is something we had to quadruple-check before publishing, because we couldn't believe it.

There's one more tool within the toolset that's quite interesting: WWJD – What Would JavaScript Do? It compares the JavaScript-disabled version of a page against the JavaScript-enabled version to see if there's any difference.

[Slide: Onely tool WWJD]

You can see on Hulu.com that quite a lot of content disappears, and we can assume this is the content that relies on JavaScript. In this case it's images, but if it's text, there's a very good chance Google is not going to pick it up – like the comments on Medium. If you look at bbc.co.uk, quite a lot of content relies on JavaScript and disappears without it.

One more interesting thing we compare is the non-rendered versus rendered version of the major meta tags, because quite a lot of websites change their canonicals or switch to noindex with JavaScript. And to be honest, from our experiments we're never sure which version Google is going to respect, because we serve both or mixed versions in our experiments.

Again, looking at BBC – this is a very interesting example – in the raw HTML, BBC shows one title, "BBC home"; after rendering the JavaScript, it changes to "the BBC homepage." But it gets really interesting with canonicals, because after rendering the JavaScript, the domain in the canonical changes.

This is the BBC, so you would expect them to do a little bit better. This may be somehow influenced by how we test it – we still can't believe they would do it like that – but even so, it's something to look into. These problems are definitely for the BBC to fix.
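A quick way to spot this class of problem on your own site is to compare the major meta tags before and after rendering. A minimal sketch, assuming Node.js 18+ and puppeteer, with deliberately naive regexes (the URL is a placeholder):

```javascript
const puppeteer = require('puppeteer');

// Naive extraction of <title> and the canonical link; fine for a sketch,
// too fragile for production (attribute order, quoting, etc. vary).
function extractMeta(html) {
  const title = (html.match(/<title[^>]*>([^<]*)<\/title>/i) || [])[1];
  const canonical =
    (html.match(/<link[^>]*rel=["']canonical["'][^>]*href=["']([^"']+)["']/i) || [])[1];
  return { title, canonical };
}

async function compareMeta(url) {
  const raw = await (await fetch(url)).text();

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const rendered = await page.content();
  await browser.close();

  console.log('raw HTML: ', extractMeta(raw));
  console.log('rendered: ', extractMeta(rendered));
}

compareMeta('https://www.bbc.co.uk/'); // placeholder URL
```

If the two lines disagree on the canonical, you have exactly the kind of ambiguity described above: nobody can say for sure which version Google will respect.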

There's one more very interesting thing: links added by JavaScript. If you crawl your website without JavaScript – with Ryte or DeepCrawl, or the crawler of your choice – and then with JavaScript, you will see two different datasets: a different link graph for each version.

Which one matters in the end?

This is something we can't agree on internally at the office – one of the easiest ways to pick a fight at Onely is to ask that question. There are also links removed by JavaScript, which is even more interesting, because that's where it gets really confusing. And if we look at the BBC, this problem is real.
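Checking this yourself is easy to sketch: collect the hrefs from the raw HTML and from the rendered DOM, then diff the two sets. Again assuming Node.js 18+ and puppeteer (the URL is a placeholder):

```javascript
const puppeteer = require('puppeteer');

// Naive href extraction from raw HTML; fine for a sketch.
const hrefsFromHtml = (html) =>
  new Set([...html.matchAll(/<a[^>]+href=["']([^"'#]+)["']/gi)].map((m) => m[1]));

async function linkDiff(url) {
  const raw = hrefsFromHtml(await (await fetch(url)).text());

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const rendered = new Set(
    await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.getAttribute('href')))
  );
  await browser.close();

  const added = [...rendered].filter((h) => !raw.has(h));   // links JS adds
  const removed = [...raw].filter((h) => !rendered.has(h)); // links JS removes
  console.log({ added: added.length, removed: removed.length });
}

linkDiff('https://www.bbc.co.uk/'); // placeholder URL
```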

Too Long; Didn't Render – TL;DR – is the last part of our toolset, where you can see the cost of rendering your page, based on CPU and memory. There is one winner in our case, which I'll get to in a moment – I had to run it a few times just to get into the green zone. The BBC, on the other hand, went kind of crazy with how heavy they are. So why do you need this?

[Slide: Onely tool TL;DR]

I should have led with that. You need it because – in the case of the BBC – if your users have cheaper mobile devices (a topic I've spoken about quite a lot already), a page like the BBC's is going to choke on them.

A Motorola G4, cheaper Android devices, older iPhones – they are not going to deal well with such a rendering load.
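TL;DR's exact scoring isn't public, but you can approximate the idea with puppeteer's built-in metrics under CPU throttling – script execution time and JS heap size are reasonable proxies for what rendering costs on a cheap device. A sketch (the URL and the 4x throttling rate are assumptions):

```javascript
const puppeteer = require('puppeteer');

async function renderCost(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Slow the CPU down 4x to roughly imitate a low-end Android phone.
  await page.emulateCPUThrottling(4);

  await page.goto(url, { waitUntil: 'networkidle0' });
  const m = await page.metrics(); // ScriptDuration in seconds, heap in bytes
  await browser.close();

  console.log(
    `${url}: script time ${m.ScriptDuration.toFixed(2)} s, ` +
    `JS heap ${(m.JSHeapUsedSize / 1024 / 1024).toFixed(1)} MB`
  );
}

renderCost('https://www.bbc.co.uk/'); // placeholder URL
```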

There is one page that is still our number one: SEOktoberfest, the page made by Marcus [Tandler].

There is zero CSS, and the score is 2. This is the most amazing score we've seen, because the cost of rendering is almost zero – and that's only because of one image. I guess if you removed that image, it would go down to 1, so maybe you should go backwards in development.

But what we see quite often is that if your content relies on rendering, it is very easy for someone to outrank you with content that doesn't need to be rendered. So, yeah, that's more or less it.

There are also links added by JavaScript, which are a massive pain for your technical SEO team, because if you add links with JavaScript, your internal link graph is going to change. This is just one use case showing how you can check that.

You can see that this problem somehow also shows up in the Mobile-Friendly Test. There are quite a lot more tools we're building to launch, but you can see that this toolset is already quite useful for playing with your domains. And it's completely free.

Let’s talk about HTML. Let’s go old-school for a second. 

Let's see how quickly Google indexes HTML content from The Guardian. Looking at 1,300 URLs, Google indexed 98% of The Guardian's HTML content. Pretty decent – I wouldn't complain. So for plain HTML – just whether the URL gets indexed – The Guardian is very good. But it doesn't look as good for other brands.

[Slide: The Guardian, Reuters, Eventbrite, Target comparison]

But if we look at, for example, Target – and you would expect an e-commerce website that's fairly big in the States to do very well – they only get to 80% after two weeks, two weeks from publishing products and content. So we can see that this problem is much bigger than just JavaScript SEO.

Eventbrite, for example, has 55 or 56 percent of their content indexed after two weeks. You can imagine how much of a problem that is, because they will optimize those pages and ask, "Okay, why don't we rank well?" – when actually half of their pages, half of their domain, are not indexed. So HTML seems to be very problematic too . . .

Medium is my final example – Medium is medium at indexing. That's the joke, thank you. With a quick check of 100 URLs from Medium – a random check, not tied to a timeframe – only 70% of them were indexed in Google. So Medium has massive issues indexing its content, something you wouldn't expect from a content platform. And of the content that is indexed, only 50% has its JavaScript content indexed . . .

So you can see that this is a problem because, again, this is not a JavaScript problem. It is basically a crawl budget and HTML indexing issue.

Just one last slide. This is an example I've used quite a few times at conferences to show how dangerous this is – something to either play with or avoid, depending on which side of SEO you're on.

We created quite a lot of pages with content that is sensitive for a lot of people, like gun control, Trump versus Hillary, or Peppa Pig – if you saw the Peppa Pig drama with some of the violent content.

[Slide: nomoregunsusa.com case study]

We basically built a website that shows two completely different stories depending on whether JavaScript is switched on or off. Google couldn't pick it up – for this one, I think, for a year. You can actually visit the page. You can cloak anything with just JavaScript, and this is extremely dangerous, because Google sees a page saying there should be gun control.

So all the search engines see that gun control should happen, but once you render the JavaScript, it replaces not only the header but the whole content with "no gun control." We saw some examples in the wild of people doing this for some big brands in the US, in the car industry as well – but not currently, unfortunately.
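Mechanically, this kind of cloaking is trivial: the initial HTML carries one message, and a few lines of JavaScript rewrite it after load. A deliberately simplified sketch – the selectors and text are made up, and this is exactly the technique to avoid:

```javascript
// A non-rendering crawler indexes the message shipped in the HTML source;
// users (and rendering crawlers) see the rewritten version instead.
document.addEventListener('DOMContentLoaded', () => {
  document.querySelector('h1').textContent = 'No gun control';
  document.querySelector('#story').innerHTML =
    '<p>The opposite story, visible only after JavaScript runs.</p>';
});
```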

A lot of people are playing with this to inject quite a lot of content for search engines but not for users. What we actually saw: users see just a listing page with a photo of, let's say, a car, and a description – but Google sees a massive spreadsheet of data and everything. And this still works very well.

This is something Google can't fix, I'm guessing for technological reasons – or maybe the scale is not big enough for them to worry about it. I actually showed this to Martin Splitt as well, so maybe they will somehow address it, but it is a massive issue. So again, depending on the side of SEO you're on, you'll either avoid this or play with it.

Thank you so much. This is it. More data and more tools are coming soon. 

Browse the Deck