SEO Office Hours, July 1st, 2022


This is a summary of the most interesting questions and answers from the Google SEO Office Hours with John Mueller on July 1st, 2022.

PageSpeed Insights or Google Search Console ‒ which one is more accurate?

0:44 “When I check my PageSpeed Insights score on my website, I see a simple number. Why doesn’t this match what I see in Search Console and the Core Web Vitals report? Which one of these numbers is correct?”

According to John: “[…] There is no correct number when it comes to speed ‒ when it comes to understanding how your website is performing for your users. In PageSpeed Insights, by default, I believe we show a single number that is a score from 0 to 100, which is based on a number of assumptions where we assume that different things are a little bit faster or slower for users. And based on that, we calculate a score.

In Search Console, we have the Core Web Vitals information, which is based on three numbers for loading speed, interactivity, and visual stability. And these numbers are slightly different, of course, because it’s three numbers, not just one number. But, also, there’s a big difference in the way that these numbers are determined. Namely, there’s a difference between so-called field data and lab data.

Field data is what users have seen when they go to your website. And this is what we use in Search Console. That’s what we use for Search as well. Whereas lab data is a theoretical view of your website, where our systems have certain assumptions where they think, well, the average user is probably like this, using this kind of device, and with this kind of a connection, perhaps. And based on those assumptions, we will estimate what those numbers might be for an average user. You can imagine those estimations will never be 100% correct.

Similarly, the data that users have seen ‒ that will change over time, as well, where some users might have a really fast connection or a fast device, and everything goes fast when they visit your website, and others might not have that. And because of that, this variation can always result in different numbers.

Our recommendation is generally to use the field data, the data you would see in Search Console, as a way of understanding the current situation for your website, and then to use the lab data, namely, the individual tests that you can run directly yourself, to optimize your website and try to improve things. And when you are pretty happy with the lab data that you’re getting with your new version of your website, then over time, you can collect the field data, which happens automatically, and double-check that users see it as being faster or more responsive, as well.

So, in short, again, there is no correct number when it comes to any of these metrics. […] But, rather, there’s different assumptions and different ways of collecting data, and each of these is subtly different.”

Why does Googlebot struggle with indexing JavaScript-based pages?

4:19 “We have a few customer pages using Next.js without a robots.txt or a sitemap file. Theoretically, Googlebot can reach all of these pages, but why is only the homepage getting indexed? There are no errors or warnings in Search Console. Why doesn’t Googlebot find the other pages?”

John said, “[…] Next.js is a JavaScript framework, which means that the whole page is generated with JavaScript. But a general answer, as well, for all of these questions like, why is Google not indexing everything ‒ it’s important to first say that Googlebot will never index everything across a website. I don’t think it happens to any non-trivial-sized website that Google would go off and index completely everything. From a practical point of view, it’s not possible to index everything across the whole web. So that assumption that the ideal situation is everything is indexed ‒ I would leave that aside and say you want Googlebot to focus on the important pages. 

The other thing, though, which became a little bit clearer when, I think, the person contacted me on Twitter and gave me a little bit more information about their website, was that the way that the website was generating links to the other pages was in a way that Google was not able to pick up. So, in particular, with JavaScript, you can take any element on an HTML page and say, if someone clicks on this, then execute this piece of JavaScript. And that piece of JavaScript can be to navigate to a different page, for example. And Googlebot does not click on all elements to see what happens but, rather, we go off and look for normal HTML links, which is the traditional, normal way that you would link to individual pages on a website.

And, with this framework, it didn’t generate these normal HTML links. So we could not recognize that there’s more to crawl, more pages to look at. And this is something that you can fix in the way that you implement your JavaScript site. We have a ton of information on the Search Developer Documentation site around JavaScript and SEO, in particular, on the topic of links because that comes up every now and then. There are lots of creative ways to create links, and Googlebot needs to find those HTML links to make it work. […]”
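The distinction John describes can be shown in a short sketch. The element and the hypothetical `/products` URL below are illustrative only; the point, per Google's JavaScript SEO documentation, is that Googlebot discovers pages through `<a>` elements with an `href` attribute, not through click handlers:

```html
<!-- Not crawlable: Googlebot does not click elements to see where they lead -->
<span onclick="window.location.href='/products'">Products</span>

<!-- Crawlable: a normal HTML link with an href that Googlebot can follow -->
<a href="/products">Products</a>
```

In frameworks like Next.js, this usually means using the framework's link component (which renders a real `<a href>`) for internal navigation rather than attaching navigation logic to arbitrary elements.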

Do you want to learn more about JavaScript SEO?

Check how we can help you with our JavaScript SEO services.

Apart from Google’s official documentation, you can also read the Ultimate Guide to JavaScript SEO on our blog.

Does linking to HTTP pages influence your website’s SEO?

7:35 “Does it affect my SEO score negatively if my page is linking to an external insecure website? So on HTTP, not HTTPS.”

John said, “First off, we don’t have a notion of an SEO score, so you don’t have to worry about SEO score.

But, regardless, I understand the question as: is it bad if I link to an HTTP page instead of an HTTPS page? And, from our point of view, it’s perfectly fine. If these pages are on HTTP, then that’s what you would link to. That’s what users would expect to find. There’s nothing against linking to sites like that, and there is no downside for your website. You don’t need to avoid linking to HTTP pages just because they’re old or crusty and not as cool as pages on HTTPS. I would not worry about that.”

Should you delete your disavow file?

10:16 “Over the last 15 years, I’ve disavowed over 11,000 links in total. […] The links that I disavowed may have been from hacked sites or from nonsense, auto-generated content. Since Google now claims that they have better tools to not factor these types of hacked or spammy links into their algorithms, should I delete my disavow file? Is there any risk or downside to just deleting it?”

John answered, “[…] Disavowing links is always one of those tricky topics because it feels like Google is probably not telling you the full information.

But, from our point of view, […] we do work hard to avoid taking these links into account. And we do that because we know that the Disavow links tool is somewhat of a niche tool, and SEOs know about it, but the average person who runs a website has no idea about it. And all of those links that you mentioned are the kind of links that any website gets over the years. And our systems understand that these are not things that you’re trying to do to game our algorithms.

So, from that point of view, if you’re sure that there’s nothing around a manual action that you had to resolve with regards to these links, I would delete the disavow file and […] leave all of that aside. One thing I would personally do is download it and make a copy so that you have a record of what you deleted. But, otherwise, if you’re sure these are just the normal, crusty things from the Internet, I would delete it and move on. There’s much more to spend your time on when it comes to websites than just disavowing these random things that happen to any website on the web.”

Is it better to block crawling with robots.txt or the robots meta tag?

14:19 “Which is better: blocking with robots.txt or using the robots meta tag on the page? How do we best prevent crawling?”

John: “[…] We did a podcast episode recently about this, as well. So I would check that out. […]

In practice, there is a subtle difference here where, if you’re in SEO and you’ve worked with search engines, then probably you understand that already. But for people who are new to the area, it’s sometimes unclear exactly where all of these lines are. 

With robots.txt, which is the first one that you mentioned in the question, you can block crawling. So you can prevent Googlebot from even looking at your pages. And with the robots meta tag, when Googlebot looks at your pages and sees that robots meta tag, you can do things like blocking indexing. In practice, both of these result in your pages not appearing in the search results, but they’re subtly different.

So if we can’t crawl, then we don’t know what we’re missing. And it might be that we say, well, actually, there’s a lot of references to this page. Maybe it is useful for something. We don’t know. And then that URL could appear in the search results without any of its content because we can’t look at it. Whereas with the robots meta tag, if we can look at the page, then we can look at the meta tag and see if there’s a noindex there, for example. Then we stop indexing that page, and then we drop it completely from the search results.

So if you’re trying to block crawling, then definitely, robots.txt is the way to go. If you don’t want the page to appear in the search results, then I would pick whichever one is easier for you to implement. On some sites, it’s easier to set a checkbox saying that I don’t want this page found in Search, and then it adds a noindex meta tag. On others, maybe editing the robots.txt file is easier. [It] depends on what you have there.”
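The two mechanisms John contrasts look like this in practice. The `/internal/` path is a hypothetical example; the syntax itself follows Google's robots.txt and robots meta tag documentation:

```html
<!-- Option 1: block crawling entirely in robots.txt (file at the site root).
     Googlebot never fetches matching URLs, but they can still appear in
     results without their content:

     User-agent: *
     Disallow: /internal/
-->

<!-- Option 2: let Googlebot crawl the page, but block indexing with a
     robots meta tag in the page's <head>. The page is then dropped from
     the search results completely: -->
<meta name="robots" content="noindex">
```

Note that combining the two defeats option 2: if robots.txt blocks the URL, Googlebot never sees the `noindex` tag on it.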

Are your pages “Blocked by robots.txt” in Google Search Console?

Read my article to ensure you blocked your pages from crawling on purpose.

Can you place the same URL within multiple sitemap files?

16:40 “Are there any negative implications to having duplicate URLs with different attributes in your XML sitemaps? For example, one URL in one sitemap with an hreflang annotation, and the same URL in another sitemap without that annotation.”

John said, “[…] From our point of view, this is perfectly fine. […] This happens every now and then. Some people have hreflang annotations in sitemap files specifically separated away, and then they have a normal sitemap file for everything, as well. And there is some overlap there.

From our point of view, we process these sitemap files as we can, and we take all of that information into account. There is no downside to having the same URL in multiple sitemap files. 

The only thing I would watch out for is that you don’t have conflicting information in these sitemap files. So, for example, if with the hreflang annotations, you’re saying, this page is for Germany, and then on the other sitemap file, you’re saying, well, actually this page is also for France, […] then our systems might be like, well, what is happening here? We don’t know what to do with this mix of annotations. And then it can happen that we pick one or the other.

Similarly, if you say, this page has been last changed 20 years ago […], and in the other sitemap file, you say, well, actually, it was five minutes ago. Then our systems might look at that and say, well, one of you is wrong. We don’t know which one. Maybe we’ll follow one or the other. Maybe we’ll ignore that last modification date completely. So that’s the thing to watch out for.

But otherwise, if it’s just mentioned in multiple sitemap files and the information is either consistent or works together, in that maybe one has the last modification date and the other has the hreflang annotations, that’s perfectly fine.”
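A non-conflicting setup of the kind John describes might look like the sketch below. The file names and example.com URLs are hypothetical; the markup follows the sitemaps.org protocol and Google's hreflang-in-sitemaps format:

```xml
<!-- sitemap-hreflang.xml: the URL with its hreflang annotations -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/de/page</loc>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/page"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/page"/>
  </url>
</urlset>

<!-- sitemap-general.xml: the same URL again, contributing only a lastmod.
     This is fine because nothing here contradicts the file above. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/de/page</loc>
    <lastmod>2022-06-28</lastmod>
  </url>
</urlset>
```

A conflict would arise only if, say, the second file declared a different `lastmod` or a contradictory hreflang target for the same `<loc>`.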

How to prevent embedded video pages from being indexed?

19:00 “I’m in charge of a video replay platform, and our embeds are sometimes indexed individually. How can we prevent that?”

John answered: “[…] I looked at the website, and these are iframes that include a simplified HTML page with a video player embedded in that.

From a technical point of view, if a page has iframe content, then we see those two HTML pages. And it is possible that our systems indexed both of those HTML pages because they are separate HTML pages. One is included in the other, usually, but they could theoretically stand on their own, as well.

And there’s one way to prevent that, which is a fairly new combination with robots meta tags that you can do, which is with the indexifembedded robots meta tag together with a noindex robots meta tag.

And on the embedded version, so the HTML file with the video directly in it, you would add the combination of noindex plus indexifembedded robots meta tags. And that would mean that if we find that page individually, we would see there’s a noindex [tag]. We don’t have to index this.

But with the indexifembedded, it tells us that […] if we find this page with the video embedded within the general website, then we can index that video content, which means that the individual HTML page would not be indexed. But the HTML page with the embed, with the video information, that would be indexed normally. So that’s the setup that I would use there. And this is a fairly new robots meta tag, so it’s something that not everyone needs. Because this combination of iframe content or embedded content is rare. But, for some sites, it just makes sense to do it like that.”
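Concretely, the combination John describes goes on the player page itself, i.e., the URL that is loaded inside the iframe, not on the embedding page. Per Google's robots meta tag documentation:

```html
<!-- On the embeddable player page (the URL loaded inside the iframe):
     noindex keeps it out of the results as a standalone page, while
     indexifembedded still allows its content to be indexed as part of
     the page that embeds it. -->
<meta name="robots" content="noindex, indexifembedded">
```

The same directives can also be sent as an HTTP response header (`X-Robots-Tag: noindex, indexifembedded`), which is handy when the embed responses are served by a media backend rather than a templated HTML page.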