What are Crawlers?
If you’re seeking a way to detect and verify crawlers, you probably already know what they are. Nonetheless, crawlers (sometimes called spiders) are computer programs (bots) that crawl the web. In other words, they visit webpages, find links to further pages, and visit them, too. Often they map the content they find for later use in search (indexing), or help developers diagnose issues with their websites.
Why Would Anyone Want to Detect Them?
To handle crawler traffic differently from human traffic, for example when analysing server logs, you first have to know whether a request was made by a real user or by a crawler.
User Agent detection – Hello, my Name is Googlebot
When you’re browsing the web, you might sometimes feel anonymous. Your browser, however, never does. Every request it makes has to be signed with its name, called User Agent. For example, that’s the User Agent of a Chrome browser:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36
Bots also have unique User Agents, for example the following name belongs to the desktop version of Googlebot:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Couldn’t they Just Lie?
Yes, indeed. A true Googlebot will not be deceptive, and will introduce itself with its true name. However, other bots, some of them harmful, may introduce themselves with Googlebot’s name. Browsers can also change their User Agent. For example, you can fake Googlebot hits using the Google Chrome Inspect tool. We SEOs also often visit pages, or even crawl whole sites, introducing ourselves as Googlebot for diagnostic purposes. However, if you’re looking for a way to detect all requests from a specific bot, and you don’t mind including requests from sources that lie about their identity, the User Agent detection method is the easiest and fastest to implement.
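As a rough illustration of User Agent detection, here is a minimal Python sketch. The function name `claimed_crawler` and the short token list are my own assumptions for the example, not an official or complete list. Keep in mind that it only tells you who the request claims to be.

```python
import re

# Tokens that some well-known crawlers include in their User Agent.
# Illustrative only; extend the tuple for the bots you care about.
CRAWLER_TOKENS = ("Googlebot", "bingbot", "DuckDuckBot")

_CRAWLER_RE = re.compile(
    "|".join(re.escape(token) for token in CRAWLER_TOKENS),
    re.IGNORECASE,
)


def claimed_crawler(user_agent):
    """Return the crawler token found in the User Agent, or None.

    This reports what the request *claims* to be. It does not verify it.
    """
    match = _CRAWLER_RE.search(user_agent or "")
    return match.group(0) if match else None


googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(claimed_crawler(googlebot_ua))
```

A request from the Chrome User Agent shown earlier would return `None`, while anything containing one of the tokens is reported as that crawler, whether it is telling the truth or not.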
If you need to verify a request’s source properly, you need to check the IP address from which the request was made. Lying about that is difficult. One can use a proxy server to hide the true IP, but that will reveal the proxy’s IP, which can be identified. If you are able to identify requests that originate from the crawler’s IP range, you are set. Some crawlers provide IP lists or ranges for you to use, but most of them, including Googlebot, don’t. And there are good reasons not to. Nonetheless, they provide a way to validate the request IP. Before I explain how to do it, let’s backtrack a little bit and explore scenarios in which you should validate crawler requests.
- The first scenario that we’ll explore is server log analysis. You surely don’t want that pesky scraper that visited your site to show up as Googlebot in your logs. Imagine that for some reason part of your site is not indexed because it’s blocked in robots.txt, but your logs show hits to that part made by a scraper that doesn’t give a damn about robots.txt. How are you going to establish whether the true Googlebot was able to access these pages, if you don’t filter that scraper out?
Ok, now let’s get to the meat.
As stated above, some popular search engine crawlers provide static IP lists or ranges. I’ll list some here.
DuckDuckGo:
Source: https://duckduckgo.com/duckduckbot
Ask.com:
Source: https://www.distilnetworks.com/bot-directory/bot/teoma/
Twitter and Facebook let you download their current IP lists by running the following Bash commands.
Facebook (AS32934):
whois -h whois.radb.net -- '-i origin AS32934' | grep ^route
Twitter (AS13414):
whois -h whois.radb.net -- '-i origin AS13414' | grep ^route
Bash is a command-line shell found on Linux and other Unix-like systems, which you can run on Windows using Cygwin.
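Once you have such a list, checking a request IP against it is straightforward. Here is a minimal Python sketch using the standard `ipaddress` module. The helper names and the sample `route:` lines are assumptions made for illustration; the addresses are reserved documentation ranges, not real crawler ranges.

```python
import ipaddress


def parse_route_lines(whois_output):
    """Collect the networks from 'route:' and 'route6:' lines."""
    networks = []
    for line in whois_output.splitlines():
        if line.startswith(("route:", "route6:")):
            cidr = line.split(":", 1)[1].strip()
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_networks(ip, networks):
    """Return True if the IP address falls inside any of the networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


# Placeholder output in the shape produced by the whois commands above.
SAMPLE_OUTPUT = """\
route:      192.0.2.0/24
route6:     2001:db8::/32
"""

networks = parse_route_lines(SAMPLE_OUTPUT)
print(ip_in_networks("192.0.2.55", networks))
print(ip_in_networks("198.51.100.1", networks))
```

In practice you would feed the real command output into `parse_route_lines` and refresh it regularly, since published ranges can change.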
Googlebot Verification – DNS Lookup
For bots that don’t provide official IP lists, you’ll have to perform a DNS lookup in order to check their origin. A DNS lookup connects a domain name to an IP address. As an example I’ll show you how to detect Googlebot, but the procedure for other crawlers is identical. In the case of bot verification you start with the request’s IP address and try to determine its origin domain. The first step in the process is called a reverse DNS lookup, in which you ask DNS for the domain name registered for that IP address. In the Windows Command Prompt you use the nslookup command; on Linux the equivalent command is host.
Run the nslookup command with the request IP and read the domain name. It has to end with the correct domain. The correct domain for Googlebot is .googlebot.com. It’s not enough to search the name for that string. To ensure proper verification, it has to be at the very end! For example, a domain named googlebot.com.imascam.se definitely doesn’t belong to a valid Googlebot (I’ve just made it up).
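The "very end" rule is easy to get wrong with a plain substring search, so here is a small Python sketch of the check. The function name `hostname_matches` is my own; the only suffix used is the `.googlebot.com` domain mentioned above.

```python
def hostname_matches(hostname, allowed_suffixes=(".googlebot.com",)):
    """Return True only if the hostname ENDS with an allowed suffix."""
    # DNS tools sometimes print a trailing dot; names are case-insensitive.
    host = hostname.rstrip(".").lower()
    return host.endswith(tuple(allowed_suffixes))


print(hostname_matches("crawl-66-249-66-1.googlebot.com"))
print(hostname_matches("googlebot.com.imascam.se"))
```

Because the suffix starts with a dot, a name such as fakegooglebot.com is rejected too.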
How to be 100% Sure?
There is a way to cheat this method. Whoever controls an IP address also controls its reverse DNS record, so the owner of a scam server can make it answer with any name they like, including a valid-looking Googlebot domain. In this case, if you ask for the server’s name, you’ll get the proper Googlebot domain! In order to rule that possibility out, you have to ask the domain name for its IP address. You can do that using the very same command, but this time with the domain’s name instead of the IP address.
If the IP address from the response matches the IP of the request, you’re set. You’ve validated a true Googlebot! Here’s a list of popular crawlers’ domains:
|Service Name||Domain name|
A small bonus: in the case of Bing, you can verify the IP directly on this page but you cannot automate the verification process, as it’s human-only.
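The two-step check described above (reverse lookup, then forward lookup) can be automated. Below is a minimal Python sketch using the standard `socket` module. The function name `verify_crawler_ip` and the injectable `reverse_lookup` / `forward_lookup` parameters are assumptions I added so the logic can be tried without touching the network. Note that `socket.gethostbyname_ex` only handles IPv4, so IPv6 requests would need a different forward lookup.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes=(".googlebot.com",),
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Reverse lookup, suffix check, then forward lookup back to the IP."""
    try:
        # Step 1: ask DNS which hostname is registered for this IP.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False

    # Step 2: the hostname has to END with an allowed domain.
    if not hostname.rstrip(".").lower().endswith(tuple(allowed_suffixes)):
        return False

    try:
        # Step 3: ask DNS which IPv4 addresses the hostname resolves to.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False

    # Verified only if the forward lookup leads back to the request IP.
    return ip in addresses


# Offline demonstration with fake resolvers and documentation addresses.
def fake_reverse(ip):
    return ("crawl-192-0-2-1.googlebot.com", [], [ip])


def fake_forward(hostname):
    return (hostname, [], ["192.0.2.1"])


print(verify_crawler_ip("192.0.2.1", reverse_lookup=fake_reverse,
                        forward_lookup=fake_forward))
print(verify_crawler_ip("203.0.113.9", reverse_lookup=fake_reverse,
                        forward_lookup=fake_forward))
```

The second call models the cheat described earlier: the reverse record claims to be Googlebot, but the forward lookup points somewhere else, so the request is rejected.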
At this point you’re probably asking yourself why Google hasn’t published their IP list like Facebook did. The answer is simple: their IP ranges may change in the future. A copy of such a list would surely survive in some server configurations long after it went out of date, making those servers vulnerable to deception.
Nonetheless, you shouldn’t use the lookup method for every request! That will kill your Time to First Byte (TTFB), and ultimately slow down your website. What you want to do instead is to create a temporary IP whitelist. The basic idea is that when you get a request with Googlebot’s user agent, you check your whitelist first. If the IP is on the list, you know it’s a valid Googlebot. If the request comes from an IP address that’s not on the whitelist, you do the nslookup. If the address is verified positively, it enters the whitelist.
Keep in mind that the whitelist is temporary. You should periodically remove or re-check all the IP addresses. If you’re getting a lot of false requests, you might also want to keep a blacklist to rule out such requests without doing the DNS lookup. Below you’ll find a simple diagram that represents the idea described above.
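Here is one way the temporary whitelist (and blacklist) could look in Python. It is a sketch under my own assumptions: the class name `CrawlerIpCache`, the 24-hour default lifetime, and the in-memory dictionary are all illustrative choices. The `verify` argument is any function that takes an IP and returns True or False, such as a DNS lookup check.

```python
import time


class CrawlerIpCache:
    """Remember verification results per IP for a limited time."""

    def __init__(self, verify, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self._verify = verify        # the slow check, e.g. a DNS lookup
        self._ttl = ttl_seconds      # how long a result stays trusted
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # ip -> (is_valid, expires_at)

    def is_verified(self, ip):
        now = self._clock()
        cached = self._entries.get(ip)
        if cached is not None and cached[1] > now:
            # Fresh entry: True acts as whitelist, False as blacklist.
            return cached[0]
        # Unknown or expired: run the slow check and store the result.
        result = bool(self._verify(ip))
        self._entries[ip] = (result, now + self._ttl)
        return result


# Demonstration with a stand-in verifier that counts how often it runs.
lookups = []


def slow_verify(ip):
    lookups.append(ip)
    return ip == "192.0.2.1"


cache = CrawlerIpCache(slow_verify, ttl_seconds=60)
cache.is_verified("192.0.2.1")
cache.is_verified("192.0.2.1")
print(len(lookups))  # the slow check ran only once
```

Storing negative results alongside positive ones gives you the blacklist for free. A dictionary like this lives in a single process and is never pruned, so a real setup would probably use a shared store with built-in expiry.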
Before you jump into implementing these solutions, ask yourself what you really need. If you need to detect bots and don’t mind false positives, then go for the simplest User Agent detection. However, when you’re looking for certainty, you’ll need to develop the DNS lookup procedure. While doing so, keep in mind that you really want to avoid increasing your server response time, which DNS lookup will certainly do. Implement some method of caching the lookup results, but don’t hold them for too long, because IP addresses of search engine bots may change.