Author: Glenn Gabe, SEO consultant at the agency G-Squared Interactive. He has worked in digital marketing for more than 20 years.
Over the past several years, we have repeatedly run into an interesting robots.txt situation that can be tricky for site owners. After identifying the problem and discussing how to solve it with clients, we found that many people do not even know it can happen. And since it involves robots.txt files, it can have a big impact on SEO.
We are referring to robots.txt files being handled per subdomain and protocol. In other words, a site can serve multiple robots.txt files at the same time: one on www and one on non-www, or one on https www and one on http www.
Since Google processes each of them separately, a site can end up sending completely different instructions about how it should be crawled.
In this article we look at two real examples of sites that ran into this problem. We also look at Google's robots.txt documentation and at how to find these extra files.
Google's approach to handling robots.txt files
As mentioned above, Google handles robots.txt files per subdomain and protocol. For example, a site may have one robots.txt file on the www version and a completely different one on the non-www version. We have seen this several times on client sites, and recently ran into it again.
Beyond www and non-www, a site can also have a robots.txt file on both the https and http versions of a subdomain. So there can be several robots.txt files with different instructions for crawlers.
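Google's documentation clearly explains how robots.txt files are handled: a file applies only to the protocol, host, and port it is served from. For example, the file at http://example.com/robots.txt does not apply to https://example.com/ or to http://www.example.com/.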
This approach can definitely cause problems, because Googlebot may fetch different robots.txt files for the same site and crawl each version differently. Site owners may believe Googlebot is following one set of instructions while it is also picking up a second set during other crawls.
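To make the scoping rule concrete, here is a minimal Python sketch that works out which robots.txt file governs a given URL. The function names are ours and purely illustrative; the rule it encodes (protocol, host, and port together define the scope, with default ports treated as equivalent to no port) follows our reading of Google's documentation, not Google's actual implementation.

```python
from urllib.parse import urlsplit

# Default ports are treated as equivalent to "no port" for each scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL whose rules govern page_url."""
    parts = urlsplit(page_url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    return f"{scheme}://{netloc}/robots.txt"


def same_robots_scope(url_a: str, url_b: str) -> bool:
    """True when both URLs are governed by the same robots.txt file."""
    return robots_txt_url(url_a) == robots_txt_url(url_b)


if __name__ == "__main__":
    for url in (
        "http://example.com/page",
        "https://example.com/page",
        "http://www.example.com/page",
        "https://www.example.com/page",
    ):
        print(url, "->", robots_txt_url(url))
```

Run against the four common versions of a site, this prints four different robots.txt URLs, which is exactly why four different sets of instructions are possible.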
Below we cover two cases where we ran into this problem.
Case 1. Different robots.txt files with conflicting directives on the www and non-www versions
Recently, while running a crawl analysis and audit on a site, we noticed that some pages blocked in robots.txt were in fact being crawled and indexed. We know that Google fully obeys robots.txt crawling instructions, so this was an obvious red flag.
To be clear, we mean URLs that were being crawled and indexed normally even though the robots.txt instructions should have disallowed crawling. Google can also index URLs blocked by robots.txt without crawling them, but that is a different situation, which is covered below.
Checking the robots.txt file manually, we saw one set of instructions on the non-www version, and it was limited. We then manually checked the other versions of the site (by subdomain and protocol) to see whether there were any problems there.
And there it was: the www subdomain had a different robots.txt file. As you can guess, it contained different instructions.
The site was not properly redirecting the www version of robots.txt to the non-www version. As a result, Google could access both robots.txt files and found two different sets of crawling instructions.
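The effect of two conflicting files can be sketched with Python's standard urllib.robotparser module. The two file bodies below are invented for illustration (the client's real files are not reproduced here), and Python's parser is not Google's, so treat the result as an approximation of how a crawler would read each file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file contents, standing in for the two versions of the site.
NON_WWW_ROBOTS = """User-agent: *
Disallow: /private/
"""

WWW_ROBOTS = """User-agent: *
Disallow: /tmp/
"""


def parse_robots(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def conflicting_paths(files: dict, paths: list, agent: str = "Googlebot") -> dict:
    """Return the paths on which the given robots.txt files disagree."""
    parsers = {name: parse_robots(text) for name, text in files.items()}
    conflicts = {}
    for path in paths:
        verdicts = {name: p.can_fetch(agent, path) for name, p in parsers.items()}
        # More than one distinct verdict means the files contradict each other.
        if len(set(verdicts.values())) > 1:
            conflicts[path] = verdicts
    return conflicts


if __name__ == "__main__":
    files = {"non-www": NON_WWW_ROBOTS, "www": WWW_ROBOTS}
    print(conflicting_paths(files, ["/private/report.html", "/blog/"]))
```

With these sample files, /private/report.html is blocked on the non-www version but open on the www version, so a crawler reaching it through www is free to fetch it.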
Again, in our experience, many site owners do not know this can happen.
- A brief note about blocked pages that can be indexed
Earlier we mentioned that pages correctly blocked in robots.txt can still be indexed. They just will not be crawled.
We know this is a confusing topic for many site owners, but Google can definitely index pages that are blocked. For example, this can happen when Google sees inbound links pointing to those pages.
When this happens, Google indexes the URLs and notes in the search results that no information is available for these pages. They appear without a description.
But that is not the situation we are covering in this article. Google's robots.txt FAQ mentions the possibility of blocked URLs being indexed.
What about Search Console and robots.txt files?
Search Console has an excellent tool for debugging robots.txt files: the robots.txt Tester.
Unfortunately, many site owners find this tool hard to locate. There is no link to it in the new Search Console, but it can be reached from the Help Center.
Using this tool, you can view the previous robots.txt files that Google has seen. As you can guess, we saw both robots.txt files for the site we were analyzing. So yes, Google really was seeing the second file.
Having identified the problem, we quickly sent the client all the necessary information, screenshots, and so on. We also told them to remove the second robots.txt file and set up a 301 redirect from the www version to the non-www version. Now, when Google visits the site and checks robots.txt, it will see the correct set of instructions.
However, the site still had URLs that were indexed because of the mixed directives. So our client is now opening those URLs up for crawling while making sure they are blocked from indexing with the meta robots tag.
Once the total number of those URLs in GSC drops, we will add back a correctly implemented disallow directive to block that area.
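The order of operations matters here: a crawler can only see a meta robots tag on a page it is allowed to fetch, which is why the disallow rule is lifted first. As a rough way to confirm that the tag is actually present on the affected pages, here is a small sketch using Python's built-in html.parser. The class and function names are ours, and it only inspects meta tags, not the X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            # The content attribute is a comma-separated list of directives.
            for directive in attr.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """True when the page's meta robots tags include noindex."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex(page))
```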
Case 2. Different robots.txt files for http and https
Several years ago, a webmaster contacted us about a drop in organic search traffic to a site for no apparent reason.
After digging in, we decided to check the different protocol versions of the site (including the robots.txt file for each version).
When we tried to check the https version of the robots.txt file, we first had to click through a security warning in Chrome. Once we did, we saw the second robots.txt file, which blocked the entire site from being crawled.
In the https version of the robots.txt file, the whole site was blocked with the directive Disallow: /.
The site had other problems as well, but having multiple robots.txt files, one of which blocks all crawling, is far from optimal.
The https version of the robots.txt file (hidden behind a security warning in Chrome):
Site problems shown in Search Console for the https property:
Fetching the https version of the site as Googlebot shows that it is blocked:
As in the first case, the site owner quickly fixed the problem (which was no easy task given their CMS).
This is another good example of how Google handles robots.txt files and of the danger of having multiple files on different subdomains or protocols.
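A manual check like the one in this case can be partly scripted. The sketch below lists the four robots.txt locations for a bare domain and flags a file that shuts a crawler out of the home page, as a blanket Disallow: / does. The helper names are hypothetical, example.com is a placeholder, and the verdict comes from Python's urllib.robotparser, which may not match Google's parser on more complex files.

```python
from urllib.robotparser import RobotFileParser


def robots_variants(domain: str) -> list:
    """List the robots.txt URLs for http/https and www/non-www of a bare domain."""
    return [
        f"{scheme}://{prefix}{domain}/robots.txt"
        for scheme in ("http", "https")
        for prefix in ("", "www.")
    ]


def blocks_home_page(robots_text: str, agent: str = "Googlebot") -> bool:
    """True when the file disallows "/" for the agent, as "Disallow: /" does."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(agent, "/")


if __name__ == "__main__":
    for url in robots_variants("example.com"):
        print(url)
    # Sample file body similar to the https file described in this case.
    print(blocks_home_page("User-agent: *\nDisallow: /\n"))
```

Fetching each of the four URLs and passing the response body to blocks_home_page would have surfaced the https file in this case without needing to click through the browser warning.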
How to find multiple robots.txt files: tools
Besides manually checking the robots.txt files for each subdomain and protocol, there are several tools that can help.
They can also show which robots.txt files were previously served on the site.
- The robots.txt Tester in Search Console
This tool, mentioned above, lets you see the current robots.txt file and the previous versions that Google has processed.
It also works as a "sandbox" where you can test new directives.
Overall, it is a great tool that Google has inexplicably tucked away in a corner.
- Wayback Machine
The Internet Archive can also help in this situation. We have already covered its use in a previous column on Search Engine Land.
However, the Wayback Machine is not just for checking standard web pages. It also lets you view the robots.txt files that were previously on a site.
So it is a great way to track down previous versions of robots.txt.
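For sites with a long history, browsing captures by hand gets tedious. The sketch below builds a query URL for the Wayback Machine's CDX index and a link to an individual capture. The endpoint and the url, output, collapse, and limit parameters reflect our understanding of that service and should be verified against the Internet Archive's own documentation; the function names are ours.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX index; verify before relying on it.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(robots_url: str, limit: int = 50) -> str:
    """Build a query listing captures of robots_url, one row per distinct content."""
    params = {
        "url": robots_url,
        "output": "json",
        "collapse": "digest",  # skip consecutive captures with identical content
        "limit": limit,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_url(timestamp: str, original: str) -> str:
    """Build the link to one archived capture, given its 14-digit timestamp."""
    return f"https://web.archive.org/web/{timestamp}/{original}"


if __name__ == "__main__":
    print(cdx_query_url("example.com/robots.txt", limit=10))
    print(snapshot_url("20200101000000", "http://example.com/robots.txt"))
```

Querying each protocol and subdomain version separately is the point: the archive may hold a different history for each one.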
Solution: 301 redirects
To avoid robots.txt problems across protocols or subdomains, 301 redirect the robots.txt file to the preferred version.
For example, if the site runs on www, redirect the robots.txt file on non-www to the www version.
As for https and http, you should already have this redirect in place. Just make sure robots.txt is redirected to the preferred protocol and subdomain version, and that all other URLs are correctly redirected to the preferred version as well.
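One way to confirm the redirects are in place is to request each non-preferred robots.txt URL without following redirects and inspect the status code and Location header. The sketch below does that with Python's http.client. The function names and the example.com URLs are placeholders, and the check is deliberately strict: it expects a single 301 hop straight to the preferred file, so a multi-hop chain would be reported for review.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def head_no_redirect(url: str, timeout: float = 10.0):
    """Send a HEAD request and return (status, Location) without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def check_robots_redirects(canonical: str, variants: list, fetch=head_no_redirect) -> dict:
    """Return variants that do not 301 directly to the canonical robots.txt URL."""
    problems = {}
    for url in variants:
        if url == canonical:
            continue
        status, location = fetch(url)
        # Resolve relative Location headers against the requested URL.
        target = urljoin(url, location) if location else None
        if status != 301 or target != canonical:
            problems[url] = (status, target)
    return problems


if __name__ == "__main__":
    preferred = "https://www.example.com/robots.txt"
    others = [
        "http://example.com/robots.txt",
        "http://www.example.com/robots.txt",
        "https://example.com/robots.txt",
    ]
    print(check_robots_redirects(preferred, others))
```

An empty result means every checked version points straight at the preferred file. Any entry in the result shows the status code and target that were returned instead.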
Other subdomains can have their own separate robots.txt files, which is perfectly normal. For example, you may have a forum on the subdomain forums.domain.com, and its instructions may differ from those for the www version.
In this article we are talking about www versus non-www and http versus https for the main site.
Summary: make sure the instructions in your robots.txt files match
Since robots.txt controls crawling, it is extremely important to understand how Google handles these files.
Some sites serve multiple robots.txt files with different instructions on separate subdomains and protocols. Depending on how Google crawls the site, it may find one file or the other, which can lead to problems.
Therefore, it is important to make sure that all of your robots.txt files contain consistent directives.