Google has been making some big announcements about the robots.txt file in recent weeks, but this is one of the biggest.
In case you weren't aware, robots.txt is 25 years old, and Google has finally decided to push for it to become a formal standard so that everyone knows the rules and follows them.
If you are interested, you can read the full list of changes; otherwise I will try to summarise it below.
- Robots.txt now accepts all URI-based protocols.
- Removed the “Requirements Language” section in this document because the language is Internet draft specific.
- For 5xx, if the robots.txt is unreachable for more than 30 days, the last cached copy of the robots.txt is used, or if unavailable, Google assumes that there are no crawl restrictions.
- Google treats unsuccessful requests or incomplete data as a server error.
- “Records” are now called “lines” or “rules”, as appropriate.
- Google doesn’t support the handling of elements with simple errors or typos (for example, “user agent” instead of “user-agent”).
- Google currently enforces a size limit of 500 kibibytes (KiB), and ignores content after that limit.
- Updated formal syntax to be valid Augmented Backus-Naur Form (ABNF) per RFC5234 and to cover for UTF-8 characters in the robots.txt.
- Updated the definition of “groups” to make it shorter and more to the point. Added an example for an empty group.
- Removed references to the deprecated Ajax Crawling Scheme.
While every change above is important, the following are the ones I am going to cover in a bit more detail.
Redirects
Google only follows five redirects before treating the robots.txt file as a 404. You shouldn't have redirect chains on your own site anyway, and a crawl with something like Screaming Frog will make them easy to spot.
However, if external links point to redirected URLs on your site, you might want to take a list and run them through Screaming Frog or a similar tool, or just check the response codes in your server logs.
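The five-redirect limit can be sketched as a small function. This is my interpretation of the rule, not Google's actual code; the `fetch` callable is a hypothetical stand-in for an HTTP request, returning a status code and a `Location` header, so the logic can be shown without hitting the network.

```python
MAX_REDIRECTS = 5  # per Google's documented limit for robots.txt

def resolve_robots(url, fetch, max_redirects=MAX_REDIRECTS):
    """Follow up to `max_redirects` redirects; past that, treat as a 404.

    `fetch(url)` is assumed to return (status_code, location_or_None).
    """
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        if status in (301, 302, 303, 307, 308) and location:
            url = location  # follow the redirect
            continue
        return status, url
    return 404, url  # chain too long: treated like a missing file
```

For example, a fake `fetch` backed by a dict of hops shows a three-hop chain resolving normally, while a six-hop chain comes back as a 404.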
5xx Robots File
This is the big one, and a few people asked for clarification on Twitter.
Basically, if Googlebot gets a 5xx response for the robots.txt file, it treats it as a sitewide disallow, which means Google won't crawl your site.
It will keep checking the robots.txt file, and once it is reachable again Google will resume crawling your site.
After 30 days it will fall back to the cached version of the robots.txt file. If no cached copy exists, Google assumes it is OK to crawl the entire site.
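The 5xx behaviour boils down to a decision tree, which can be sketched as below. This is my reading of the draft, not Google's implementation, and the return strings are just labels for the outcomes.

```python
def crawl_policy(status, days_unreachable, cached_rules=None):
    """Decide how to crawl based on the robots.txt fetch result.

    Assumed logic: a 5xx means full disallow for the first 30 days,
    then the cached copy is used, or no restrictions if none exists.
    """
    if 200 <= status < 300:
        return "use live robots.txt"
    if status >= 500:
        if days_unreachable <= 30:
            return "full disallow"          # site is not crawled at all
        if cached_rules is not None:
            return "use cached robots.txt"  # last good copy
        return "no restrictions"            # assume everything crawlable
    return "no restrictions"                # e.g. a 404: no rules apply
```

So a 503 on day 10 blocks the whole site, while the same 503 on day 45 falls back to the cache if one exists.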
Typos in the robots file
Another important one: if you have typos or incorrect directive names in your robots.txt file, Google will ignore those lines.
I will update this post with a full list of correct spellings, but it's quite easy to check.
It kind of makes sense: when coding a parser, if you had to account for every possible misspelling, you could be coding forever.
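To illustrate the point, here is a minimal parser that only recognises exactly-spelled directives and silently drops everything else. The directive list is mine and trimmed for the example; it is a sketch of the behaviour, not Google's parser.

```python
# Illustrative subset of directive names; a typo like "user agent"
# (with a space) will not match and the line is simply dropped.
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap"}

def parse_robots(text):
    """Return (directive, value) pairs, ignoring unrecognised lines."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments and whitespace
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        name = name.strip().lower()
        if name in KNOWN_DIRECTIVES:
            rules.append((name, value.strip()))
    return rules
```

Feeding it `"user agent: Googlebot"` yields nothing, while the correctly spelled `"User-Agent: *"` is kept, which is exactly the strictness described above.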
It's quite easy to test in the old Google Search Console, and I am hoping they migrate this feature across to the new version.
Size of the file
This won't affect many sites, but there is a limit to the size of the robots.txt file: 500 kibibytes (KiB). Google ignores any content after that limit.
No more support for noindex
One of the other big announcements is that Google is dropping support for the noindex directive in robots.txt.
While this wasn't used that often, it was a nice feature to have in our toolkit in case we needed it.
There are plenty of other ways to stop Google from indexing content.
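Two of the documented alternatives are the robots meta tag in the page `<head>` and the `X-Robots-Tag` HTTP response header. A quick sketch of both; the helper function and its name are mine, only the tag and header values come from Google's documentation.

```python
# Page-level alternative: goes inside the <head> of the HTML document.
META_NOINDEX = '<meta name="robots" content="noindex">'

def with_noindex_header(headers):
    """Return a copy of an HTTP header dict with X-Robots-Tag added.

    Useful for non-HTML resources (PDFs, images) where a meta tag
    isn't an option.
    """
    out = dict(headers)
    out["X-Robots-Tag"] = "noindex"
    return out
```

The header route is handy precisely because it works on files you can't edit, which is where robots.txt noindex used to fill the gap.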
While Google is trying to make this a standard, at the moment other search engines and crawlers don't necessarily support all of the above.
Even if Google does make it the standard across the web, that doesn't mean others have to follow it. So if you care about Bing rankings more than Google, you might want to ignore some of the above.
This is still evolving, and as more information comes out I will keep analysing and sharing news in our Facebook Group, or write more detailed articles when needed.