We all know that robots were created to make life easier (or to make us lazier!), so it stands to reason that the robots.txt file on your website is meant to make someone’s job easier, and that someone is Googlebot.
Getting your robots.txt file right, however, can also make your job easier if you’re looking to raise your site’s profile in the SERPs.
What this means, essentially, is that the robots.txt file is arguably the most important file on your website.
If you are already analysing your server logs (as you should be!), you’ll notice that robots.txt is probably the first or second most requested file by Googlebot.
Last year Google announced that they were formalising the ‘rules’ governing how Googlebot interacts with your site and your robots.txt file, making them more official. Below, I’ll explain more about these rules, but first we have to generate a robots.txt file – so let’s look at how to do so and what to include.
How to Generate A robots.txt File
One of the most common search queries is “how to make a robots txt file for my site”, and it’s a good thing to know. A robots.txt file is a plain text file located at the root of your domain.
What Should Be Included
The file itself need not be too complex. It needs to include two simple things:
Clear instructions to bots
Ideally, you’ll need to analyse your log files to see which bots are crawling your site, blocking any bots that you don’t want crawling it via the robots.txt file – I’ll explain how to do this further down this blog.
Then it’s time to choose what parts of your site you DON’T want them to crawl, excluding these at page, section or folder level – it’s entirely up to you.
You can use wildcard instructions in your robots file to stop them crawling certain sections.
Here’s an example of how to do so:
User-agent: *
Disallow: /do-not-crawl/
The above instruction would stop any compliant bot from crawling any of the pages in that folder. Note that the Disallow rule takes a path relative to your root, not a full URL, and a path rule already matches everything beneath it, so no trailing wildcard is needed. (You can call your folder something fancier if you like – keeping it simple works for me!)
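Wildcards also work within paths: Google and Bing both support * (match any sequence of characters) and $ (end of URL) in rules. As a quick sketch – the file type here is just an example – this would stop compliant bots crawling every PDF on the site:
User-agent: *
Disallow: /*.pdf$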
Sitemaps
Listing your sitemap URLs here is a good way to ensure that Google (and other search engines) can find your sitemap files.
I’d recommend including them ALL here – it’s good practice to do so.
For example:
Sitemap: https://onpage.rocks/sitemaps.xml
Note that the directive is Sitemap (singular), with the full URL on the same line.
There are, of course, several other ways to let Google know about your sitemaps. However, this one helps all search engines and other bots find your files.
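If you have more than one sitemap, just add one Sitemap line per file – for example (the second filename here is purely hypothetical; use your own):
Sitemap: https://onpage.rocks/sitemaps.xml
Sitemap: https://onpage.rocks/blog-sitemap.xml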
What Shouldn’t Be Included
noindex
You might already be aware that in the past, using noindex in your robots.txt file worked. For years, though, Google told us that this was not a good strategy. Some webmasters and SEOs simply didn’t listen, as it was working for them back then, but last year’s Google update made it clearer than ever that, going forward, it will not work as a way of keeping content out of the index. So don’t try it.
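If you do need to keep a page out of the index, the supported alternatives are a robots meta tag in the page’s HTML:
<meta name="robots" content="noindex">
or, for non-HTML files such as PDFs, an X-Robots-Tag HTTP response header:
X-Robots-Tag: noindex
One important catch: for either to be seen, the page must NOT be blocked in robots.txt – the bot has to crawl the page to read the instruction.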
Now we’ve gone through what you should and shouldn’t include in your robots.txt file, let’s get to the complex way to go about creating it… stay with me here…
Firstly, open any text editor on your machine.
Next, write the file.
That’s it.
It really IS that simple.
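To put it all together, here’s a minimal sketch of a complete robots.txt, using the example folder and sitemap from earlier (substitute your own paths):
User-agent: *
Disallow: /do-not-crawl/

Sitemap: https://onpage.rocks/sitemaps.xml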
Where To Upload A robots.txt File
What’s in a Name?
Where and how you upload your robots.txt file is crucial. But firstly, you need to ensure that it’s accurately named.
I’m going to be crystal clear about this. The file name you need is robots.txt
That’s it. Absolutely no margin for error.
It can’t be named robot.txt or Robots.txt – only robots.txt will do.
You’d be surprised at how easy it is to misspell such a simple filename when you’re not concentrating, so check and check again before you do anything further. This is the ONLY file name that bots will look for.
Location, Location, Location
Your robots.txt file must also be located in the root of your domain – for example, for this site it’s onpage.rocks/robots.txt
Let me be very clear again (although I’m beginning to sound a little bossy, it’s for your own good):
Do not store robots.txt in a subfolder
Some people try to store the file in a subfolder, e.g. onpage.rocks/files/robots.txt – but you’d be wasting your own time and Googlebot’s with this, as bots won’t find or use it there.
How To Check robots.txt
Using A robots.txt checker
Using a robots.txt checker can be a really handy way to make sure you’re doing everything right, and there are several great testers out there. Google offers a very good one within Google Search Console – but because it’s Google, it checks for Googlebot only. Another good one can be found here. Some of the actions you could take with such a checker include:
Check if a URL is blocked by robots.txt
The great thing is that the robots.txt file allows you to be very specific and block sections, pages or even whole folders. So, with most tools you can enter a URL, choose a bot, and the tool will tell you whether that bot is allowed to crawl it.
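You can also check this programmatically. Here’s a quick sketch using Python’s built-in urllib.robotparser module (the URLs are just this site’s examples from earlier – swap in your own):
from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt file
rp = RobotFileParser()
rp.set_url("https://onpage.rocks/robots.txt")
rp.read()

# Ask whether a given user agent may crawl a given URL
print(rp.can_fetch("Googlebot", "https://onpage.rocks/do-not-crawl/page.html"))
print(rp.can_fetch("*", "https://onpage.rocks/"))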
Block all search engines with robots.txt
Why would you want to block all search engines from crawling your site? After all, if they can’t crawl your site, they can’t index it.
This might seem like a stupid idea on the face of it, but bear with me.
There are some legitimate cases where you might not want bots crawling your site.
The most common reason is that the site in question is a staging/development site.
To do this, add the following to your robots.txt file:
User-agent: *
Disallow: /
FAQ about robots.txt
There are some questions that I hear often about robots.txt files so I thought I’d take a little time to go through them. You might already know some of this stuff, but some might come as a surprise.
Do Robots have to follow the instructions?
Most friendly bots will honour the robots.txt file. After all, it’s in their best interests to do so. However, not all bots are created equal. Scrapers and other potential black-hat-type bots often don’t even look at the robots.txt file, and simply ignore it if they do.
What happens if I don’t have a robots file?
Google has confirmed that if you don’t have a robots.txt file, they will assume it’s OK to crawl your entire site. This means they could end up crawling sections of the site you don’t want them to, which could waste crawl budget – and that’s not a good thing.
How to block Screaming Frog from crawling my site?
You might notice when checking your logs that you are getting crawled a lot by Screaming Frog. If it’s not you doing this, it’s probably your competition crawling your site – sneaky but true! Blocking Screaming Frog could free up server resources, so you might want to consider it.
Blocking the bot is easy:
User-agent: Screaming Frog SEO Spider
Disallow: /
(Don’t include a version number in the rule – match the tool’s user-agent token on its own, or the rule may stop working when the version changes.)
One piece of bad news, however, is that this is very easy to get around. Screaming Frog has an option to ignore the robots.txt file – and it’s quite easy to modify the user agent too.
Blocking Ahrefs, SEMRush or Moz from crawling your site
Again, this is pretty easy to do. All you need to do is find their user agents and block them. But, and it’s a big but, this only works if it’s the actual tools themselves doing the crawling (to find new backlinks) – in that case, they will honour it. If it’s competitors crawling you then, again, they can change the user agent to crawl your site, so you’d have to find and block them again.
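For example, here are rules using the user-agent tokens the three tools publish for their own crawlers – do double-check their documentation, as these can change (note that Moz’s crawler is called rogerbot):
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: rogerbot
Disallow: /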
Hopefully, this has answered some of the questions you may have had about how crucial it is to have a robots.txt file, how to make sure your file is found, and how to make sure it contains the information you need to help the bots crawl better.
After all, it’s in your site’s best interests to make life easier for the bots.