Crawl Budget – What on EARTH is it, and How Can You Make It Work for you?
Modern search engines discover new content by crawling the web. But, pre-Google, most search engines were basically just directories of websites added by webmasters – simple enough, eh?
Well, they were for a while, but for Google and other search engines, this wasn’t good enough. They wanted to crawl the web and show relevant results to the people that matter most – their users.
How Google Crawls Your Pages
Once Google has crawled a page, it then adds it to the queue to be rendered and analysed.
Once this has been done, Google determines whether it wants to index the content and, if so, analyses the page against 200+ ranking factors to determine where it will appear and for which keywords.
Sounds Fairly Simple Still, So What’s the Problem?
Most tech audits only crawl the website, analyse the crawl and completely ignore the log files. This is true of both custom audits and the usual SEO tools. For large websites, however, this can be a common (and pretty big) mistake, because it can mean your crawl budget isn’t optimised.
What is Crawl Budget? And Why Does it Matter?
Google’s Gary Illyes noted there’s no “official” definition of crawl budget in a Google Webmaster blog post:
“Recently, we’ve heard a number of definitions for ‘crawl budget’, however, we don’t have a single term that would describe everything that ‘crawl budget’ stands for externally”, he reveals.
He goes into detail and breaks it down into two key sections – which we’ll explain below:
Crawl Rate Limit
This, in its basic form, is the number of simultaneous parallel connections Googlebot can use to crawl your website. If the site responds quickly for a while, the limit goes up; if the site responds slowly, the limit decreases.
Webmasters can decrease the limit in Google Search Console, which can be useful if you are having server issues, but you cannot request higher limits directly from Google.
Crawl Demand
While Google’s mission is to “organise the world’s information”, it just doesn’t have the resources to crawl every page every single day – who would?
Gary Illyes revealed that Google wants to crawl popular URLs, so these are given priority. At the same time, Google doesn’t want its index to go stale, but content that rarely changes does get crawled less frequently.
What is my site’s crawl budget?
This is a common question. Most people look in Google Search Console for the answer, and while it does have a section on Crawl Stats, over the last 5 years I have never got that data to match up – which is kind of frustrating.
Later in this article, I will cover what the data holds, but if you really want to see your data you have got to look in your log files. This guide will tell you how to download your log files.
If you want to see the trend, however, you’ll want to chart the daily crawl by both mobile bot and desktop bot, ideally over an extended period. It takes a little work, but it’s certainly worth it.
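If you fancy scripting that chart yourself, here is a minimal sketch. It assumes a combined-format access log saved as access.log (the filename, the log format and the user-agent checks are all assumptions – check what your host actually produces) and simply counts Googlebot requests per day, split into smartphone and desktop crawls.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Matches the [day/month/year:time zone] stamp and the final quoted user agent
# in a combined-format log line (an assumption - adjust for your own format).
LOG_LINE = re.compile(r'\[(?P<date>[^:]+):[^\]]+\].*"(?P<agent>[^"]*)"\s*$')

daily = defaultdict(Counter)  # date -> {"mobile": n, "desktop": n}

with open("access.log", encoding="utf-8", errors="ignore") as handle:
    for line in handle:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        # Googlebot smartphone announces itself with a mobile device string
        bot = "mobile" if "Mobile" in match.group("agent") else "desktop"
        day = datetime.strptime(match.group("date"), "%d/%b/%Y").date()
        daily[day][bot] += 1

for day in sorted(daily):
    print(day, daily[day]["mobile"], daily[day]["desktop"])
```

Drop the output into a spreadsheet and you have your mobile versus desktop crawl trend for whatever period your logs cover.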
What impacts crawl budget?
Inefficient use of crawl budget can lead to the following issues:
- A decrease in your crawl budget
- Which will lead to a decrease in new pages and updated content being indexed
- Which ultimately leads to a decrease in rankings, traffic and revenue
Which, let’s face it – is exactly what we don’t want to happen.
But How Do You Know What’s Impacting Your Crawl Budget?
According to Gary’s own article, there are several factors, but a key one is having a lot of low-value URLs on your site. So, it is worth doing a content audit on your site and removing weak and old pages. There are other factors too, most of which can be found in Google Search Console as well as your log files. Let’s take a look at some of these.
Faceted Navigation and Session Identifiers
This usually affects e-commerce sites that let customers filter products to make browsing easier. (While this does amazing things for conversion, it can create a large number of weak, near-duplicate pages.)
This guide on setting up faceted navigation is good to follow, but if you don’t fancy going through it right now, it’s probably best to just block Googlebot from crawling these pages.
Onsite Duplicate content
One of the common reasons for this is having both the secure and non-secure versions of the site live. Having tags and categories with similar names on a blog-type site also creates very similar pages with almost duplicate content. It can happen with faceted navigation too.
Soft Errors
These are pages that return a 200 status code but are really dead pages. Top tip – you really don’t want ANY of these. They can be quite difficult to spot in the logs, but luckily Google highlights them in Google Search Console – saving you a fair bit of detective work.
Hacked Pages
Sounds obvious, but just to make it really clear – Google doesn’t want to show hacked websites in Google results. This is why keeping on top of website security is important.
Avoid Infinite Spaces
This is basically where a series of links never ends. While I haven’t come across many examples of this in the last 10 years, Google gave a great one: a site with a calendar on it, with a “Next Month” link that never ends. We may love the idea that time goes on forever, but Googlebot won’t, so this can hit your crawl budget hard.
Low Quality and Spam Content
As mentioned above, Google doesn’t have unlimited resources – even though some of us think it must have – and its aim is to show the best results. If your site has a lot of thin or spammy content, then Google will lose trust in your site and crawl less frequently.
How can I audit my website’s crawl budget?
As mentioned above, you need to get access to your server logs. Once they have been converted into a readable format, you need to analyse the data.
There are several tools on the market, but our tool starts at just $9.99 a month and gives you all the basic information you could need to find out how Googlebot is getting on with crawling your site.
The things you should be looking for are:
- Which are the most visited pages
- Which are the least visited pages (or pages with no visits)
- Errors – what errors are they finding?
- Orphan pages
- Trend of activity
Once these have been identified, you need to start fixing the errors and encouraging Google to visit more of the website.
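To give a rough idea of what that analysis can look like in practice, here is a minimal sketch. It assumes you have already exported the Googlebot lines of your logs to a CSV with url and status columns, plus a plain list of URLs from a crawl of the site (the filenames and column names are assumptions). It surfaces the most visited pages, pages Googlebot never touches, error responses and orphan pages.

```python
import csv
from collections import Counter

hits = Counter()    # how often Googlebot requested each URL
errors = Counter()  # URLs that returned a 4xx/5xx response

# Assumed export: one row per Googlebot request, with "url" and "status" columns
with open("googlebot_hits.csv", newline="", encoding="utf-8") as handle:
    for row in csv.DictReader(handle):
        hits[row["url"]] += 1
        if row["status"].startswith(("4", "5")):
            errors[row["url"]] += 1

# Assumed crawl export: one URL per line from a crawl of the live site
with open("crawl_urls.txt", encoding="utf-8") as handle:
    crawled = {line.strip() for line in handle if line.strip()}

print("Most visited:", hits.most_common(10))
print("Never visited:", sorted(crawled - set(hits))[:10])
print("Orphans (requested by Googlebot but not in the crawl):", sorted(set(hits) - crawled)[:10])
print("Most common error URLs:", errors.most_common(10))
```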
How can I increase my website’s crawl budget?
So, by now, hopefully, you’ll understand what crawl budget is and what impacts it. But now to the question you really want answered. I say this because, when I speak at conferences, the next question is always, always: ‘How can I increase my crawl budget?’
Gary covered some of this himself. But let me cover some more, as there are quite a few things you can do on-page and off-page to boost your crawl budget.
Reduce Errors
I covered finding these errors above, but once you have found them, they need fixing – and pretty urgently too. So, let’s put down that coffee and get to work: you certainly don’t want Googlebot hitting errors, as it’s a complete waste of resources.
Some errors you will find via a crawl, and you can fix the internal links behind them. But if Google is following broken external links into your site, these will only show up in your log files. Don’t just rely on crawling your own site.
One place that often gets overlooked is broken links in sitemaps. The sitemap is an important file for bots, and broken or redirecting URLs in it can waste a lot of crawl budget.
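A quick way to catch those is to script a check of the sitemap itself. The sketch below is a minimal example, assuming the requests library is installed and that your sitemap lives at the URL shown (an assumption – swap in your own); it flags every entry that doesn’t return a clean 200, including entries that redirect.

```python
import xml.etree.ElementTree as ET
import requests  # third-party: pip install requests

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # assumption: use your own sitemap URL
NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Fetch the sitemap and pull out every <loc> entry
sitemap = requests.get(SITEMAP_URL, timeout=10)
urls = [loc.text.strip() for loc in ET.fromstring(sitemap.content).findall(".//sm:loc", NAMESPACE)]

for url in urls:
    # HEAD request without following redirects, so 3xx entries get flagged too
    response = requests.head(url, allow_redirects=False, timeout=10)
    if response.status_code != 200:
        print(response.status_code, url)
```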
Use your robots file
This is a really important file on your site. It’s there to give instructions to bots about what they can and cannot do on your website – think of it as a list of rules to make sure a bot behaves in a manner you’d like it to.
A small minority of sites I audit don’t have a file at all, or just use the default file that is installed with WordPress. This is often simply not good enough.
Actions
Next, analyse your logs, and if you can see Googlebot is crawling pages which are not important, such as the following, consider disallowing them in your robots file:
- Faceted pages
- Search Pages
- Forum pages
- Dynamic URLs
It’s worth noting here that I used to recommend excluding PPC landing pages from Googlebot, as there is a specific Google Ads bot. However, shopping pages are actually crawled by the normal crawler, so you do need to allow Google to crawl those pages.
Check out our handy guide on editing your robots file, and if you want to sanity-check your rules, see the sketch below.
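Python’s built-in urllib.robotparser gives you a rough answer on what Googlebot is and isn’t allowed to fetch. The rules and URLs below are purely illustrative assumptions; note that the standard-library parser only does simple prefix matching, so it won’t fully replicate Googlebot’s wildcard handling.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only - tailor the Disallow lines to your own low-value sections
ROBOTS_TXT = """
User-agent: Googlebot
Disallow: /search/
Disallow: /forum/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)

for url in ("https://www.example.com/products/red-shoes",
            "https://www.example.com/search/?q=shoes",
            "https://www.example.com/forum/thread-123"):
    verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
    print(verdict, url)
```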
Decrease your site load time
Firstly, this affects more than just your crawl budget – it also affects your Quality Score for PPC AND the conversion rate for users – but slow sites REALLY impact crawl budget.
Google themselves have said this is a major factor in reducing crawl budget.
Actions
- Analyse page speed at the individual page level (Google doesn’t use a site-wide score; it measures speed at URL level), as in the sketch below
- Identify slow-loading pages and templates and either fix them yourself or hire a developer
- Look at your server set up
- Upgrade your server to support HTTP/2
For more information on page speed, check out guides on improving your site speed as well as great tools for measuring speed.
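As a rough first pass at the page-level analysis, the sketch below times a handful of URLs (the list and the requests dependency are assumptions) and prints how long each one takes to respond. It only measures the HTML response, not full rendering, so treat it as a way to spot obviously slow templates rather than a replacement for proper speed tooling.

```python
import time
import requests  # third-party: pip install requests

# Assumed list of representative template URLs - replace with your own
URLS = [
    "https://www.example.com/",
    "https://www.example.com/category/shoes/",
    "https://www.example.com/product/red-shoes/",
]

for url in URLS:
    start = time.perf_counter()
    response = requests.get(url, timeout=30)
    elapsed = time.perf_counter() - start
    # Time to fetch the HTML only; images, scripts and rendering are not included
    print(f"{elapsed:.2f}s  {response.status_code}  {url}")
```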
Reduce redirect chains
These show up as a series of 3xx (301 or 302) response codes, and they are actually one of the worst culprits for wasting crawl budget. Google will hit each URL in the chain and use part of your site’s allotted allowance, and if it keeps constantly hitting these chains, it will reduce your crawl budget and crawl less frequently.
These can be spotted in a crawl of your site and also your logs. It’s worth checking the logs as the problem could be an old redirect chain you have removed from the site but that is still happening, and you wouldn’t be any wiser without looking in the log files.
Actions
- Regularly check your logs and crawls for redirect chains (strings of 301/302 responses) and make fixing them a high priority (a quick way to spot them is sketched below)
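Here is a minimal sketch of that check, assuming the requests library is installed and using a couple of made-up URLs as placeholders. It follows redirects one hop at a time and prints any chain with more than one hop.

```python
from urllib.parse import urljoin
import requests  # third-party: pip install requests

def redirect_chain(url, max_hops=10):
    """Follow redirects one hop at a time and return the full chain of URLs."""
    chain = [url]
    for _ in range(max_hops):
        response = requests.get(chain[-1], allow_redirects=False, timeout=10)
        if response.status_code not in (301, 302, 307, 308):
            break
        # Location headers can be relative, so resolve against the current URL
        chain.append(urljoin(chain[-1], response.headers["Location"]))
    return chain

# Placeholder URLs - in practice, feed in redirecting URLs pulled from your logs
for url in ("https://www.example.com/old-page", "http://www.example.com/"):
    chain = redirect_chain(url)
    if len(chain) > 2:  # more than one redirect hop means a chain worth fixing
        print(" -> ".join(chain))
```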
Build your authority
Simply put – get more backlinks to your site. If Google sees your site is more popular and gaining natural links from across the web, it’s going to spend more time crawling your site.
Create shareable content
As Google mentioned, they will crawl more popular pages, so if your content is being shared this is a good sign. One handy reminder, though – as well as creating shareable content, make it easy for people to share by including Facebook and Twitter share buttons.
In Summary:
Server log auditing is rarely carried out by SEOs or marketers, which means a lot of businesses are potentially missing out on getting their content indexed sooner, simply because nobody is looking at the basics.
If you would like to analyse the basics, check out our analyser – we think it’s great, but don’t just take our word for it!
Or, if you would prefer to have a more in-depth, bespoke audit completed, please get in touch.
Some commonly asked questions
Where are server logs stored?
A lot of people are unsure of where to find the logs. Each site setup is slightly different, so we should mention that this is more generic advice. If you want a specific answer, drop me a message.
Server logs are usually stored in the Public folder on your server under a subfolder called logs.
What are web server logs?
Every time someone or something (i.e. bots) requests a page or resource from your website, this request is stored. These are then stored in a file and usually batched each night into a ‘day’ file.
How to read server logs?
This is tricky. If you are taking them directly from the server, they come in a raw format that isn’t easy to work with in most common programs. However, there are plenty of great tools out there (see here for what we think is a pretty great one) which can convert your log files into a CSV file for you to audit.
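If you would rather do the conversion yourself, here is a minimal sketch that turns a combined-format access log into a CSV you can open in a spreadsheet. The filename and the log format are assumptions; hosts vary, so check a line of your own file against the pattern first.

```python
import csv
import re

# A combined-format line looks roughly like:
# 66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /page HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; ...)"
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

with open("access.log", encoding="utf-8", errors="ignore") as logs, \
        open("access.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["ip", "time", "request", "status", "size", "referrer", "agent"])
    for line in logs:
        match = PATTERN.match(line)
        if match:  # quietly skip lines in a different format
            writer.writerow(match.groups())
```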
How to get server logs?
This is a common question and it depends on your setup. If you have a full development team and no access to the server you may need to speak with them. But, if you are a one-man-band and do everything then you will have to log onto your host to retrieve the files – and then get yourself another coffee – you’ve earned it.
Server logs in cPanel?
A lot of websites use cPanel, and if this is your setup, then getting the logs has never been easier. Read this handy guide on getting your log files from cPanel.
Server Logs and GDPR?
This is a serious matter and you should contact a local specialist to help you, but generally speaking, server logs do fall under GDPR because the logs contain the users’ IP addresses.
Some logging settings allow you to remove the IP address, which could help, but I have had to change a few processes since GDPR was introduced to make Onpage Rocks compliant – and so might you – it’s never worth falling foul of the law.
Server Logs vs Error Logs?
One contains every request (server logs) and the other (error logs) just records when an error response was returned.
I personally only look at the server logs, as they contain everything, but if you have a large file and are only interested in the errors then the error log will give you that data.
Server Logs vs Google Analytics?
These two tools offer completely different sets of measurements and both should be used. It can be any analytics platform, but GA is the most widely used.
Google Analytics gives you much more detailed information about what users are doing on your site and about who they are. But the logs will give you much more detailed information on what bots are doing on your website; something analytics platforms generally can’t tell you.
How long are server logs stored for?
How long logs are kept depends on your host, but 90 days is the amount of data I like to trend when analysing logs. If you’re looking for errors, the past 30 days is key; it can be pointless looking at errors from 90 days ago as (fingers crossed) you have already fixed them. For trending different bot activity, though, I like to use the previous 90 days.