
Crawl Stats: The Average Crawl Response & Purposes for E-Commerce

There are plenty of metrics that search engine optimization (SEO) experts use to gauge website performance.

These metrics, such as organic traffic and bounce rate, can be ranking factors for search engine results pages (SERPs). That’s only the case, however, if these pages are being properly crawled, indexed, and ranked.

So, how can you be sure that’s even the case? With crawl stats.

In this post, I’ll pull back the curtain on how crawl stats function. I’ll cover how crawlbots are crawling your site and, more importantly, how your site is responding. With this information, you can then take steps to improve crawlbot interactions for better indexing and ranking opportunities.

Crawl Response Key Findings

Crawl response refers to how websites respond to crawlbots.

Web crawlers, or crawlbots, analyze the robots.txt file and XML sitemap to understand which pages to crawl and index.

NP Digital analyzed three e-commerce clients (Clients A, B, and C) using the Google Search Console (GSC) Crawl Stats report.

OK (200) status URLs dominate, followed by 301 redirects.

On average, HTML accounts for 50% of crawled file types, and JavaScript for 10%.

Average purpose breakdown: 33% discovery, 67% refresh.

We recommend these best practices based on this analysis:

Reduce 404 errors by creating appropriate redirects.

Choose the correct redirect type (temporary or permanent) and avoid redirect chains.

Evaluate the necessity of JavaScript file types for better crawl performance.

Use crawl purpose percentages to ensure effective indexing after website changes.

What Is Crawl Response and What Is Its Purpose?

As an SEO expert, you likely know the basics of website crawling, indexing, and ranking. But have you ever wondered how websites respond to crawlbots? This is known as the crawl response.

More specifically, a crawl response is the response that a web crawler, or crawlbot, receives from any given URL on your website. A crawlbot typically starts with a website's robots.txt file, which usually references the XML sitemap. The robots.txt directives tell the crawler which pages should be crawled and indexed versus which should not, while the sitemap lays out all of the website's pages. From there, the crawler visits a page, analyzes it, and finds new pages via hyperlinks.
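To make that concrete, here's a minimal sketch of how a crawler might read a site's robots.txt file and find its sitemap, using Python's standard urllib.robotparser module. The domain is a placeholder, not one of the clients discussed below.

```python
# A minimal sketch of how a crawler reads robots.txt and finds the sitemap,
# using Python's standard urllib.robotparser. example.com is a placeholder.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Which sitemap(s) does robots.txt reference? (available in Python 3.8+)
print(rp.site_maps())

# Is a generic crawler allowed to fetch this URL?
print(rp.can_fetch("*", "https://www.example.com/products/widget"))
```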

When the crawlbot requests a page from your web server, the server "responds" in one of a few ways:

OK (200): This indicates the URL was fetched successfully and as expected.

Moved permanently (301): This indicates the URL was permanently redirected to a new URL.

Moved temporarily (302): This indicates the URL was temporarily redirected to a new URL.

Not found (404): This indicates the request was received by the server, but the server couldn’t find the page that was requested.

There are other possible responses, but the above are the most common.
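If you want to spot-check these responses yourself outside of GSC, a quick sketch like the one below will do it. It uses the third-party requests library, and the URLs are placeholders.

```python
# Spot-check how a handful of URLs respond. Requires the requests package;
# the URLs below are placeholders.
import requests

urls = [
    "https://www.example.com/",
    "https://www.example.com/old-product",
    "https://www.example.com/missing-page",
]

for url in urls:
    # allow_redirects=False shows the raw 301/302 instead of the final page
    response = requests.get(url, allow_redirects=False, timeout=10)
    print(response.status_code, url)
```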

Now, how about purpose?

Crawl purpose is the reason why Google is crawling your site. There are two purposes: discovery and refresh.

Discovery happens when a crawlbot crawls a URL for the first time. Refresh happens when a crawlbot recrawls a URL it has already crawled.

Within the GSC Crawl Stats report, purpose is calculated as a percentage. There is no good or bad percentage for either purpose type. However, you should use this section as a gut check against your website activities.

If you’re a new website that is publishing tons of new content, then your discovery percentage is going to be higher for the first few months. If you’re an older website that is focused on updating previously published content, then it makes sense that your refresh percentage would be higher.

This crawl data, along with the file type breakdown, is all available in GSC for you to use to your advantage. Fortunately, you don't have to be a GSC professional to get the most out of this tool. I created this GSC expert guide to get you up to speed.

Crawl Response and E-Commerce: Our Findings

Sometimes, it’s not enough to know how your website is performing. Instead, it helps to compare it to other websites in your industry to get an idea of the average.

That way, you can compare your website to the competition to see how it stacks up.

So how can you do that with an eye towards Google crawling activities? With the Google Search Console Crawl Stats report!

Let me clarify: You can only analyze a website in GSC when you own it or have access to its backend. However, my team at NP Digital has done the heavy lifting for you. We've analyzed three of our clients' top-ranking e-commerce websites to determine the average crawl responses and crawl purposes.

You can use the information we gleaned to compare it to your own website’s GSC crawl stats report and see how you measure up.

So, what did we find?

Client A

First up is a nutritional supplement company based in Texas, United States.

By Response

When looking at the breakdown by response for Client A, it’s a rather healthy mix.

200 status OK URLs are the largest response, by far, at 78 percent. This means that 78 percent of the crawled URLs responded successfully to the call from the crawlbot.

One thing to note here is that 200 status OK URLs can be either indexed or noindexed. An indexed URL (the default) is one that crawlbots are encouraged to both crawl and index. A noindexed URL is one that crawlbots can crawl but will not index. In other words, they won't list the page on SERPs.

If you want to know what percentage of your 200 status OK URLs are indexed versus noindexed, you can click into the "By response" section in GSC and export the list of URLs.

You can then bring that list over to a tool like Screaming Frog to determine the number of indexed versus noindexed URLs in your list.
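If you'd rather script the check yourself, a rough sketch like the one below looks for a "noindex" directive in either the robots meta tag or the X-Robots-Tag header. It assumes the third-party requests and beautifulsoup4 packages, and "exported_urls.txt" stands in for your GSC export.

```python
# Check an exported list of 200-status URLs for a "noindex" directive in the
# robots meta tag or the X-Robots-Tag header. Requires requests and
# beautifulsoup4; "exported_urls.txt" stands in for your GSC export.
import requests
from bs4 import BeautifulSoup

with open("exported_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    response = requests.get(url, timeout=10)
    header = response.headers.get("X-Robots-Tag", "")
    soup = BeautifulSoup(response.text, "html.parser")
    meta = soup.find("meta", attrs={"name": "robots"})
    meta_content = meta.get("content", "") if meta else ""
    noindexed = "noindex" in header.lower() or "noindex" in meta_content.lower()
    print("noindexed" if noindexed else "indexable", url)
```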

Perhaps you're asking, "Why does that matter?"

Let's say that 200 status OK URLs make up 75 percent of your crawl response report, totaling 100 URLs. If only 50 percent of those URLs are indexed, that considerably cuts down the impact of your URLs on SERPs.

This knowledge can help you to improve your indexed URL portfolio and its performance. How? You know that you can reasonably impact just 50 percent of those 100 URLs. Instead of measuring your progress by analyzing all 100 URLs, you can narrow in on the 50 that you know are indexed.

Now on to the redirects.

Nine percent of the URLs are 301 (permanent) redirects, while less than one percent are 302 (temporary) redirects.

That’s an almost 10 to 1 difference between permanent and temporary redirects, and it’s what you would expect to see on a healthy domain.

Why?

Temporary redirects are useful in many cases, for example, when you’re performing split testing or running a limited-time sale. However, the key is that they are temporary, so they shouldn’t take up a large percentage of your responses.

On the flip side, permanent redirects are more beneficial for SEO. This is because a permanent redirect tells crawlbots to index the newly targeted URL and not the original URL. This reduces crawl bloat over time and ensures more people are directed to the correct URL first.
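Exactly how you issue a 301 versus a 302 depends on your server or platform. Purely as an illustration, here's a minimal Flask sketch with hypothetical routes that makes the choice explicit; it isn't how these clients' sites are built.

```python
# Illustration only: choosing the redirect type explicitly in application code.
# Flask is used as an example framework; the routes and targets are hypothetical.
from flask import Flask, redirect

app = Flask(__name__)

@app.route("/summer-sale")
def summer_sale():
    # Temporary (302): the original URL will return, so keep it indexed
    return redirect("/sale-landing", code=302)

@app.route("/old-product")
def old_product():
    # Permanent (301): tell crawlbots to index the new URL instead
    return redirect("/new-product", code=301)

if __name__ == "__main__":
    app.run()
```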

Last, let's look at 404 URLs. For this client, they make up only three percent of the total responses. While the goal should be zero percent, that's typically very hard to achieve at scale.

So if zero percent 404 URLs is unlikely, what can you do to ensure the customer still has a good experience? One way is by creating a custom 404 page that displays similar options (e.g., products, blog posts) for the visitor to go to instead, like Clorox does on its 404 page.
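The mechanics of a custom 404 page vary by platform. As a hedged sketch, again using Flask and an assumed 404.html template, the important detail is returning a helpful page while keeping the 404 status code so crawlers don't mistake it for a soft 404.

```python
# Hypothetical sketch of a custom 404 handler in Flask. "404.html" is an
# assumed template that links to popular products or posts.
from flask import Flask, render_template

app = Flask(__name__)

@app.errorhandler(404)
def page_not_found(error):
    # Serve a helpful page, but keep the 404 status so crawlers
    # don't treat this as a soft 404
    return render_template("404.html"), 404
```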

By File Type

Let's not forget to consider the requests by file type. That is, the type of file returned in response to the crawlbot's request.

A large share (58 percent) of the crawled files for Client A are HTML. You'll notice that JavaScript is clearly present, too, with 10 percent of requests being answered by a JavaScript file type.

JavaScript can make your site more interactive for human users, but it can be more difficult for crawlbots to navigate. This may hinder performance on SERPs, which is why JavaScript SEO best practices must be followed for optimal performance and experience.

By Purpose

Finally, let’s look at the requests by purpose.

In Client A’s case, 13 percent of the crawl purpose is discovery with the remaining 87 percent being labeled refresh.

Client B

Next up is a natural artesian water brand based in California, United States.

By Response

Similar to Client A, the majority (65 percent) of Client B's response types are 200 status OK URLs. However, the gap between the OK status URLs and redirects is not as large as one would want it to be.

Of the total responses, 19 percent are 301 (permanent) redirects and one percent are 302 (temporary) redirects. That's still a healthy balance between the two, though 20 percent of URL responses being redirects is quite high.

So, what can Client B do to ensure the redirects aren’t negatively impacting crawl indexing or user experience?

One thing they can do is ensure their 301 redirects don’t include any redirect chains.

A redirect chain is just what it sounds like—multiple redirects that occur between the initial URL and the final destination URL.

The ideal experience is just one redirect, from Page A (source URL) to Page B (target URL). However, sometimes you can get redirect chains that mean Page A goes to Page B which goes to Page C, and so on. This may confuse the visitor and slow page load times.

In addition, it can confuse crawlbots and delay the crawling and indexing of URLs on your website.

So, what’s the cause of redirect chains?

It's most often an oversight. That is, you redirect to a page that already has a redirect in place. However, it can also happen during website migrations, when an old URL is pointed to an interim URL that is later redirected again to its final destination.
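One way to hunt for chains is to follow each redirect and count the hops. The sketch below uses the requests library's response.history to do that; the URL is a placeholder.

```python
# Follow each redirect and count the hops; more than one hop means a chain.
# Requires the requests package; the URL is a placeholder.
import requests

urls = ["https://www.example.com/old-category/old-product"]

for url in urls:
    response = requests.get(url, timeout=10)  # follows redirects by default
    hops = [r.url for r in response.history] + [response.url]
    if len(response.history) > 1:
        print("Redirect chain:", " -> ".join(hops))
    elif response.history:
        print("Single redirect:", " -> ".join(hops))
    else:
        print("No redirect:", url)
```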

By File Type

Now let’s consider the crawl by file type.

Client B has quite a high percentage of “Other” file types at 23 percent. There’s nothing inherently wrong with the “Other” file type assuming you know what those file types are. The “Other” file type just means anything outside of the other defined file types, and it can even include redirects.

However, combined with the 12 percent “Unknown (failed requests),” it’s something for the client to dig into and resolve.

By Purpose

The breakdown of purpose for Client B is 90 percent refresh and 10 percent discovery.

As mentioned above, there is no right or wrong breakdown here. However, with such a high refresh crawl rate, it would be a good idea to ensure that your pages are optimized for the next crawl. How? First, clean up 404 errors by setting up redirects, preferably 301s.

When doing so, be sure the 301 redirects are not chained. If a redirect already exists for a URL, break that existing relationship before creating the new 301 so you don't introduce a chain.

Client C

The third and final client we analyzed is a food gift retailer based in Illinois, United States.

By Response

Similar to Clients A and B, the majority (68 percent) of Client C’s response types are 200 Status OK URLs.

Where we veer into new territory is with Client C’s 404 Not Found URLs, which are a whopping 21 percent of their total response types to crawlbots.

Why might this be the case?

The most likely culprit is simple oversight.

When a page is moved or deleted, as happens from time to time, a 301 or 302 redirect must be set up to direct traffic elsewhere. These moves and deletions tend to happen on a smaller scale, like when a product is no longer sold by a company. As an e-commerce brand, learning to deal with out-of-stock or discontinued products requires tactical precision and alignment between sales and marketing.

However, a website domain transfer can cause this to happen on a much larger scale.

Not all domain transfers occur within a one-to-one framework. By that, I mean that your new site’s structure may not match your old site’s structure exactly. 

Let's say your old website had category pages as part of its structure, but the new site doesn't. Even though there's not a one-to-one URL match, you still need to redirect those URLs. Otherwise, you'll end up with a large number of 404 errors.

Even within a one-to-one framework transfer, though, the redirects must be set up by the website owner.
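One way to avoid a pile of 404s in that scenario is to build a redirect map with a pattern-based fallback for URL types that no longer have a one-to-one equivalent. The sketch below is purely illustrative, with hypothetical old paths and targets.

```python
# Illustrative redirect map for a migration where old category pages have no
# one-to-one equivalent. Old paths and targets are hypothetical.
import re

# Explicit mappings where a direct equivalent exists on the new site
redirect_map = {
    "/category/supplements/whey-protein": "/products/whey-protein",
}

def resolve_redirect(old_path):
    if old_path in redirect_map:
        return redirect_map[old_path]
    # Pattern-based fallback: send orphaned category URLs to the closest
    # parent page rather than letting them 404
    if re.match(r"^/category/", old_path):
        return "/products/"
    return None  # no rule yet: this URL will 404 unless handled elsewhere

print(resolve_redirect("/category/supplements/whey-protein"))
print(resolve_redirect("/category/discontinued-line/some-item"))
```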

Speaking of redirects, Client C does have some permanent redirects established. They make up 10 percent of the site’s response types. As for temporary redirects, those make up less than 1 percent of the response types.

By File Type

Jumping into the file type breakdown, Client C has a higher percentage of JavaScript file types than the other two clients. The JavaScript file type is 13 percent of requests. “HTML” (43 percent) and “Other” (12 percent) are the other major file types being crawled.

A reminder here that JavaScript file types can be more difficult for crawlbots to crawl and index. So in advising Client C, I would recommend they investigate those JavaScript file types and keep only what is required.

By Purpose

Last but not least, let’s look at the By Purpose breakdown for Client C.

Client C has an 83 percent refresh rate, which is the lowest of the three clients, though not outside the "norm." This simply indicates that Client C is currently publishing more new content than Clients A and B.

Again, it wouldn’t be a bad idea for Client C to evaluate their redirects (especially looking out for redirect chains). In the case of Client C, they should also focus heavily on correcting those 404 errors.

The Average Crawl Responses, File Types, and Purposes

Now that we've analyzed each client, let's take a look at the averages across the board. Starting with purpose, the average breakdown works out to 33 percent discovery and 67 percent refresh.

Looking at the average crawl stats, OK (200) status URLs are the core response type. 301 redirects are next, and that’s not surprising in e-commerce, where products and collections are often phasing in and out.

One “surprise” here is that the average rate of HTML file types is 50 percent, which is lower than our team anticipated. However, its edge over JavaScript is to be expected, considering the issues that crawlbots have with JavaScript files.

Insights From the Crawl Response of These E-Commerce Companies

We’ve delved into three e-commerce websites and discovered how Google is crawling their sites and what they’re finding.

So, how can you apply these learnings to your own website?

Cut down on 404 responses. You should first determine whether it's a true 404 or a soft 404, then apply the correct fix. If it is a true 404 error, you should create the appropriate redirect. If it is a soft 404, you can work to improve the content and get the URL reindexed.

Create smart redirects. If you must create a redirect, it’s important that you choose the correct one for the situation (temporary or permanent) and that you ensure there is no redirect chaining. 

Evaluate the necessity of JavaScript file types. Crawlbots may have trouble crawling and indexing JavaScript file types, so revert to an HTML file type when possible. If you must use JavaScript, then enabling dynamic rendering will help to reduce crawl load significantly.

Use crawl purpose to gut-check your site’s indexing activities. If you recently made changes (e.g., added new pages, updated existing pages) but the corresponding purpose percentage hasn’t budged, then be sure the URLs have been added to the sitemap. You can also increase your crawl rate to have Google index your URL more quickly.
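On that last point, making sure new URLs actually land in your XML sitemap is often the quickest fix. Here's a minimal sketch, with placeholder URLs, that generates sitemap entries using Python's standard library.

```python
# Generate sitemap entries for newly added URLs with Python's standard library.
# The URLs are placeholders.
from datetime import date
from xml.etree import ElementTree as ET

new_urls = ["https://www.example.com/new-product"]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for url in new_urls:
    entry = ET.SubElement(urlset, "url")
    ET.SubElement(entry, "loc").text = url
    ET.SubElement(entry, "lastmod").text = date.today().isoformat()

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```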

With the above efforts combined, you’ll see a marked improvement in your crawl stats.

FAQs

What are crawl stats?

Crawl stats are information that helps you to understand how crawlbots crawl your website. These stats include the number of requests grouped by response type, file type, and crawl purpose. Using the GSC Crawl Stats report, you can also see a list of your crawled URLs to better understand how and when site requests occurred.

Conclusion

If your URLs aren’t being properly crawled and indexed, then your hopes of ranking are nil. This means any SEO improvements you make to your non-crawled, non-indexed web pages are for nothing. 

Fortunately, you can see where each URL on your website stands with GSC’s Crawl Stats report.

With this crawl data in hand, you can address common issues that may be hindering crawlbot activities. You can even track this performance month-over-month to get a full picture of how your crawl stat improvements are helping.

Do you have questions about crawl stats or Google Search Console’s Crawl Stats report? Drop them in the comments below.
