Google Analytics Data Sampling: What You Need to Know for GAU & GA4
What if I told you that your Google Analytics reports may not be as accurate as you assumed them to be?
That’s because Google Analytics uses data sampling—a method that reduces workload but increases the risk of inaccurate results—on certain reports in certain situations.
The good news is that you don’t need to be a Google Analytics pro to understand sampling and how it can impact your reports and data quality.
In this post, I’ll explain what Google Analytics data sampling is, why it’s used, and how it works. I’ll also share problems that Google Analytics sampling can cause for your reports, including an example from an NP Digital client. Of course, I’ll also share ways that you can avoid and manage data sampling in GA.
If you’re ready to learn more about GA4 data sampling than you ever thought you’d need to know, read on.
Key Findings on Google Analytics Data Sampling
There are two types of data sampling in Google Analytics: session sampling (implemented in ad-hoc reports) and data-collection sampling (which occurs before data is sent to Google Analytics). Both are designed to reduce processing time.
The primary drawback is the potential loss of accuracy, affecting both small and large websites.
Comparing sampled and un-sampled data for a client at NP Digital showed variations in reported numbers.
Applying the same regex as a segment (sampled) and as an advanced filter (un-sampled) resulted in different year-over-year outcomes.
To ensure accurate comparisons, it is recommended to export sampled data from GAU if the corresponding large data sets in GA4 are likely to be sampled.
Reducing the date range is an effective way to avoid sampling, because a shorter range contains fewer sessions and is more likely to fall below the account threshold.
Utilizing default reports without any filters or segments is the surefire way to prevent data sampling.
Requesting unsampled results in GA 360 is a workaround, but it comes with considerations like cost and non-real-time, read-only results.
What Is Data Sampling in Google Analytics?
There are two types of data sampling in Google Analytics.
The first type is session sampling, and this is implemented within Google Analytics ad-hoc reports after session data has been collected.
The second type is data-collection sampling, where the data collected by your website or app is just a sample of the entirety of hits your property has received. This occurs before the data is sent to Google Analytics, so only the sample data and not all data is stored by Google Analytics.
The main benefit of data sampling is faster reporting. You’re either analyzing less data (with session sampling) or collecting less data (data-collection sampling), so processing time is reduced.
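To make session sampling concrete, here is a minimal sketch in Python. It assumes a simple random sample of sessions whose count is then scaled up by the sampling rate. That is an illustrative assumption, not a description of Google’s actual algorithm, and the session data, `sampled_estimate`, and `is_blog` are hypothetical names made up for this example.

```python
import random

def sampled_estimate(sessions, predicate, rate, seed=0):
    """Estimate how many sessions match `predicate` from a random sample.

    Each session is kept with probability `rate`; the matching count in the
    sample is then divided by `rate` to scale it back up to the full data set.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = [s for s in sessions if rng.random() < rate]
    matches = sum(1 for s in sample if predicate(s))
    return round(matches / rate)

# Hypothetical data: 10,000 sessions, every 7th one landing on the blog.
sessions = [{"landing_page": "/blog/" if i % 7 == 0 else "/"} for i in range(10_000)]

def is_blog(session):
    return session["landing_page"] == "/blog/"

exact = sum(1 for s in sessions if is_blog(s))
estimate = sampled_estimate(sessions, is_blog, rate=0.58)
print("Exact count:", exact)
print("Estimate from a 58% sample:", estimate)
```

The estimate will usually land close to the exact count without matching it, which is the trade-off the rest of this post is about.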
To understand when and why data sampling can occur, it’s important to understand the two different categories of reports in GA4 (and previously available in GAU): default reports and ad-hoc reports.
Default reports are those that appear in the left navigation bar of your GA4 property:
When you run these reports as-is (i.e. no segments or filters added), Google Analytics pulls from its aggregated data tables to provide results. Within default reports, sampling does not occur.
Ad-hoc reports are either default reports with segments, filters, or secondary dimensions, or they’re custom reports with dimensions and metrics that don’t exist in default reports. Ad-hoc reports are subject to sampling.
When is Google Analytics sampling applied? According to Google, “Ad-hoc queries are subject to sampling if the number of sessions for the date range you are using exceeds the threshold for your property type.”
So what are the limits by property type?
For the Analytics Standard account type, it’s 500K sessions at the property level. For the Analytics 360 account type, it’s 100M sessions at the view level.
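As a quick sanity check, the thresholds above can be written as a small lookup. This is only a sketch: the function name and the property-type keys are made up for illustration, and the numbers are the ones quoted in this post.

```python
# Session thresholds quoted above, keyed by a hypothetical property-type label.
SAMPLING_THRESHOLDS = {
    "standard": 500_000,   # Analytics Standard, property level
    "360": 100_000_000,    # Analytics 360, view level
}

def may_be_sampled(sessions_in_range: int, property_type: str = "standard") -> bool:
    """Return True if an ad-hoc query over this many sessions exceeds the threshold."""
    return sessions_in_range > SAMPLING_THRESHOLDS[property_type]

print(may_be_sampled(750_000))          # True
print(may_be_sampled(750_000, "360"))   # False
```

Keep in mind that this only applies to ad-hoc queries. Default reports run as-is are not sampled regardless of session count.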
Why Google Analytics Sampling Can Be a Problem for Your Reports
Faster reporting aside, Google Analytics data sampling can cause some problems for your reports.
The greatest drawback to sampling is the loss of accuracy that occurs. This can occur for both small websites (where the sample size may be too small) and large websites (where the sample size becomes less representative of the average session).
For smaller websites, the risk is that sample sizes will be too small. When the sample size is too small, you will get a poor representation of all the data. This is especially important for performance metrics that are already a small subset of sessions, such as add to cart and conversion rate.
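To see why a small sample hurts a metric like conversion rate, here is a rough sketch using the textbook margin of error for a proportion. It assumes sessions are sampled at random, which may not match how Google actually samples, and the 2 percent conversion rate and the session counts are hypothetical.

```python
import math

def conversion_rate_margin(rate: float, sampled_sessions: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a conversion rate estimated from a sample."""
    standard_error = math.sqrt(rate * (1 - rate) / sampled_sessions)
    return z * standard_error

# Hypothetical 2% conversion rate measured on samples of different sizes.
for n in (1_000, 10_000, 100_000):
    margin = conversion_rate_margin(0.02, n)
    print(f"{n:>7} sampled sessions: 2.00% +/- {margin * 100:.2f} points")
```

Under these assumptions, the margin shrinks only with the square root of the sample size, so a sample ten times smaller gives a margin roughly three times wider.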
For larger websites, the sample sessions may not be representative of the average user. It’s impossible to control what data is sampled, so outliers may be included. And the larger your website becomes, the smaller the share of sessions that gets analyzed, so the more inaccurate your reports are likely to be.
There is one other risk faced by small and large websites alike, and that’s inconsistency across your reports. That is, some reports will utilize un-sampled data while other reports will utilize sampled data. In many cases, this will result in a difference in the numbers you see and ultimately base your decisions on.
Let’s review the data of a client with NP Digital, my digital marketing agency, for an example of such inconsistencies.
How Sampling Can Impact Data: Real-Life Examples
To show the difference in numbers between sampled data and un-sampled data, we compared the same regular expression (regex) used to isolate landing pages when created as a segment versus as an advanced filter.
First is the regex when applied as a segment:
When we apply that segment to analyze organic sessions data from one month (July 1, 2023 – July 31, 2023) compared to the previous year, we get an increase of 13 percent (6,754 vs 5,986) year-over-year (YoY):
When we apply that same regex as an advanced filter, though, we instead get a decrease of 9 percent (6,012 vs 6,600) YoY:
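The two year-over-year figures can be reproduced from the session counts quoted above. The function name is just for illustration.

```python
def yoy_change(current: int, previous: int) -> float:
    """Percent change from the previous period to the current one."""
    return (current - previous) / previous * 100

segment = yoy_change(6_754, 5_986)    # regex applied as a segment
filtered = yoy_change(6_012, 6_600)   # same regex as an advanced filter

print(f"Segment: {segment:+.1f}%")            # Segment: +12.8%
print(f"Advanced filter: {filtered:+.1f}%")   # Advanced filter: -8.9%
```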
How is that possible?
In the first example, when applying the regex as a segment, we see sampled data. You can identify this by looking at the shield in the top left corner of the page. In this case, the shield is yellow which indicates it’s sampled data. Further, it’s using only 58 percent of sessions to base the data on:
In the second example, when applying the regex as an advanced filter in a default report, we see un-sampled data. This is evidenced by the green shield in the top left corner.
So maybe you’re wondering, how do we know that we’re comparing apples to apples? After all, it’s possible to incorrectly set up a segment or an advanced filter even if they’re seemingly using the same regex.
When we check the segment and the advanced filter over a single month, as opposed to a single month compared to the previous year, the segment and advanced filter numbers match.
Why? Remember that Google Analytics sampling is only applied to ad-hoc queries if the number of sessions for the date range you are using exceeds the threshold for your property type. So when we look at a smaller date range, we don’t exceed our threshold and, therefore, the data is unsampled.
Here is the same regex segment applied to July 2023 only, which shows 6,012 sessions:
We can see that the shield is green (indicating unsampled data), and the sessions align with the regex added to the advanced filter.
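With both sets of numbers in hand, we can estimate how far off the sampled figures were for each month. This sketch assumes the first number in each pair quoted earlier is July 2023 and the second is July 2022, which is how the percentages work out. The function name is made up for this example.

```python
def sampling_error(sampled: int, unsampled: int) -> float:
    """Percent by which the sampled figure differs from the unsampled one."""
    return (sampled - unsampled) / unsampled * 100

july_2023 = sampling_error(6_754, 6_012)
july_2022 = sampling_error(5_986, 6_600)

print(f"July 2023: {july_2023:+.1f}%")   # July 2023: +12.3%
print(f"July 2022: {july_2022:+.1f}%")   # July 2022: -9.3%
```

The sampled report overstated one month and understated the other, which is enough to turn a year-over-year decline into an apparent increase.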
Perhaps you’re thinking, the regex is large and complex. Is that possibly impacting what we see?
While the above example is for a larger regex, this even happens when applying a simple organic traffic segment. In the below example, the NP Digital team applied the Organic Traffic segment to the Landing Page report:
This exceeded the session threshold and, therefore, the report is now pulling sampled data.
So what’s the takeaway?
If accuracy is important, then don’t apply a segment over large data sets. If the data set exceeds your account threshold, the data will be sampled and this can lead to inaccuracy.
Are you not sure whether your data set will be too large for your property type? Look to the shield. If it’s green, it’s unsampled and you’re getting the full picture. If it’s yellow, it’s sampled and what you’re seeing is just a subset of the data.
With that said, you will want to ensure that the data you export from Google Analytics Universal is sampled.
Why?
When analyzing data, it’s crucial to compare like to like. If we know that large data sets will be sampled in GA4, then the data we export from GAU should be sampled to account for this.
Otherwise, you’ll be comparing GA4 sampled data to GAU un-sampled data, and your comparisons will be inaccurate as a result.
What Can You Do to Avoid Data Sampling in GA?
The best ways to avoid Google Analytics sampling are to reduce your date range and to utilize default reports.
When you reduce your date range, you reduce the number of sessions. If the number of sessions falls below your account threshold, then your ad-hoc reports will be unsampled.
The only surefire way to avoid data sampling in GA is to utilize default reports without any filters or segments applied.
Managing Data Sampling
It’s not always possible to avoid data sampling in GA. So what can you do to manage Google Analytics sampling and ensure reporting accuracy? There are a few tricks to consider.
Adjust the Population Size
If a too-small sample size is a risk, then adjusting the population size can help you to gain more accurate results.
The fastest way to adjust the population size is to reduce the date range. This can help you to avoid sampling entirely, or at least ensure that you’re pulling in more accurate data.
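If you need a long period but want each query to stay small, one option is to split the range into shorter chunks and run the report once per chunk. Here is a minimal sketch of the splitting step only. The function name and the 31-day chunk size are assumptions for illustration, and whether a given chunk stays under your threshold depends on your traffic.

```python
from datetime import date, timedelta

def split_date_range(start: date, end: date, max_days: int):
    """Yield (chunk_start, chunk_end) pairs covering start..end inclusive."""
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(chunk_start + timedelta(days=max_days - 1), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + timedelta(days=1)

chunks = list(split_date_range(date(2023, 1, 1), date(2023, 3, 31), max_days=31))
for chunk_start, chunk_end in chunks:
    print(chunk_start.isoformat(), "to", chunk_end.isoformat())
```

Additive metrics such as sessions can be summed across chunks. Metrics like users cannot, because the same user may appear in more than one chunk and would be counted twice.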
Adjust the Level of Precision (GA 360 only)
If you have a GA 360 account, you can use the data quality icon to select one of two options that will impact the sample size. Per Google:
Greater precision: uses the maximum sample size possible to give you results that are the most precise representation of your full data set
Faster response: uses a smaller sampling size to give you faster results
Request Unsampled Results (GA 360 only)
If you have a GA 360 account, you can also request unsampled results.
Do keep in mind that unsampled explorations are not real-time, they are read-only, and they are temporary. They also cost “tokens” to obtain.
So while this isn’t the perfect solution, it’s a good workaround in certain situations.
FAQs
What is Google Analytics data sampling?
Google Analytics data sampling is a method in which reports are built from a subset of your data rather than all of it, typically once your GA property threshold has been exceeded. This means that data from a percentage of your website’s sessions is used to provide directionally accurate insights.
How does Google Analytics sampling work?
Google Analytics uses both session sampling and data-collection sampling. With session sampling, all data is collected but the reports are built on just a percentage of that data. With data-collection sampling, just a percentage of data is collected from your website and reports are then created using that.
When does Google Analytics sample data for reporting?
Not all GA reports use sampled data. So, when is sampled data used? Sampled data is used when creating an ad-hoc report that pulls in more sessions to analyze than your property threshold allows. You may also see sampled data on default reports when using segments or advanced filters.
How to avoid data sampling in Google Analytics?
There are two ways to avoid data sampling in Google Analytics. The first, and the only surefire one, is to use only default reports without the addition of segments or filters that may pull in more sessions than your property threshold allows. The second is to reduce your date range so you’re pulling in fewer sessions, which avoids sampling whenever the session count falls below your threshold.
Conclusion
Even if you can’t avoid data sampling in Google Analytics in all cases, you can minimize the impact it has on your reports.
By understanding what data sampling is, why it’s used, and how it works, you can begin to modify your GA reports to ensure the data meets your needs.
Best of all, you’ll once again be the master of your website performance data.
Do you have questions about data sampling in GA? Let us know in the comments below.