Companies no longer collect clickstream data with Google Analytics 4 (GA4) solely to enable data-informed decision making by inspecting and analysing the collected user data, assessing marketing campaign performance or reporting on the most valuable landing pages. Companies looking to create tangible business value use it as a data collection tool that feeds their marketing platforms with conversions and valuable audiences.
This has always been the case, but with the advent of GTM Server-Side (sGTM) and its growing capabilities, this integration is deepening.

As illustrated in the above visualisation, companies are migrating their data collection for services like Google Ads, Meta, Floodlight and others from their client-side to server-side containers. This shift offers significant benefits, including additional data control, optimised page load speed and the ability to enrich data streams with first-party data in real-time. This improved data collection strategy opens up opportunities for customised solutions such as real-time dashboards and personalised communication.
While the shift to GA4 and sGTM is a positive step that has the potential to improve digital marketing performance (read one of our case studies on how to utilise sGTM to improve marketing performance here), it also brings a new challenge that many of those rushing to adopt this approach have so far overlooked. The growing number of vendors and tools that rely on a single GA4 data stream makes accurate data collection critical. Businesses must prioritise data quality to ensure the effectiveness of their digital marketing strategies.
The traditional QA flow: A recipe for losing trust
The complexity and risks associated with the dominant approach become apparent when we examine the current state of data quality assurance (QA) in organisations.
The traditional QA process for GA4 data collection as well as measurement implementations for other marketing vendors involves multiple departments and different layers – as you can see in the figure below. Usually, website developers and the measurement team collaborate on the dataLayer specifications and their implementation. The exposed events and associated values drive the tag management system (e.g. Google Tag Manager), where GA4 tags are configured to be triggered, read the requested values and send them to the GA4 servers. Finally, the processed data is made available to a wide range of data consumers and decision makers in the organisation via GA4 UI or dedicated dashboards.

When GA4 data quality measures aren’t in place, data recipients usually identify inconsistencies in data and report their findings to you and your colleagues on the measurement team. If you’ve ever received a call or email pointing out discrepancies in your data collection while you were getting ready for the weekend, you can understand how it can ruin your weekend plans (at least I can). When you receive this message, the work begins: your team checks to see if they can confirm the issue or if the recipients lack context or training. If they can confirm the issue, they need to identify the source – most likely by trying to replicate it. In that process, they need to communicate with the data consumers and web developers before the culprit is identified and eventually resolved.
You can see how frustratingly cumbersome this process is. But it’s still how most companies work out there. With such a long QA flow, errors can spread quickly and affect multiple areas before they are discovered. Furthermore, the involvement of different departments increases response times and delays the resolution of errors. Inadequate tools often prevent effective troubleshooting at scale.
The consequences of these shortcomings are serious:
- Delayed insights: Errors found by data consumers often mean that the data has been wrong for a period of time, leading to delayed or misunderstood insights.
- Reduced trust: Repeated errors detected by end users can undermine trust in the analytics platform and make stakeholders less confident in the data.
- Increased workload: Data consumers, tracking teams and development teams all face an increased workload to identify, report and resolve issues, which can divert resources from more strategic initiatives.
- Operational inefficiencies: Finding and fixing errors after the data has already been collected ties teams up in rework and firefighting instead of prevention.
From reactive to proactive data quality monitoring
To overcome the problems with this reactive approach, as a responsible member of your tracking team, you should develop proactive measures that catch errors as early as possible, before they reach your data consumers and harden into ‘facts’. In the following sections, let’s explore some of the options available to us, from built-in GA4 features to the BigQuery (BQ) raw data export and a custom validation endpoint.
Utilising GA4 Insights
GA4 has a built-in insights feature that can automatically detect changes in your data. By creating custom alerts (yes, it’s hard to let go of the good old UA terminology), you can monitor important data changes and receive email notifications when certain defined conditions are met. For example, you can create alerts for significant drops in active users or purchase events, ensuring timely intervention.

The configuration is simple and allows for a great deal of flexibility:
- Evaluation frequency: Hourly (web only), daily, weekly, monthly
- Segment: All users is the default segment. Toggle to select other dimensions and dimension values, and specify whether to include or exclude the segment
- Metric: Select the metric, condition and value to set the threshold that triggers the insight. For example: 30-day active users – % decrease by more than 20. If you select Have anomaly as the condition, GA4 determines when the change in the metric is anomalous and you do not need to enter a value.
The ability to set up these notifications is a good step in the right direction as it increases the chances of the measurement team detecting errors in the data collection mechanism instead of putting this task on the data consumers. By moving quality assurance closer to the source, we gain valuable time and control over the handling of these issues (especially from a communication perspective to our end users).
The custom insights feature is excellent at detecting unexpected fluctuations in data volume for relevant metrics, and most importantly, it’s free. Yet it cannot catch event-level data quality issues such as missing or malformed parameters. To do so, we need far more granular configuration options that let us perform event-level checks.
Improving our game with BigQuery and Dataform
I’m not the first to tell you this, but here goes: Exporting your GA4 raw data to BQ will help you get the most out of your GA4 data. In this case, it’s our gateway to implementing effective data quality controls for our data.

By integrating BigQuery with your GA4 setup, you can implement custom evaluation rules using SQL – limited only by your imagination. This integration makes BigQuery a powerful tool for monitoring data quality.
Using SQL in BigQuery, it is possible to develop customised rules to evaluate your data. For example, you can set up rules to validate event data structures and ensure they meet your predefined standards. To give you some inspiration:
- Are all the anticipated ecommerce events tracked?
- Do all these events have an items array containing at least one item?
- Do all items have an item ID, a quantity and a value?
- Do all purchase events have a transaction ID (in the expected format) and is the purchase revenue greater than 0?
- And so on.
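As a hedged illustration, the last two checks could be expressed against the standard GA4 BigQuery export roughly like this – the project, dataset and the ‘T-’ transaction ID convention are placeholders you would replace with your own:

```sql
-- Share of yesterday's purchase events that violate our expectations, by browser.
WITH purchases AS (
  SELECT
    device.web_info.browser AS browser,
    (
      ecommerce.transaction_id IS NULL
      OR NOT REGEXP_CONTAINS(ecommerce.transaction_id, r'^T-')
      OR COALESCE(ecommerce.purchase_revenue, 0) <= 0
    ) AS is_invalid
  FROM `my-project.analytics_123456789.events_*`
  WHERE event_name = 'purchase'
    AND _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
)
SELECT
  browser,
  COUNTIF(is_invalid) AS invalid_purchases,
  COUNT(*) AS total_purchases,
  ROUND(100 * SAFE_DIVIDE(COUNTIF(is_invalid), COUNT(*)), 2) AS invalid_share_pct
FROM purchases
GROUP BY browser
ORDER BY invalid_share_pct DESC
```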
It’s up to you to wrap your business logic in an SQL query like the sketch above that calculates the percentage of invalid events and identifies patterns as to why this happens (e.g. specific page paths or browsers). Now you might be asking yourself: Do you really expect me to run a query like this at the end of every day before I go home?
No, not necessarily. Dataform takes this a step further: it lets you build and operationalise scalable data transformation pipelines, schedule your data quality queries and implement validations on top of their results using assertions.

For example, in the Dataform configuration below, I query for page views from my blog where the custom dimension author is not filled in. In addition, using a Dataform assertion, I specify a rule that these ‘bad’ page views must not make up more than 10% of all measured page views.
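A minimal sketch of such a check, written as a standalone Dataform assertion, could look like this – the table path, the author event parameter and the /blog/ path filter are assumptions for the example:

```sqlx
config {
  type: "assertion",
  description: "Fail if more than 10% of blog page_views are missing the author dimension"
}

WITH blog_page_views AS (
  SELECT
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'author') AS author
  -- Reference the raw GA4 export directly, or declare it as a Dataform source and use ref()
  FROM `my-project.analytics_123456789.events_*`
  WHERE event_name = 'page_view'
    AND _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
    AND (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') LIKE '%/blog/%'
)

-- An assertion fails when the query returns rows, i.e. when the 10% threshold is breached
SELECT
  COUNTIF(author IS NULL) AS page_views_missing_author,
  COUNT(*) AS total_page_views
FROM blog_page_views
HAVING SAFE_DIVIDE(page_views_missing_author, total_page_views) > 0.1
```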

Many of you may already be using Dataform in your workflows to automate the creation of aggregated datasets for reporting. Adding data quality assertions to the mix requires minimal extra work, but it ensures that you and your team deliver high-quality data to your end users. If you haven’t looked at Dataform yet, I highly recommend checking it out – don’t miss out on a huge time saver.
Real-time data validation
But wouldn’t it be even better if we could somehow move the evaluation of data quality up to the point where the data actually originates, enabling real-time monitoring of all the events we collect? If so, we could react even faster to errors and correct mistakes before days of data are compromised. We could manage the communication around how to handle these errors. We could even decide what to do with this erroneous data before it reaches any downstream systems (remember the first image in this article?).
The benefits of such real-time validation are huge and can enable a much more streamlined QA process for our GA4 data.
The concepts of schemas and data contracts
The GA4 website data collection naturally originates from our users’ browsers when they visit our website. We then enable tracking of relevant user interactions via the website’s dataLayer (especially for e-commerce tracking) and use the dataLayer events to trigger our GA4 event tags, which also collect metadata according to the tags’ configurations. Finally, the events and associated data are sent via requests directly to GA4 servers or our own sGTM container for further processing.
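To make this concrete, here is an illustrative purchase dataLayer push in the shape GA4’s ecommerce documentation expects – all values, including the ‘T-’ transaction ID convention used later in this article, are invented for the example:

```javascript
// Illustrative purchase event pushed to the dataLayer (all values are examples).
window.dataLayer = window.dataLayer || [];
dataLayer.push({ ecommerce: null });  // clear any previous ecommerce object
dataLayer.push({
  event: 'purchase',
  ecommerce: {
    transaction_id: 'T-12345',
    value: 129.90,
    currency: 'EUR',
    items: [
      {
        item_id: 'SKU_123',
        item_name: 'Wool Jumper',
        item_category: 'apparel',
        price: 64.95,
        quantity: 2
      }
    ]
  }
});
```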

As the overview above illustrates, all three GA4 data sources consist of objects and key-value pairs that describe a given purchase event and its properties. Therefore, we can think of each GA4 event as a JSON object that contains all the necessary information about it so it can be processed in GA4.
The above JSON contains information about a specific event, but it says nothing about the rules the data should follow, which leads to certain limitations when it is sent to GA4. On their own, such JSON objects have several shortcomings:
- Unclear: The examples don’t tell us which fields are mandatory or optional or what their respective types are. For example, we don’t know if the transaction ID field should always be a string or a number. If it should be a string, we also don’t know its format. Should transaction_id always start with a ‘T-’ as in the example?
- Incomplete: The JSON objects lack full data context; they don’t tell us whether an item object must contain an item_id or item_name field, nor which fields can be omitted.
- No enforcement: The JSON objects lack standardised validation and constraints, so they can’t enforce rules like requiring a user_id whenever login_status equals ‘loggedIn’, or requiring that the item_category value comes from a predefined list.
For the standard GA4 events, Google provides us with extensive documentation on the required key-value pairs and the types of values. Furthermore, if you’ve been serious about your data collection in the past, rest assured that you also have documentation on custom events and associated parameters.
The need to validate an instance of a JSON object against a predefined set of rules that this object must adhere to is fortunately nothing new in programming and has already been solved. One of the most effective ways to ensure data consistency and validity is through JSON Schemas. JSON Schema is a blueprint for JSON data that solves all the problems mentioned. It defines the rules, structure, and constraints that the data must follow.
JSON Schema uses a separate JSON document to describe the JSON data, which means that the schema itself is also machine- and human-readable. Or to rephrase it: we use JSON to describe JSON.
Let’s look at what the schema for our example DataLayer purchase event above might look like:
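Here is a sketch based on the illustrative purchase push from earlier – your own contract will differ, and the allowed currencies and the ‘T-’ transaction ID pattern are just example constraints:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "purchase dataLayer event",
  "type": "object",
  "required": ["event", "ecommerce"],
  "properties": {
    "event": { "const": "purchase" },
    "ecommerce": {
      "type": "object",
      "required": ["transaction_id", "value", "currency", "items"],
      "properties": {
        "transaction_id": { "type": "string", "pattern": "^T-[0-9]+$" },
        "value": { "type": "number", "exclusiveMinimum": 0 },
        "currency": { "enum": ["EUR", "DKK", "USD"] },
        "items": {
          "type": "array",
          "minItems": 1,
          "items": {
            "type": "object",
            "required": ["item_id", "price", "quantity"],
            "properties": {
              "item_id": { "type": "string" },
              "item_name": { "type": "string" },
              "item_category": { "type": "string" },
              "price": { "type": "number", "minimum": 0 },
              "quantity": { "type": "integer", "minimum": 1 }
            }
          }
        }
      }
    }
  }
}
```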

As you can see, the schema above gives much more context to our original purchase dataLayer object, among other things:
- Allowed values
- Pattern constraints (RegEx validation)
- List validation (e.g. minimum number of objects in a list)
- Key validation (e.g. certain keys must be present in an object)
This schema then acts as a kind of data contract that all GA4 events or dataLayer objects must adhere to in order to be considered valid.
Monitoring a website’s dataLayer
With the JSON schema in our toolbox and the full power of GCP at our disposal, we can now put all the pieces together by building a lightweight validation endpoint capable of receiving DataLayer event objects from the website and validating them against the schemas you have defined.

It requires a custom HTML tag in the website container that reads the requested dataLayer object and sends it as a payload to our validation application, which can be hosted on Cloud Run, App Engine or Cloud Functions. The application will read the schema definitions and compare them to the received dataLayer events. The results of this validation (e.g. valid or not, potential error messages, etc.) will be written to BigQuery or application logs.
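As a rough sketch, such a custom HTML tag could look like the following – the endpoint URL is a placeholder, and {{DL - Ecommerce}} is assumed to be a Data Layer Variable you have created for the ecommerce object:

```html
<script>
  (function () {
    // Hypothetical Cloud Run endpoint - replace with your own validator URL.
    var VALIDATOR_URL = 'https://dl-validator-example.a.run.app/validate';

    // {{Event}} is GTM's built-in dataLayer event name; {{DL - Ecommerce}} is an
    // assumed Data Layer Variable returning the ecommerce object of the event.
    var payload = JSON.stringify({
      event: {{Event}},
      ecommerce: {{DL - Ecommerce}},
      page: document.location.pathname
    });

    // sendBeacon keeps the request off the critical rendering path.
    if (navigator.sendBeacon) {
      navigator.sendBeacon(VALIDATOR_URL, payload);
    } else if (window.fetch) {
      fetch(VALIDATOR_URL, { method: 'POST', body: payload, keepalive: true });
    }
  })();
</script>
```

Because the browser never needs the response, the beacon approach adds almost no overhead; you may still want to sample the calls or restrict the tag to the events covered by your schemas.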

The resulting BigQuery tables allow for a data quality dashboard that can be shared with all stakeholders – especially the measurement team and the web developers responsible for the dataLayer implementation. The centralised dashboard can provide an overview of your data quality status, alert you to potential issues and allow you to take proactive measures.
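For completeness, here is a minimal sketch of what the validation application behind this could look like, written here as a Node.js service that checks the payload with the Ajv JSON Schema validator and streams the verdict to BigQuery – the table name, schema path and payload shape are assumptions for this sketch:

```javascript
// index.js - minimal sketch of the validation service (Node.js + Express on Cloud Run).
// Assumes the schema from the previous section is saved as schemas/purchase.json and
// that a BigQuery table data_quality.dl_validation_results exists for the results.
const express = require('express');
const Ajv = require('ajv');
const {BigQuery} = require('@google-cloud/bigquery');

const validate = new Ajv({allErrors: true}).compile(require('./schemas/purchase.json'));
const bigquery = new BigQuery();

const app = express();
app.use(express.text({type: '*/*'}));  // the tag sends the payload as a plain-text body

app.post('/validate', async (req, res) => {
  let event = {};
  try { event = JSON.parse(req.body || '{}'); } catch (e) { /* keep empty object */ }

  const valid = validate(event);
  const errors = valid ? [] : validate.errors.map((e) => `${e.instancePath} ${e.message}`);

  // Persist the verdict so it can be surfaced on a data quality dashboard.
  await bigquery
    .dataset('data_quality')
    .table('dl_validation_results')
    .insert([{
      event_name: event.event || null,
      valid: valid,
      errors: JSON.stringify(errors),
      page: event.page || null,
      received_at: new Date().toISOString(),
    }])
    .catch(() => {});  // a logging failure should not break the response

  res.json({valid: valid, errors: errors});
});

app.listen(process.env.PORT || 8080);
```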
Monitoring the GA4 event stream
If we use sGTM for our GA4 data collection, we can integrate our custom validation endpoint with the container by using an asynchronous custom variable that forwards the event data object parsed by the GA4 client to the validator endpoint. The service then responds with the validation result (e.g. whether the event is valid plus any error messages):
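As a rough sketch, the sandboxed code of such an asynchronous variable template could look like this – the endpoint URL and the response shape are assumptions, and the template needs the corresponding sendHttpRequest permission:

```javascript
// Sandboxed JavaScript for a server-side GTM custom variable template (sketch).
const getAllEventData = require('getAllEventData');
const sendHttpRequest = require('sendHttpRequest');
const JSON = require('JSON');

const VALIDATOR_URL = 'https://dl-validator-example.a.run.app/validate';

// Forward the parsed GA4 event object to the validator and return its verdict.
return sendHttpRequest(
  VALIDATOR_URL,
  {method: 'POST', headers: {'Content-Type': 'application/json'}, timeout: 500},
  JSON.stringify(getAllEventData())
).then(function (response) {
  const body = JSON.parse(response.body) || {};   // e.g. {"valid": false, "errors": [...]}
  return body.valid ? 'valid' : 'invalid';
}, function () {
  return 'validation_unavailable';                // fail open if the service is unreachable
});
```

The returned verdict can then be referenced in the GA4 tag, for example as an extra event parameter, or in trigger and exception conditions for the downstream tags.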

Having this information available before the data is forwarded to GA4 or other downstream vendors such as Google Ads or Meta gives you full control over how to treat compromised events:
- Should these events be dropped altogether?
- Should they be directed to a separate GA4 property for further investigation?
- Should compromised events still be allowed to trigger marketing tags?
- Should the GA4 events simply be enriched with a data quality parameter?

While the previous solution is good for monitoring your data source and getting notified when something breaks, a direct integration with sGTM and real-time enrichment of the GA4 data stream actually allows you to enforce the data contract.
Benefits of proactive data quality monitoring
Whichever solution you choose, proactive data quality monitoring right at the data source offers several significant benefits that ensure your analytics data remains accurate, reliable and usable.

1. Increased confidence in data
When data is consistently accurate and reliable, stakeholders develop greater trust in the analytics platform. This trust is essential to making informed business decisions and laying out an effective strategy.
2. Operational efficiency
By catching errors early in the data pipeline, proactive monitoring reduces the need for extensive cleaning and correction of data after collection. This efficiency saves time and resources, allowing teams to focus on more strategic initiatives instead of firefighting data problems.
3. Cost savings
It is generally less expensive to identify and resolve data quality issues early in the process than to address them once they have affected downstream systems and reports. Proactive monitoring helps avoid the financial consequences of poor data quality on business operations.
4. Improved decision making
High-quality data leads to better analytics and insights that are critical to making sound business decisions. Proactive monitoring ensures that decision makers have access to accurate and timely information, reducing the risk of making decisions based on incorrect data.
Conclusion
Proactive monitoring of data quality is not just a best practice; it is a necessity in the modern data landscape. By implementing robust monitoring and validation systems for their behavioral data collection in GA4, organizations can ensure the integrity of their data, build stakeholder trust and maintain operational efficiency. The shift from reactive to proactive monitoring provides a strategic advantage that makes data quality management a competitive differentiator.
Investing in proactive monitoring not only protects your data, but also improves your organization’s ability to make timely, informed and impactful decisions.
If you need further help or want to discuss how we can help you ensure data quality at scale, please feel free to contact us at IIH Nordic.