Proactive Monitoring
Uptime Monitoring answers one question: is the server up? That’s the minimum. Production sGTM needs monitoring that catches the failure modes that don’t show up at /healthz:
- A tag template that was published yesterday is returning errors 10% of the time.
- Meta CAPI is rejecting events because an access token rotated without anyone updating the tag.
- A spike in request volume is burning through your Cloud Run budget.
- Traffic from one country dropped 40% overnight because a CMP config changed.
None of those surface as outages. All of them cost money or accuracy. The pattern below — log-based metrics, SLOs with error budgets, and anomaly detection on request volume — is what catches them before someone asks why the numbers are off.
The four metric families worth alerting on
| Family | What it measures | Typical threshold |
|---|---|---|
| Availability | Fraction of requests returning 2xx | < 99.9% for 5 minutes |
| Latency | P95 response time | > 500ms for 5 minutes |
| Downstream success rate | GA4/Meta/TikTok CAPI non-error responses | < 95% for 10 minutes |
| Request volume anomaly | Rate vs. rolling 7-day baseline | < 50% or > 200% for 15 minutes |
These four together catch most production issues. Add more at your peril — every additional alert either adds signal or trains the team to ignore alerts. Start with these, tune for 4–6 weeks, then add specific alerts for known failure modes.
Log-based metrics (GCP)
The most useful sGTM observability signal is structured log entries. Every tag fire, every client claim, every error writes a line to Cloud Logging. Log-based metrics let you count entries matching a filter and alert on the rate.
1. **Create a log-based metric for tag errors.** GCP Console → Logging → Log-based Metrics → Create Metric:

   - Name: `sgtm_tag_errors`
   - Filter:

     ```
     resource.type="cloud_run_revision"
     resource.labels.service_name="gtm"
     severity>=ERROR
     ```

   - Type: Counter

2. **Create an alerting policy on the rate.** Cloud Monitoring → Alerting → Create Policy:

   - Metric: `logging/user/sgtm_tag_errors`
   - Condition: rate > 5 per minute, sustained for 5 minutes
   - Notification: your on-call channel

3. **Create log-based metrics for downstream response codes.** sGTM logs the downstream HTTP status code for each outbound call. A separate metric filtered on `jsonPayload.response_code>=400` gives you the downstream failure rate:

   ```
   resource.type="cloud_run_revision"
   resource.labels.service_name="gtm"
   jsonPayload.response_code>=400
   ```

   Tag the metric with a label for destination (`jsonPayload.destination`) so you can alert per vendor. Meta CAPI 401s are a rotating-token problem; GA4 MP 400s are usually a malformed event. They need different responders.
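The same two metrics can also be created from the CLI. A sketch using `gcloud` (the metric names and the `gtm` service name match the console steps above; adjust both to your project):

```shell
# Counter metric for tag errors: any log entry at ERROR or above
# from the sGTM Cloud Run service.
gcloud logging metrics create sgtm_tag_errors \
  --description="sGTM log entries at severity ERROR or above" \
  --log-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="gtm" AND severity>=ERROR'

# Counter metric for downstream failures: outbound calls that
# came back with a 4xx/5xx status code.
gcloud logging metrics create sgtm_downstream_errors \
  --description="sGTM outbound calls returning 4xx/5xx" \
  --log-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="gtm" AND jsonPayload.response_code>=400'
```

Both appear under `logging/user/…` in Cloud Monitoring, same as metrics created in the console.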
SLOs and error budgets
An SLO (Service Level Objective) is a measurable availability target. An error budget is the inverse: how much failure you allow before you stop deploying and start fixing.
For sGTM, a practical SLO pair:
- Availability SLO: 99.9% of requests return 2xx over a rolling 28-day window.
- Latency SLO: P95 response time ≤ 500ms over the same window.
A 99.9% availability SLO gives you a 0.1% error budget: roughly 40 minutes of total-failure-equivalent per 28 days, or equivalently a 0.1% failure rate sustained for the full window. When you’ve burned more than half the budget in the first half of the window, stop shipping template changes and focus on reliability until the window resets.
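The budget arithmetic is worth making explicit; a quick sketch:

```python
WINDOW_DAYS = 28
SLO_TARGET = 0.999  # availability SLO: 99.9% of requests return 2xx

window_minutes = WINDOW_DAYS * 24 * 60        # 40,320 minutes in the window
error_budget_fraction = 1 - SLO_TARGET        # 0.1% of requests may fail
budget_minutes = window_minutes * error_budget_fraction

print(f"{budget_minutes:.1f} minutes of total-outage-equivalent per {WINDOW_DAYS} days")
# → 40.3 minutes of total-outage-equivalent per 28 days
```

The same arithmetic explains why a 99.99% target is so much harder: the budget shrinks to about 4 minutes per window.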
GCP implements SLOs natively in Cloud Monitoring → Services → SLOs. Point the SLO at the uptime check or at a log-based metric, set the target, and Cloud Monitoring tracks burn rate automatically. Set burn-rate alerts at 2% (fast burn, indicates something broke) and 5% (slow burn, indicates a creeping problem).
Anomaly detection on request volume
Absolute thresholds on request rate don’t work — sGTM traffic follows business patterns (daily cycles, weekday vs. weekend, seasonal). What you want is: “is today’s 2pm request rate meaningfully different from last Tuesday’s 2pm request rate?”
Cloud Monitoring supports this natively via “anomaly detection” conditions on metrics. For request count (run.googleapis.com/request_count), set the condition type to “Anomaly” and specify the comparison window (typical: 7 days). The alert fires when today’s rate deviates more than 3 standard deviations from the baseline.
Alternative — scheduled BigQuery query: if your Cloud Logging sinks to BigQuery, a scheduled query comparing current-hour volume to a 7-day-prior same-hour average is simple and explicit:
```sql
WITH current_hour AS (
  SELECT COUNT(*) AS cnt
  FROM `project.logs.sgtm_requests`
  WHERE timestamp BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
                      AND CURRENT_TIMESTAMP()
),
baseline AS (
  SELECT AVG(cnt) AS avg_cnt, STDDEV(cnt) AS stddev_cnt
  FROM (
    SELECT
      COUNT(*) AS cnt,
      EXTRACT(HOUR FROM timestamp) AS hour,
      EXTRACT(DAYOFWEEK FROM timestamp) AS dow,
      DATE(timestamp) AS day
    FROM `project.logs.sgtm_requests`
    WHERE timestamp BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
                        AND TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
      AND EXTRACT(HOUR FROM timestamp) = EXTRACT(HOUR FROM CURRENT_TIMESTAMP())
      AND EXTRACT(DAYOFWEEK FROM timestamp) = EXTRACT(DAYOFWEEK FROM CURRENT_TIMESTAMP())
    GROUP BY hour, dow, day
  )
)
SELECT
  current_hour.cnt,
  baseline.avg_cnt,
  (current_hour.cnt - baseline.avg_cnt) / NULLIF(baseline.stddev_cnt, 0) AS z_score
FROM current_hour, baseline;
```

Alert when `ABS(z_score) > 3`. Run hourly.
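The alerting side of that query is just a z-score check. A minimal sketch of the consumer logic (the function name is illustrative, not a GCP API; it mirrors the SQL, including the `NULLIF` guard against a zero stddev):

```python
def is_anomalous(current_count: int, baseline_avg: float,
                 baseline_stddev: float, z_threshold: float = 3.0) -> bool:
    """Flag when the current hour deviates more than z_threshold
    standard deviations from the 7-day same-hour baseline."""
    if baseline_stddev == 0:  # NULLIF(stddev_cnt, 0) in the SQL: no spread, no z-score
        return False
    z = (current_count - baseline_avg) / baseline_stddev
    return abs(z) > z_threshold

# A 2x traffic spike against a tight baseline trips the alert:
print(is_anomalous(current_count=20_000, baseline_avg=10_000, baseline_stddev=800))  # → True
# Normal hour-to-hour wobble does not:
print(is_anomalous(current_count=10_400, baseline_avg=10_000, baseline_stddev=800))  # → False
```

Wire the result into whatever notifies your team — a Cloud Function that posts to Slack is the usual shape.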
What a useful dashboard looks like
One dashboard, one screen, four panels:
- Request rate (line chart, current vs. 7-day prior same day).
- Error rate (line chart, percentage of 4xx and 5xx).
- P50 / P95 / P99 latency (line chart with three series).
- Downstream success rate by destination (stacked chart: GA4, Meta, TikTok, custom HTTP, each as its own series).
That’s it. Additional panels (instance count, CPU, memory) belong on a deeper debugging dashboard, not the overview. The overview should be readable in 10 seconds at 2am by the person who got paged.
Alert hygiene
Every alert has a runbook. When the alert fires, the notification should include a link to a runbook that says: what the alert means, how to confirm it’s real, what to check first, how to resolve the most common causes. Alerts without runbooks are alerts the team doesn’t know how to action.
Every alert gets tested. After creating an alert, intentionally trigger the condition (e.g., deploy a broken template in preview, run a synthetic high-traffic test) and confirm the notification arrives. Alerts created but never tested have a non-trivial chance of being silently misconfigured.
Review alerts quarterly. Delete alerts that haven’t fired in 6 months (probably covering a non-existent failure mode). Tune alerts that fire weekly without anyone taking action (threshold is too tight). The goal is that every firing alert leads to a human investigating.
Common mistakes
Alerting on downstream vendor errors as if they were your problem. Meta CAPI returning 400 for a specific event is often a vendor-side issue — a rejected payload format, a changed schema. You want to know about it, but paging the on-call at 2am for a 0.5% Meta rejection rate doesn’t help anyone. Route downstream-vendor alerts to an email or Slack channel, not PagerDuty.
Using absolute thresholds on rates. “Alert when request rate < 100/min” works during business hours and fires every night at 3am. Use anomaly detection or business-hours-qualified thresholds instead.
Setting P95 latency alerts below what’s physically achievable. If your sGTM routinely has 300ms P95 latency and you set an alert at 200ms, it fires constantly and teaches the team to ignore it. Set latency alerts ~50% above observed baseline, not at aspirational targets.
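That rule is trivial arithmetic, but writing it down keeps the threshold tied to measurement rather than aspiration. A sketch, assuming the 300ms observed baseline from the example:

```python
observed_p95_ms = 300.0   # measured from your own dashboard, not a wish
headroom = 1.5            # alert at ~50% above the observed baseline

alert_threshold_ms = observed_p95_ms * headroom
print(alert_threshold_ms)  # → 450.0
```

Re-derive the threshold whenever the baseline shifts (new region, new template load), or the alert drifts back toward noise.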
Monitoring nothing except /healthz. Health-check-only monitoring is better than nothing, but it misses the majority of real production issues. A container that responds 200 to /healthz while returning 500 to every actual event is “healthy” to basic monitoring. Log-based metrics catch this.
Configuring monitoring and never looking at the dashboards. Dashboards and alerts are useful in proportion to how often they’re looked at and tuned. A “set and forget” monitoring setup drifts out of usefulness within 6 months as traffic patterns change, deployments alter error fingerprints, and new failure modes emerge that existing alerts don’t cover.