Proactive Monitoring

Uptime Monitoring answers one question: is the server up? That’s the minimum. Production sGTM needs monitoring that catches the failure modes that don’t show up at /healthz:

  • A tag template that was published yesterday is returning errors 10% of the time.
  • Meta CAPI is rejecting events because an access token rotated without anyone updating the tag.
  • A spike in request volume is burning through your Cloud Run budget.
  • Traffic from one country dropped 40% overnight because a CMP config changed.

None of those surface as outages. All of them cost money or accuracy. The pattern below — log-based metrics, SLOs with error budgets, and anomaly detection on request volume — is what catches them before someone asks why the numbers are off.

The four metric families worth alerting on

Family                   | What it measures                          | Typical threshold
Availability             | Fraction of requests returning 2xx        | < 99.9% for 5 minutes
Latency                  | P95 response time                         | > 500ms for 5 minutes
Downstream success rate  | GA4/Meta/TikTok CAPI non-error responses  | < 95% for 10 minutes
Request volume anomaly   | Rate vs. rolling 7-day baseline           | < 50% or > 200% for 15 minutes

These four together catch most production issues. Add more at your peril — every additional alert either adds signal or trains the team to ignore alerts. Start with these, tune for 4–6 weeks, then add specific alerts for known failure modes.

The most useful sGTM observability signal is the structured log stream: every tag fire, every client claim, and every error writes a line to Cloud Logging. Log-based metrics let you count entries matching a filter and alert on the rate.

  1. Create a log-based metric for tag errors

    GCP Console → Logging → Log-based Metrics → Create Metric:

    • Name: sgtm_tag_errors
    • Filter:
      resource.type="cloud_run_revision"
      resource.labels.service_name="gtm"
      severity>=ERROR
    • Type: Counter
  2. Create an alerting policy on the rate

    Cloud Monitoring → Alerting → Create Policy:

    • Metric: logging.googleapis.com/user/sgtm_tag_errors
    • Condition: rate > 5 per minute for 5 minutes
    • Notification: your on-call channel
  3. Create log-based metrics for downstream response codes

    sGTM logs downstream HTTP status codes for each outbound call. A separate metric filtered by jsonPayload.response_code>=400 gives you the downstream failure rate:

    resource.type="cloud_run_revision"
    resource.labels.service_name="gtm"
    jsonPayload.response_code>=400

    Tag the metric with labels for destination (jsonPayload.destination) so you can alert per-vendor. Meta CAPI 401s are a rotating-token problem; GA4 MP 400s are usually a malformed event — they need different responders.
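
If the same logs also sink to BigQuery (as the anomaly query later in this section assumes), a per-destination failure-rate query is a useful cross-check while tuning these metrics. The sketch below assumes the project.logs.sgtm_requests sink table used elsewhere on this page and the jsonPayload.destination / jsonPayload.response_code fields named in step 3; adjust both to your actual schema.

-- Hedged sketch: downstream failure rate per destination over the last hour.
SELECT
  jsonPayload.destination AS destination,
  COUNT(*) AS total_calls,
  COUNTIF(SAFE_CAST(jsonPayload.response_code AS INT64) >= 400) AS failed_calls,
  SAFE_DIVIDE(
    COUNTIF(SAFE_CAST(jsonPayload.response_code AS INT64) >= 400),
    COUNT(*)
  ) AS failure_rate
FROM `project.logs.sgtm_requests`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND jsonPayload.destination IS NOT NULL
GROUP BY destination
ORDER BY failure_rate DESC;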

An SLO (Service Level Objective) is a measurable availability target. An error budget is the inverse: how much failure you allow before you stop deploying and start fixing.

For sGTM, a practical SLO pair:

  • Availability SLO: 99.9% of requests return 2xx over a rolling 28-day window.
  • Latency SLO: P95 response time ≤ 500ms over the same window.

A 99.9% availability SLO gives you a 0.1% error budget: about 40 minutes of total-failure-equivalent per 28-day window (0.1% of 40,320 minutes), or equivalently a 0.1% failure rate sustained for the full 28 days. When you’ve burned more than half the budget in the first half of the window, stop shipping template changes and focus on reliability until the window resets.

GCP implements SLOs natively in Cloud Monitoring → Services → SLOs. Point the SLO at the uptime check or at a log-based metric, set the target, and Cloud Monitoring tracks burn rate automatically. Set two burn-rate alerts in the standard fast/slow pattern: a fast-burn alert when roughly 2% of the error budget is consumed within an hour (something just broke) and a slow-burn alert when roughly 5% is consumed within six hours (a creeping problem).
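
The native SLO tracking is the source of truth, but a manual cross-check from the BigQuery log sink can help when you suspect the SLO itself is misconfigured. The sketch below assumes the project.logs.sgtm_requests table used elsewhere in this section and assumes the sink records each request's HTTP status as httpRequest.status (an assumption; substitute whatever field your sink actually provides).

-- Hedged sketch: rolling 28-day availability and the fraction of the
-- 0.1% error budget consumed so far.
WITH window_28d AS (
  SELECT
    COUNTIF(httpRequest.status BETWEEN 200 AND 299) AS good,
    COUNT(*) AS total
  FROM `project.logs.sgtm_requests`
  WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 28 DAY)
)
SELECT
  SAFE_DIVIDE(good, total) AS availability,
  1 - SAFE_DIVIDE(good, total) AS error_rate,
  SAFE_DIVIDE(1 - SAFE_DIVIDE(good, total), 0.001) AS budget_consumed_fraction
FROM window_28d;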

Absolute thresholds on request rate don’t work — sGTM traffic follows business patterns (daily cycles, weekday vs. weekend, seasonal). What you want is: “is today’s 2pm request rate meaningfully different from last Tuesday’s 2pm request rate?”

Cloud Monitoring supports this natively via “anomaly detection” conditions on metrics. For request count (run.googleapis.com/request_count), set the condition type to “Anomaly” and specify the comparison window (typical: 7 days). The alert fires when today’s rate deviates more than 3 standard deviations from the baseline.

Alternative — scheduled BigQuery query: if your Cloud Logging sinks to BigQuery, a scheduled query comparing current-hour volume to a 7-day-prior same-hour average is simple and explicit:

WITH current_hour AS (
  -- Requests in the last 60 minutes.
  SELECT COUNT(*) AS cnt
  FROM `project.logs.sgtm_requests`
  WHERE timestamp BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
                      AND CURRENT_TIMESTAMP()
),
baseline AS (
  -- Average and spread of the same clock hour across each of the prior 7 days.
  -- (If weekday/weekend patterns differ sharply, add a DAYOFWEEK filter and
  -- widen the window to 28 days so the baseline still has several samples.)
  SELECT AVG(cnt) AS avg_cnt, STDDEV(cnt) AS stddev_cnt
  FROM (
    SELECT DATE(timestamp) AS day, COUNT(*) AS cnt
    FROM `project.logs.sgtm_requests`
    WHERE timestamp BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
                        AND TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
      AND EXTRACT(HOUR FROM timestamp) = EXTRACT(HOUR FROM CURRENT_TIMESTAMP())
    GROUP BY day
  )
)
SELECT
  current_hour.cnt,
  baseline.avg_cnt,
  (current_hour.cnt - baseline.avg_cnt) / NULLIF(baseline.stddev_cnt, 0) AS z_score
FROM current_hour, baseline;

Alert when ABS(z_score) > 3. Run hourly.

One dashboard, one screen, four panels:

  1. Request rate (line chart, current vs. 7-day prior same day).
  2. Error rate (line chart, percentage of 4xx and 5xx).
  3. P50 / P95 / P99 latency (line chart with three series).
  4. Downstream success rate by destination (stacked chart: GA4, Meta, TikTok, custom HTTP, each as its own series).

That’s it. Additional panels (instance count, CPU, memory) belong on a deeper debugging dashboard, not the overview. The overview should be readable in 10 seconds at 2am by the person who got paged.
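
If your logs sink to BigQuery, each panel can be prototyped as a query before you build the Cloud Monitoring version. A sketch for panel 2 (error rate in 5-minute buckets) follows; it assumes the project.logs.sgtm_requests table and an httpRequest.status field, both of which are assumptions to adjust to your schema.

-- Hedged sketch: request count and error rate per 5-minute bucket, last 24 hours.
SELECT
  TIMESTAMP_SECONDS(300 * DIV(UNIX_SECONDS(timestamp), 300)) AS bucket,
  COUNT(*) AS requests,
  SAFE_DIVIDE(COUNTIF(httpRequest.status >= 400), COUNT(*)) AS error_rate
FROM `project.logs.sgtm_requests`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY bucket
ORDER BY bucket;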

Every alert has a runbook. When the alert fires, the notification should include a link to a runbook that says: what the alert means, how to confirm it’s real, what to check first, how to resolve the most common causes. Alerts without runbooks are alerts the team doesn’t know how to action.

Every alert gets tested. After creating an alert, intentionally trigger the condition (e.g., deploy a broken template in preview, run a synthetic high-traffic test) and confirm the notification arrives. Alerts created but never tested have a non-trivial chance of being silently misconfigured.

Review alerts quarterly. Delete alerts that haven’t fired in 6 months (probably covering a non-existent failure mode). Tune alerts that fire weekly without anyone taking action (threshold is too tight). The goal is that every firing alert leads to a human investigating.

Alerting on downstream vendor errors as if they were your problem. Meta CAPI returning 400 for a specific event is often a vendor-side issue — a rejected payload format, a changed schema. You want to know about it, but paging the on-call at 2am for a 0.5% Meta rejection rate doesn’t help anyone. Route downstream-vendor alerts to an email or Slack channel, not PagerDuty.

Using absolute thresholds on rates. “Alert when request rate < 100/min” works during business hours and fires every night at 3am. Use anomaly detection or business-hours-qualified thresholds instead.

Setting P95 latency alerts below what’s physically achievable. If your sGTM routinely has 300ms P95 latency and you set an alert at 200ms, it fires constantly and teaches the team to ignore it. Set latency alerts ~50% above observed baseline, not at aspirational targets.
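
One way to pick that threshold is to measure the baseline directly from the log sink. The sketch below assumes a numeric jsonPayload.latency_ms field, which is a made-up name for illustration; substitute however your logs actually record per-request latency.

-- Hedged sketch: observed 7-day P95 latency and a suggested alert threshold
-- roughly 50% above it.
SELECT
  APPROX_QUANTILES(jsonPayload.latency_ms, 100)[OFFSET(95)] AS observed_p95_ms,
  ROUND(APPROX_QUANTILES(jsonPayload.latency_ms, 100)[OFFSET(95)] * 1.5) AS suggested_alert_ms
FROM `project.logs.sgtm_requests`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);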

Monitoring nothing except /healthz. Health-check-only monitoring is better than nothing, but it misses the majority of real production issues. A container that responds 200 to /healthz while returning 500 to every actual event is “healthy” to basic monitoring. Log-based metrics catch this.

Configuring monitoring and never looking at the dashboards. Dashboards and alerts are useful in proportion to how often they’re looked at and tuned. A “set and forget” monitoring setup drifts out of usefulness within 6 months as traffic patterns change, deployments alter error fingerprints, and new failure modes emerge that existing alerts don’t cover.