
SLOs and alert fatigue: a practical guide

8 min read

Defining SLOs that matter, burn-rate alerting, and avoiding noise so on-call stays actionable.

The problem with alert-on-everything

Most teams alert on symptoms they can measure, not on outcomes that matter to users. The result: an on-call queue full of CPU and memory alerts, while actual user-facing degradation goes undetected until someone files a support ticket.

SLOs (Service Level Objectives) invert this. You define what "good" looks like for users, measure it, and alert only when you're burning through your error budget too fast.

Defining a useful SLO

An SLO has three components:

  • SLI (indicator): What you measure. For most services, start with request success rate and latency at the 95th percentile.
  • Objective: The threshold. E.g., "99.5% of requests succeed within 500ms over a 30-day rolling window."
  • Error budget: The 0.5% you're allowed to fail. This is your innovation budget — how much you can break things in pursuit of features.

Good SLIs are user-journey focused, not infrastructure-focused. "Checkout page loads in under 2 seconds" is better than "p95 latency under 200ms on the payments microservice".
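To make the error budget concrete, here's a quick sketch (plain Python, independent of any monitoring stack) of how much outright downtime a 99.5% objective buys over a 30-day window:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total failure the error budget permits per rolling window."""
    return (1 - slo) * window_days * 24 * 60

# A 99.5% SLO over 30 days leaves roughly 216 minutes (~3.6 hours) of budget.
budget = error_budget_minutes(0.995)
print(f"{budget:.0f} minutes of budget")
```

In practice the budget is rarely spent as contiguous downtime; a 1% error rate sustained for a long stretch drains it just as surely as a short hard outage.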

Burn-rate alerting

Don't alert when you breach the SLO. Alert when you're consuming your error budget faster than is sustainable. Google's SRE Workbook suggests multiwindow thresholds along these lines:

  • Fast burn: Consuming 2% of the monthly budget in 1 hour (a 14.4x burn rate) → wake someone up immediately.
  • Slow burn: Consuming 5% of the monthly budget in 6 hours (a 6x burn rate) → ticket for the next business day.

This means a major outage wakes the on-call engineer. A minor degradation that would resolve itself doesn't.
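The burn-rate arithmetic is simple enough to sketch in a few lines of plain Python (illustrative only, not library code): a burn rate is the fraction of budget consumed, scaled by how much shorter the measurement window is than the SLO period. The 14.4 figure below is the canonical fast-burn multiplier for a 1-hour window against a 30-day period.

```python
def burn_rate(budget_fraction: float, window_hours: float,
              period_hours: float = 30 * 24) -> float:
    """How many times the sustainable pace the budget is being consumed,
    if `budget_fraction` of it is spent within `window_hours`."""
    return budget_fraction * period_hours / window_hours

fast = burn_rate(0.02, 1)   # 2% of a 30-day budget in 1 hour -> 14.4x
slow = burn_rate(0.05, 6)   # 5% of a 30-day budget in 6 hours -> 6x
```

A burn rate of exactly 1x means you'd finish the period with zero budget left; anything above 1x sustained for the whole window eventually breaches the SLO.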

Prometheus + Grafana implementation

With Prometheus, record a success-ratio metric over the alerting window (here 1 hour; in practice you'd record one rule per window you alert on):

record: job:request_success_rate:ratio_rate1h
expr: |
  sum(rate(http_requests_total{status!~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))

Then define burn-rate alerting rules in Prometheus (Alertmanager handles the routing and paging):

- alert: SLOBurnRateFast
  expr: |
    job:request_success_rate:ratio_rate1h < 1 - (14.4 * 0.005)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "SLO fast burn — {{ $value | humanizePercentage }} success rate"

The 14.4 multiplier means the 1-hour window is consuming error budget 14.4x faster than the rate that would exactly exhaust it over 30 days, which is the fast-burn threshold.
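Unpacking that alert expression with a plain-Python sketch of the arithmetic, using the 99.5% objective from earlier:

```python
slo_target = 0.995   # 99.5% success objective
fast_burn = 14.4     # fast-burn multiplier for a 1h window / 30-day period

# Error rate at which the budget burns at 14.4x the sustainable pace:
error_threshold = fast_burn * (1 - slo_target)   # 14.4 * 0.005 = 0.072
# The alert fires when the success ratio drops below:
success_threshold = 1 - error_threshold          # 0.928
print(f"fire below {success_threshold:.1%} success")
```

So the page goes out at roughly 92.8% success: well before the SLO itself is breached, but only when the degradation is severe enough to matter.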

What to do with error budget

When your error budget is healthy, deploy features. When it's low, freeze risky changes and focus on reliability. When it's gone, halt feature work entirely until the SLO is met — this isn't a punishment; it's the mechanism that makes the trade-off explicit.

Starting small

You don't need to instrument everything at once. Pick your three most user-facing services. Define one SLI and one objective per service. Set up burn-rate alerts. Iterate from there.


Want help applying this to your infrastructure?

We work with startups and scale-ups on platform engineering, cloud infrastructure, and CI/CD. Book a call to discuss.
