SLOs, Error Budgets, and Production Reliability: A Practical Guide
An SLO is a time-bounded reliability target for a user-facing journey; its error budget is the allowable number of bad events in that window. Healthy teams alert on budget burn, not every metric blip, and slow launches when the budget is spent.
Service Level Objectives (SLOs) are internal reliability targets expressed as a percentage over a window—for example, 99.9% of API requests complete successfully in under 300 ms each calendar month. Service Level Indicators (SLIs) are the measurable good events divided by valid events; error budgets are the allowable unreliability (e.g. 0.1% in a 99.9% SLO) you can spend on launches, refactors, or aggressive rollouts. This guide shows how to choose SLIs, set realistic targets, tie alerts to SLO burn rather than noisy thresholds, and align product and engineering on trade-offs. The framing follows Google's Site Reliability Engineering book (Chapter "Service Level Objectives"), which popularized error budgets as a way to balance velocity and stability.
Key takeaways
Measure user-perceived reliability (successful, fast-enough requests)—not just server uptime. A service can be "up" but unusably slow.
Pick a small number of SLIs (often 1–3 per user journey): availability, latency, and sometimes freshness or correctness.
Connect on-call alerts to burn rate (how fast you consume the error budget), not to every blip in raw metric charts.
When the budget is exhausted or nearly so, slow feature launches and prioritize reliability work until the rolling window recovers.
SLOs are a product of organisational honesty: if leadership will not tolerate slowing feature work when budgets burn, the SLO is theatre. Write down the policy before incidents force improvised politics.
From SLI to SLO: concrete examples
Availability SLI: proportion of HTTP GET /v1/orders/{id} calls that return 2xx or valid 404 (not 5xx) over all calls excluding client aborts. Latency SLI: proportion of those calls with server-side duration ≤ 200 ms at the edge, measured at the load balancer.
Example 99.9% monthly availability SLO: across ~43.8 million requests in 30 days, you can have ~43,800 bad requests before missing the objective. That remainder is your error budget for planned risk.
Stricter tiers exist: 99.95% allows half the bad events of 99.9%, and 99.99% allows ten times fewer than 99.9%—each added "nine" materially increases engineering and redundancy cost.
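The budget arithmetic above can be sketched directly. The request volume is the article's example figure; the function itself is just the definition of an error budget:

```python
# Error budget arithmetic for availability SLOs, using the article's
# example volume of ~43.8M requests per 30-day window.
def error_budget(slo: float, total_requests: int) -> int:
    """Allowable bad requests in the window for a given SLO."""
    return round(total_requests * (1 - slo))

requests_per_month = 43_800_000
for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%}: budget = {error_budget(slo, requests_per_month):,} bad requests")
# Yields 43,800 / 21,900 / 4,380 — each tier shrinks the budget for planned risk.
```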
Latency: percentiles vs SLI
Raw p99 dashboards help debugging but make poor SLOs alone because a single long incident can dominate. SRE-style SLIs often encode latency as a proportion under a threshold ("99% of requests faster than 300 ms") combined with availability.
Define whether you measure server-side, client-side, or end-to-end latency; each tells a different story. For mobile clients, factor network variance into the narrative.
Cold starts, cache misses, and GC pauses show up as long tails—document which components contribute before you tighten thresholds arbitrarily.
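Encoding latency as a proportion under a threshold, rather than a raw percentile, can be sketched as below. The sample durations and the 300 ms threshold are illustrative, not measured values:

```python
# Latency SLI as "proportion of requests faster than a threshold",
# per the SRE-style framing above. Durations are hypothetical.
def latency_sli(durations_ms: list[float], threshold_ms: float) -> float:
    """Good events (fast enough) over valid events."""
    if not durations_ms:
        return 1.0  # no traffic: conventionally treated as meeting the SLI
    good = sum(1 for d in durations_ms if d <= threshold_ms)
    return good / len(durations_ms)

samples = [120, 180, 250, 900, 140, 310, 95, 205]  # ms, hypothetical
print(f"SLI at 300 ms: {latency_sli(samples, 300):.3f}")  # 6 of 8 under threshold
```

Note how the single 900 ms outlier costs one bad event rather than dominating the figure, which is the advantage over reporting raw p99 alone.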
Customer-facing SLAs versus internal SLOs
SLAs are contractual promises with remedies; SLOs are internal targets that should be stricter than SLAs to provide buffer. Publishing an SLA without an internal SLO invites accidental breach.
Differentiate admin tools from customer APIs—tightening one budget to match the other wastes money or sets the wrong external expectation.
SLO review cadence and error budget policy
Review SLOs quarterly: are thresholds still aligned with user needs and cost? Stale SLOs either burn people out or provide false confidence.
Write an error budget policy: who approves launches when budget is low, how feature freezes work, and how exceptions are recorded. Ad-hoc heroics are not governance.
Using error budgets in product decisions
If budget is healthy, teams may absorb more release risk: feature flags default on, database migrations proceed during business hours with monitoring.
If budget is depleted, freeze risky changes, add capacity, fix flaky dependencies, and postpone large refactors until burn slows—this is how reliability becomes an explicit product trade-off instead of an afterthought.
Document decisions: "We accept 99.5% API availability for this internal admin tool vs 99.9% for the customer API" so stakeholders do not debate targets mid-incident.
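A minimal policy gate makes the trade-off above mechanical, assuming you can query remaining budget as a fraction of the window's allowance. The thresholds (50% and 10%) and posture wording are hypothetical examples, not values from the source:

```python
# Sketch of an error-budget policy gate: map remaining budget to a
# launch posture. Thresholds (0.5 / 0.1) are hypothetical examples.
def launch_posture(budget_remaining: float) -> str:
    if budget_remaining >= 0.5:
        return "normal: flags default on, routine migrations allowed"
    if budget_remaining >= 0.1:
        return "cautious: risky launches need explicit approval"
    return "frozen: reliability work only until the window recovers"

print(launch_posture(0.8))   # healthy budget
print(launch_posture(0.05))  # depleted budget
```

Encoding the policy as code (or config) is what keeps it from becoming ad-hoc heroics mid-incident.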
Alerting: multi-window burn rates
Google's multi-burn-rate alerting (described in SRE workbook material) uses short and long windows—for example, 2% budget burn in 1 hour (page quickly) vs slow burn over days (ticket, not page). This reduces pager fatigue while catching real regressions.
Every alert should link to a runbook: dashboards, likely causes, rollback commands, and escalation—not merely "high error rate".
Tune alert thresholds from historical incident data, not defaults copied from blog posts—your traffic shape matters.
Toil, capacity, and human cost
If on-call load rises as you add SLOs, you likely have too many objectives or noisy SLIs. Consolidate journeys before hiring more pager rotations.
Capacity planning should reference saturation signals tied to SLOs—CPU alone is a weak proxy for user-visible latency under load.
Limitations
SLOs describe steady-state user experience; they do not replace security monitoring, fraud detection, or data-quality checks.
Choosing thresholds without baseline measurements invites gaming (e.g. narrowing the valid-event definition until the numbers look good). Start from measured distributions, then tighten.
Very small services with low traffic have noisy SLIs; use longer windows or merge correlated journeys.
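For low-traffic services, putting a confidence interval around the measured SLI makes the noise explicit. This sketch uses a Wilson score interval, one common choice for binomial proportions (the source does not prescribe a method):

```python
import math

# Wilson score interval for a measured SLI (good / valid events).
# Useful for low-traffic services where the point estimate is noisy.
def sli_confidence_interval(good: int, valid: int, z: float = 1.96):
    """Return (low, high) ~95% bounds on the true success ratio."""
    if valid == 0:
        return (0.0, 1.0)  # no data: maximal uncertainty
    p = good / valid
    denom = 1 + z**2 / valid
    centre = (p + z**2 / (2 * valid)) / denom
    half = (z * math.sqrt(p * (1 - p) / valid + z**2 / (4 * valid**2))) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

# 997 good of 1,000 valid: the point estimate is 99.7%, but the interval
# spans roughly a full "nine", so conclusions about a 99.9% target are weak.
low, high = sli_confidence_interval(997, 1000)
print(f"{low:.4f} - {high:.4f}")
```

When the interval is wider than the gap between adjacent nines, lengthen the window or merge journeys, as the text suggests.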
Embedding SLOs in developer workflows
Developers should see SLO status in CI or deployment dashboards—when a canary consumes budget, rollbacks should be one click and culturally normal.
Add SLO regression checks to release notes for high-risk services: what changed, what was observed in staging load tests, and which dashboards to watch post-deploy.
User journeys and dependency mapping
Define journeys from the user's perspective: "submit expense" not "service X health". Map upstream dependencies per journey so incidents trace to customer impact quickly.
Third-party SaaS dependencies belong in SLO narratives—even if you cannot control their uptime, you can control timeouts, caching, and graceful degradation paths.
SLOs for batch and data pipelines
Not everything is request/response. For pipelines, define timeliness SLOs: percentage of jobs completing within N minutes of schedule, and freshness SLIs for downstream analytics consumers.
Separate user-facing SLOs from internal ETL SLOs—mixing them confuses incident response priorities.
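A timeliness SLI for scheduled jobs can be sketched as the fraction of runs completing within N minutes of schedule. The run data and 60-minute slack are illustrative:

```python
from datetime import datetime, timedelta

# Timeliness SLI for a batch pipeline: proportion of runs that finished
# within `slack` of their scheduled time. Run data is hypothetical.
def timeliness_sli(runs: list[tuple[datetime, datetime]],
                   slack: timedelta) -> float:
    """runs: (scheduled_at, completed_at) pairs for the window."""
    if not runs:
        return 1.0
    on_time = sum(1 for sched, done in runs if done - sched <= slack)
    return on_time / len(runs)

runs = [
    (datetime(2024, 5, 1, 2, 0), datetime(2024, 5, 1, 2, 20)),  # 20 min: on time
    (datetime(2024, 5, 2, 2, 0), datetime(2024, 5, 2, 3, 10)),  # 70 min: late
    (datetime(2024, 5, 3, 2, 0), datetime(2024, 5, 3, 2, 45)),  # 45 min: on time
]
print(timeliness_sli(runs, slack=timedelta(minutes=60)))  # 2 of 3 on time
```

The same shape works for freshness: replace the schedule with the timestamp of the newest record a downstream consumer can read.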
Executive storytelling with SLOs
Translate error budgets into business language: "We consumed half our monthly budget in three days during the sale—feature freezes protect revenue next week."
Avoid false precision: show ranges and confidence when traffic is low; executives should understand uncertainty, not hide it.
First thirty days of SLO adoption (practical sequence)
Week 1: pick one critical user journey; instrument good/bad events honestly; accept messy data before optimising dashboards.
Week 2: draft SLIs from measured baselines—not aspirational targets—and preview what an SLO would allow in error budget.
Week 3: socialise targets with product and get explicit agreement on freeze/rollback policy when budgets burn fast.
Week 4: wire burn-based alerts with runbooks; run a tabletop exercise on a synthetic outage to validate paging makes sense.