SRE and Observability Framework: SLOs Define Reliability

Published

15 Jun 2026

Author

Laxmi Hari Nepal


Two months after launch, the on-call rotation is the only thing the engineering team talks about. Every degradation gets handled as a P0. Every latency spike wakes someone at 3am. Every flaky third-party integration produces a ticket marked URGENT, an incident channel, and a post-mortem that nobody has the energy to act on. The team responds to all of it because the standard is unwritten — there is no agreed line between "this needs human attention now" and "this is within acceptable variance, log it and move on." Six months later two senior engineers have left, the team is debating whether to disable alerting on the noisiest service, and reliability has actually gotten worse because the engineers who would have fixed the underlying issues spent their nights fighting symptoms.

This is the failure pattern an SLO-driven SRE and observability discipline is built to prevent. Not by adding more dashboards (most teams already have plenty) but by giving the team a measurable reliability target the business has signed off on: a service level objective, a number both engineering and product can hold. Once the SLO exists, the rest of the operating model rearranges itself around it: alerts fire on what threatens the SLO, error budgets guide the trade-off between shipping features and improving reliability, and on-call stops being a noise-management exercise and becomes a deliberate response to events that matter.

In the Built to Last™ 2.0 framework, the SRE and observability framework sits in P05, The Right Code, alongside DevSecOps, IAM least-privilege, and secrets management. Each component is a build standard the pipeline enforces. The SRE framework's specific contribution is to translate reliability from a feeling ("the system seems flaky") into a number both engineering and the business can read the same way. This article walks through what an SLO actually looks like for a product team, how error budgets change behaviour, and how to set the whole thing up before launch so the first months in production are about operating the system rather than firefighting it.

Why The Team Without SLOs Burns Out First

The cost of operating without defined reliability targets compounds in a predictable order. Week one in production, the team responds to everything because the system is new and they want to know it well. Week four, the volume of alerts has not dropped, and the team is starting to triage during incidents rather than respond — but every alert is still treated as critical because no agreed standard exists for what should be ignored. Week ten, a real incident — an outage in a payment flow, say — fires the same alert pattern as the previous thirty false positives, and the response is slower than it would have been at week one.

The next stage is attrition. The engineers who carry the on-call burden are the ones with enough system knowledge to triage in the first place; they are also the ones with options elsewhere. When one of them leaves, the rotation tightens onto the people who remain, and the cycle accelerates. The product gets slower to ship features because the people who would build them are recovering from being paged, and the people who join are inheriting an alert backlog they did not build and cannot easily tune.

The business cost runs in parallel. Without an SLO, every reliability conversation between engineering and product becomes a debate without a referee. Engineering says "we should pause feature work to address reliability"; product says "reliability looks fine, ship the feature." Both are arguing from anecdote because no number is in the room. The conversation never resolves the same way twice, priorities oscillate, and neither side trusts the other's framing. An SLO settles the conversation by making reliability a metric the entire organisation can read.

The further cost is reputational, and it is harder to see. Customers do not write churn-survey responses describing an SLO breach; they describe a feeling — "the product was getting unreliable" — and they leave. By the time the qualitative signal is visible in retention numbers, the underlying pattern has been visible in the telemetry for months. Without an SLO, the telemetry was never compared to a target, so nobody knew the pattern was a problem.

What An SRE And Observability Framework Actually Is

The framework is the set of disciplines that turn reliability engineering into an operating model the team can hold. It has four constituent parts. Each plays a specific role; missing any of them collapses the others.

Service level indicators

The SLI is the measurement of the thing users actually care about. Not server uptime, not database CPU — the metric whose failure looks like the system being broken to the user. For an API, it is usually request success rate and latency at the 99th percentile. For a checkout flow, it is the success rate of the full transaction. For an AI feature, it is response success rate combined with latency and a hallucination indicator where applicable. For a mobile product the right indicator is normally crash-free session rate per OS version, which is why our mobile delivery practice sets that SLI before the first beta. The SLI is chosen by working backwards from the user's experience, not forwards from what the monitoring tool happens to collect.

Most teams already collect the underlying telemetry. The discipline is to pick a small number — typically two to four SLIs per critical user journey — and commit to them as the canonical measurement. Everything else stays in dashboards for diagnostic use.
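As a rough illustration of what computing an SLI from existing telemetry can look like, the sketch below derives a success rate and a 99th-percentile latency from a list of request records. The `Request` record shape, the function names, and the synthetic sample are assumptions made for this example; a real system would read these values from its own metrics store.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Request:
    """One observed request on a critical user journey (hypothetical record shape)."""
    succeeded: bool
    latency_ms: float


def success_rate(requests: list[Request]) -> float:
    """Fraction of requests that succeeded, between 0.0 and 1.0."""
    if not requests:
        raise ValueError("no requests observed")
    return sum(r.succeeded for r in requests) / len(requests)


def latency_percentile(requests: list[Request], percentile: float) -> float:
    """Nearest-rank percentile of latency, e.g. percentile=99 for p99."""
    if not requests:
        raise ValueError("no requests observed")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(percentile * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


# Synthetic sample: 985 normal responses, 10 slow ones, 5 failures.
sample = (
    [Request(True, 120.0)] * 985
    + [Request(True, 900.0)] * 10
    + [Request(False, 50.0)] * 5
)

print(success_rate(sample))            # 0.995
print(latency_percentile(sample, 99))  # 120.0
```

In this sample the ten slow responses make up exactly the slowest 1%, so p99 stays at 120 ms while a stricter percentile such as 99.5 would surface the 900 ms tail. Which percentile to commit to is part of the SLI decision.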

Service level objectives

The service level objective is the target the SLI is measured against. "99.9% of API requests in the last 28 days succeeded within 400ms." "Checkout completed within five seconds for 99.5% of sessions across the rolling month." The SLO is a number, a time window, and a measurement — and crucially, it is agreed by the business as much as by engineering. The product owner signs off because the SLO determines feature velocity. The commercial sponsor signs off because the SLO is implicitly a customer commitment.
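One way to make "a number, a time window, and a measurement" concrete is to record the SLO as a small data structure and check observed counts against it. This is a minimal sketch; the `SLO` class, its field names, and the event counts are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """An SLO as described above: a measurement, a target, and a time window."""
    measurement: str   # what counts as a "good" event
    target: float      # e.g. 0.999 for 99.9%
    window_days: int   # length of the rolling window

    def is_met(self, good_events: int, total_events: int) -> bool:
        """True if the observed ratio over the window meets the target."""
        if total_events == 0:
            return True  # assumption: no traffic means nothing was violated
        return good_events / total_events >= self.target


api_slo = SLO(
    measurement="API requests that succeed within 400ms",
    target=0.999,
    window_days=28,
)

# Hypothetical counts for one 28-day window.
print(api_slo.is_met(good_events=999_200, total_events=1_000_000))  # True  (99.92%)
print(api_slo.is_met(good_events=998_500, total_events=1_000_000))  # False (99.85%)
```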

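The SLO is deliberately set lower than what the system can hit in its best state. A system that typically runs at 99.99% against a 99.9% SLO has headroom; the same system held to a 99.99% SLO has none, because the SLO's job is to define "acceptable," not "best." The error budget is the gap between 100% and the SLO: the amount of failure the target permits. The gap between current performance and the SLO is the headroom still unspent inside that budget.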

Error budgets

The error budget is the operating mechanism that changes team behaviour. If the SLO is 99.9% over 28 days, the error budget is 0.1% — about 40 minutes of acceptable degradation, or the equivalent in elevated error rates, in any 28-day window. While the budget is intact, the team ships features. When the budget is at risk of being burned, the team prioritises reliability work over feature work. When the budget is exhausted, feature releases pause until the system returns to target.
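The arithmetic behind the 40-minute figure, and the three behaviours tied to the budget, can be sketched as follows. The function names are illustrative, and the 25% "at risk" threshold is a hypothetical value chosen for the example; the point at which a budget counts as at risk is a team decision.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of total failure the SLO tolerates inside one window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, window_days: int, bad_minutes: float) -> float:
    """Fraction of the budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1 - bad_minutes / budget


def release_policy(remaining: float, at_risk_below: float = 0.25) -> str:
    """Map remaining budget to the three behaviours described above.

    at_risk_below is a hypothetical threshold, not a standard value.
    """
    if remaining <= 0:
        return "pause feature releases"
    if remaining < at_risk_below:
        return "prioritise reliability work"
    return "ship features"


budget = error_budget_minutes(0.999, 28)
print(f"{budget:.1f} minutes")  # 40.3 minutes

print(release_policy(budget_remaining(0.999, 28, bad_minutes=10)))  # ship features
print(release_policy(budget_remaining(0.999, 28, bad_minutes=35)))  # prioritise reliability work
print(release_policy(budget_remaining(0.999, 28, bad_minutes=45)))  # pause feature releases
```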

This is the mechanism that ends the every-incident-is-a-P0 dynamic. A small latency blip well within the error budget does not need an after-hours response — it gets logged and reviewed in the next ops meeting. A burn-rate alert that says the budget will be exhausted within the week is a P0. The team's energy is rationed against the SLO instead of against the loudest alert.

Alerting, on-call, and incident response

Once the SLO and error budget are defined, the alerting policy follows from them. Alerts fire on burn rate — the rate at which the error budget is being consumed — not on raw thresholds. A burn-rate alert tuned to fire when the current consumption would exhaust the 28-day budget in two hours is the alert that wakes someone at night. A slower burn rate raises a ticket for daytime triage. The on-call engineer responds to a small number of high-signal pages instead of a large number of low-signal ones, and the runbook for each alert is written before the alert can fire — covering identification, immediate mitigation, escalation path, and the steps to confirm the burn rate has stabilised.
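A minimal sketch of burn-rate routing, under stated assumptions: burn rate is taken as the observed error rate divided by the error rate the SLO allows, the full budget is assumed to be available, and the two-hour paging threshold is the example figure from the paragraph above. Function names are illustrative, and production setups usually evaluate burn rate over more than one time window to reduce false pages.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_rate / (1 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int) -> float:
    """Hours until a full budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def route_alert(
    error_rate: float,
    slo_target: float = 0.999,
    window_days: int = 28,
    page_within_hours: float = 2.0,
) -> str:
    """Return 'page', 'ticket', or 'none' for the observed error rate."""
    hours = hours_to_exhaustion(burn_rate(error_rate, slo_target), window_days)
    if hours <= page_within_hours:
        return "page"    # fast burn: wake the on-call engineer
    if hours < window_days * 24:
        return "ticket"  # slow burn: budget would run out before the window closes
    return "none"        # within budget: leave it in the dashboards


print(route_alert(error_rate=0.5))     # page
print(route_alert(error_rate=0.005))   # ticket
print(route_alert(error_rate=0.0005))  # none
```

With a 28-day window, exhausting the budget in two hours corresponds to a burn rate of 336 (672 hours divided by 2), which at a 99.9% SLO means roughly a third of requests failing. A team that wants to be paged earlier would raise `page_within_hours`.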

The incident response loop closes with a blameless post-mortem. The post-mortem produces three things: the contributing factors, the changes that will reduce the chance of recurrence, and an update to the SLO or alerting policy if the incident exposed a gap. Reliability work coming out of post-mortems is sequenced against the error budget — if it is healthy, the work goes into the backlog; if the budget is at risk, it pre-empts feature work.

The output of building the framework is a documented reliability posture: SLIs that matter for each critical user journey, SLOs and time windows, error budgets and how they are tracked, the alerting policy, the on-call rotation and runbooks, and the post-mortem cadence. It gets reviewed at the Production Readiness Review™ before launch. The room that builds it includes the engineering lead, the DevOps or SRE role responsible for production, the product owner (the SLO is a product decision), the on-call engineer who will field the alerts, and the commercial sponsor whose customer commitments the SLO implicitly defines. Including the commercial sponsor at design time is what prevents the most common failure mode: an SLO set by engineers alone, undermined by product launches that knowingly burn the budget, and ignored by the business when it becomes inconvenient.

Even with the framework in place, three failure modes recur. SLOs set against the system rather than the user — measuring API uptime when the user cares about checkout — produce green dashboards during customer-visible outages. Error budgets defined but not enforced — feature launches proceeding while the budget is exhausted — turn the SLO into a vanity metric. And alerts wired to channels with no owner turn the whole structure into theatre.

How To Set The Framework Up Without Stalling Delivery

The discipline does not need a separate workstream. It sequences alongside the normal pre-launch arc and reaches operating state at the Production Readiness Review. A four-to-six-week build runs in parallel with the late stages of feature development.

The first step, ideally in the same week the team starts cutting code that will go to production, is picking the SLIs. Map the two or three critical user journeys — for a payments product, that is authentication, payment authorisation, and balance lookup; for a SaaS product, it is sign-in, the primary feature flow, and any export or reporting customers depend on. For each, pick the metric that fails when the user's experience fails: success rate, latency at a percentile, or both. Resist the urge to instrument every endpoint; the SLI set is small by design.
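The journey-to-SLI mapping can live in a small, reviewable structure. The sketch below uses the payments journeys named above; the specific SLIs attached to each journey, the `Journey` class, and the cap of four SLIs per journey are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Journey:
    """A critical user journey and the SLIs chosen for it."""
    name: str
    slis: tuple[str, ...]


# Assumed cap: keeps the SLI set small by design.
MAX_SLIS_PER_JOURNEY = 4

payments_journeys = (
    Journey("authentication", ("success rate", "p99 latency")),
    Journey("payment authorisation", ("success rate", "p99 latency")),
    Journey("balance lookup", ("success rate",)),
)


def review_sli_set(journeys: tuple[Journey, ...]) -> list[str]:
    """Return a list of problems; an empty list means the set looks sane."""
    problems = []
    for journey in journeys:
        if not journey.slis:
            problems.append(f"{journey.name}: no SLI chosen")
        elif len(journey.slis) > MAX_SLIS_PER_JOURNEY:
            problems.append(f"{journey.name}: too many SLIs ({len(journey.slis)})")
    return problems


print(review_sli_set(payments_journeys))  # []
```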

The second step is setting the SLOs and getting business sign-off. This is a conversation, not an engineering decision. The team brings a starting proposal, such as "99.9% of authentication requests under 200ms across a 28-day rolling window," and walks through what each digit costs. Over that 28-day window, 99.9% gives about 40 minutes of error budget; 99.99% gives about four. The cost of higher reliability scales non-linearly. The business sponsor agrees the target with eyes open, knowing it constrains feature velocity in proportion. The agreed SLO is recorded with the date, the signatories, and the review cadence.
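For the sign-off conversation, it can help to put the candidate targets side by side. This short sketch prints the error budget for several targets over a 28-day window; the function name and the list of candidate targets are illustrative.

```python
WINDOW_DAYS = 28


def budget_minutes(target_percent: float, window_days: int = WINDOW_DAYS) -> float:
    """Error budget in minutes for a target given as a percentage."""
    return (100 - target_percent) / 100 * window_days * 24 * 60


for target in (99.5, 99.9, 99.95, 99.99):
    minutes = budget_minutes(target)
    print(f"{target}% -> {minutes:.1f} min per {WINDOW_DAYS} days")

# 99.5% -> 201.6 min per 28 days
# 99.9% -> 40.3 min per 28 days
# 99.95% -> 20.2 min per 28 days
# 99.99% -> 4.0 min per 28 days
```

Each additional nine leaves a tenth of the previous budget, which is the concrete form of "what each digit costs."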

The third step is wiring the error budget into the operating cadence. The current burn rate gets a dashboard the team reviews in every weekly engineering meeting. The alerting policy is rewritten so pages fire on burn rate rather than raw thresholds — typically a fast-burn alert for the on-call rotation and a slow-burn alert for next-business-day triage. The runbooks are drafted by the engineer who will respond to them, not the engineer who built the system; if the on-call engineer cannot follow the runbook at 2am, the runbook is wrong. This sequencing slots inside the structured cadence of our project delivery framework so the reliability work moves with the rest of the engagement rather than as a separate stream.

The fourth step is the Production Readiness Review. The SLO, error budget, alerting policy, on-call rotation, and runbooks are checked together. A system without an agreed SLO does not pass — not because the system is unsafe, but because the operating model that would catch failure is missing.
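The "checked together" rule can be expressed as a simple gate. This is a sketch under assumptions: the artifact keys and the dictionary shape are hypothetical, and a real review would also assess the quality of each artifact rather than only its presence.

```python
# The five artifacts the review checks together (key names are illustrative).
REQUIRED_ARTIFACTS = (
    "slo",
    "error_budget",
    "alerting_policy",
    "on_call_rotation",
    "runbooks",
)


def readiness_gaps(posture: dict) -> list[str]:
    """Names of required artifacts that are missing or empty."""
    return [name for name in REQUIRED_ARTIFACTS if not posture.get(name)]


def passes_review(posture: dict) -> bool:
    """A system passes only when every required artifact is present."""
    return not readiness_gaps(posture)


draft_posture = {
    "slo": "99.9% of sign-ins succeed within 200ms over 28 days",
    "error_budget": "tracked on the weekly engineering dashboard",
    "alerting_policy": "fast-burn page, slow-burn ticket",
    "on_call_rotation": ["engineer-a", "engineer-b"],
    "runbooks": [],  # none written yet
}

print(readiness_gaps(draft_posture))  # ['runbooks']
print(passes_review(draft_posture))   # False
```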

What to avoid comes down to three things. The first is setting SLOs in isolation from product: an SLO engineering owns and product never signs off on is broken at the first feature deadline. The second is setting SLOs too high: an SLO of 99.99% is rarely justified for a product whose telemetry has not yet demonstrated four-nines performance; it produces an error budget the team will burn through monthly, training everyone to treat the metric as noise. The third is skipping the burn-rate translation in alerting: burn-rate alerts are the technical mechanism that changes alert volume; without them, you have written an SLO and kept the old on-call experience.

The framework depends on broader observability — structured logging, metrics, distributed tracing — and on the CI/CD pipeline that gates merges. It pairs with the Production Readiness Review and the Runbook Library; neither is fully effective without the others. Treating the SRE framework as a standalone project rather than as part of a coherent DevOps and infrastructure delivery practice produces SLOs nobody operates by.

How A Burning-Out Team Re-Found Its Cadence

A mid-sized SaaS client (an engineering team of around twelve, with the product in customer-facing production for nine months) came to us with a measurable problem and a felt one. The measurable problem was that the on-call rotation had become a retention risk: two of the four engineers in rotation had left in the previous quarter, the remaining engineers were carrying alerts on a four-day rotation, and the new hires had not been ramped because the existing engineers were too tired to do the training. The felt problem was that the product team and engineering had stopped trusting each other's reliability framing; every prioritisation conversation defaulted to argument.

The first week was diagnostic. Every page in the previous month was reviewed against a single question: did this represent a customer-visible degradation, or did the system handle it transparently? Roughly two thirds had been transparent — retries succeeded, no customer reported anything, no support ticket landed. The team had been responding to events its own infrastructure was already absorbing.

The second and third weeks were SLO work. Three SLIs were chosen — sign-in success and latency, the primary feature flow's success rate, and the reporting export's completion rate. Initial SLOs were set conservatively: 99.5% across 28 days, with internal stretch targets to 99.9% as the team's reliability work compounded. The product head and the CEO signed off in a 45-minute meeting because the trade-off — feature velocity in exchange for an agreed reliability floor — was clearly framed.

The fourth week was alerting and on-call redesign. Burn-rate alerts replaced raw threshold pages. The rotation widened from four engineers to six because two of the new hires could now take fast-burn pages against a tested runbook. Sub-burn-rate events stopped paging entirely; they landed in a daily triage channel.

Three months later, after-hours pages were down by more than half, no further engineers had left, and the SLOs were holding within the agreed bands. The product team and engineering ran their next quarterly planning meeting using error-budget burn as one input to scoping; the argument-without-a-referee dynamic had ended.

When This Matters Most, And When You Can Defer

The framework is critical the moment the product is in revenue-bearing customer hands, the moment downtime carries reputational consequences, and the moment the engineering team has more than one engineer on rotation. It is critical for systems with contractual reliability commitments — SLAs published to customers, integrations with partners who have their own reliability dependencies, regulated workflows where the audit trail of incident response is part of the compliance evidence. For payments, healthtech, and any system that interacts with a regulated record, SLOs are non-negotiable from launch.

The framework can be deferred for genuine internal tools used by a small team that can tolerate occasional disruption; for proofs-of-concept explicitly scoped to be retired or rebuilt; and for the very early prototype phase where the question is "does this work at all" rather than "does this work reliably." Even there, picking the eventual SLIs at design time is cheap; what defers is the formal SLO commitment and the operating cadence around it.

The category that catches teams out is the internal tool that quietly becomes load-bearing. A product built for a small audience starts to underpin a critical workflow, and "we don't need SLOs for this" becomes the framing on the day a senior stakeholder is in a meeting watching the tool fail to load. If a system is on a trajectory to matter, the framework should be in place before it does.

What To Do Next

If your team is responding to alerts but cannot articulate what your reliability target actually is, the most useful next step is a one-hour session with the engineering lead and the product owner to draft a first SLO for the single most critical user journey. The number does not need to be perfect; it needs to exist. Once the draft exists, the rest of the framework — error budget, burn-rate alerts, runbooks — has somewhere to anchor. For the broader picture of how this work sits inside a structured engineering practice, see how we approach custom software delivery end to end. The next BTL component most teams need alongside this one is the Runbook Library — the operational artifact that makes SLO-driven on-call navigable for whoever answers the page.

Frequently Asked Questions

What's our reliability target supposed to be?

A specific SLO chosen by working backwards from the user's experience and forwards from what the business is willing to pay for. For most commercial products, 99.5% to 99.9% across a 28-day rolling window is a defensible starting point for the critical user journey, with the precise number set in a sign-off conversation that includes the product owner and the commercial sponsor. The trade-off is explicit: each additional nine cuts the error budget to a tenth (about 40 minutes per 28 days at 99.9%, about four at 99.99%), and the operational discipline needed to stay inside it rises accordingly. Start with a number the system can demonstrably hold and tighten it as the team's reliability engineering work compounds.

How much downtime is acceptable?

The amount your error budget says it is. A 99.9% SLO across 28 days gives you roughly 40 minutes of acceptable degradation; a 99.95% SLO gives you about 20. Acceptable does not mean ignored — it means the team treats it as background the system has absorbed, rather than a 3am page. The business sign-off on the SLO is implicitly a sign-off on the error budget, so the conversation about acceptable downtime happens once, calmly, at design time — not in the middle of an incident.

When should we alert, and on what?

On burn rate against the SLO, not on raw infrastructure thresholds. A fast-burn alert — the error budget would be exhausted within hours at the current rate — pages the on-call engineer immediately. A slow-burn alert — the budget would be exhausted before the SLO window closes — raises a ticket for daytime triage. Raw thresholds on CPU, memory, or per-endpoint latency stay in dashboards for diagnostic use; they should not page humans by default.

What's an error budget?

It is the allowed failure rate implicit in your SLO, expressed as a budget the team operates against. If your SLO is 99.9% success across 28 days, your error budget is 0.1% — the maximum amount of failure the system can produce while still meeting the SLO. The budget changes team behaviour because it sets the priority rule: while the budget is healthy the team ships features; as the budget approaches exhaustion the team prioritises reliability work; once exhausted, feature releases pause until the system returns to target.
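For a success-rate SLO, the budget is often easier to reason about as a count of requests rather than minutes. The sketch below assumes the target is passed as a decimal string so the arithmetic stays exact; the function names and traffic figures are illustrative.

```python
from decimal import Decimal
import math


def allowed_failures(total_requests: int, slo_target: str) -> int:
    """Largest number of failed requests that still meets the SLO.

    slo_target is a decimal string such as "0.999" to avoid float rounding.
    """
    budget_fraction = 1 - Decimal(slo_target)
    return math.floor(total_requests * budget_fraction)


def budget_consumed(failed_requests: int, total_requests: int, slo_target: str) -> float:
    """Share of the error budget already spent (above 1.0 means the SLO is missed)."""
    allowed = allowed_failures(total_requests, slo_target)
    if allowed == 0:
        return float("inf") if failed_requests else 0.0
    return failed_requests / allowed


# Hypothetical traffic: one million requests in the 28-day window.
print(allowed_failures(1_000_000, "0.999"))      # 1000
print(budget_consumed(250, 1_000_000, "0.999"))  # 0.25
```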

How is this different from the monitoring we already have?

Monitoring tells you what is happening; the SRE framework tells you whether what is happening is acceptable. Two teams with identical dashboards can have completely different on-call experiences — one because their alerting fires on threshold breaches, the other because it fires on burn rate against a defined SLO. The framework is the agreed reliability target plus the incident response model that surrounds it, not the telemetry underneath. This is also why observability without SLOs tends to produce the support-ticket-first failure pattern that becomes most visible in the period after launch.

Who owns the SLO when it gets uncomfortable, and who handles on-call?

The product owner and the commercial sponsor who signed off on the SLO own it jointly with engineering. The SLO is a customer commitment as much as a technical target. When the error budget is exhausted, the product owner decides whether feature work pauses or the SLO needs renegotiation. On-call sits with the build team — engineers responding to alerts they tuned and runbooks they wrote — because separating response from build produces the longest mean time to repair. Where internal capacity is thin, an embedded augmented engineering team can carry on-call without breaking the build-team-owns-it principle.

Does the framework apply to AI systems differently?

The mechanics are the same — SLIs, SLOs, error budgets, burn-rate alerts — but the SLIs differ. For an AI feature the SLI is not just success rate and latency; it includes accuracy on a versioned benchmark, hallucination rate where applicable, and consistency. The AI evaluation framework feeds the SLI; the SLO is then set against the evaluation output the same way an API's SLO is set against request telemetry. Production AI systems are increasingly held to this standard under the EU AI Act, ISO 42001, and the NIST AI Risk Management Framework for higher-risk use cases. The full pattern is covered in how we deliver agentic AI.
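To show how evaluation output might feed an SLO check in the same way request telemetry does, here is a hedged sketch. The `AiSliSnapshot` shape and every threshold below are hypothetical values chosen only to make the example concrete. Consistency is left out because how it is measured depends on the evaluation framework in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AiSliSnapshot:
    """SLI readings for an AI feature over one window (hypothetical shape)."""
    benchmark_version: str
    success_rate: float        # share of requests answered without error
    p99_latency_ms: float
    benchmark_accuracy: float  # accuracy on the versioned benchmark
    hallucination_rate: float  # share of sampled answers flagged as hallucinated


# Hypothetical targets; real values come out of the SLO sign-off conversation.
MIN_SUCCESS_RATE = 0.995
MAX_P99_LATENCY_MS = 3000.0
MIN_BENCHMARK_ACCURACY = 0.90
MAX_HALLUCINATION_RATE = 0.02


def slo_violations(snapshot: AiSliSnapshot) -> list[str]:
    """Names of the SLIs that miss their target in this snapshot."""
    violations = []
    if snapshot.success_rate < MIN_SUCCESS_RATE:
        violations.append("success rate")
    if snapshot.p99_latency_ms > MAX_P99_LATENCY_MS:
        violations.append("p99 latency")
    if snapshot.benchmark_accuracy < MIN_BENCHMARK_ACCURACY:
        violations.append("benchmark accuracy")
    if snapshot.hallucination_rate > MAX_HALLUCINATION_RATE:
        violations.append("hallucination rate")
    return violations


snapshot = AiSliSnapshot(
    benchmark_version="v3",
    success_rate=0.998,
    p99_latency_ms=2400.0,
    benchmark_accuracy=0.93,
    hallucination_rate=0.035,
)
print(slo_violations(snapshot))  # ['hallucination rate']
```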

Laxmi Hari Nepal

Laxmi Hari drives Agile transformation, coaching teams to adopt Agile methodologies, instill values, and foster a culture of continuous improvement.