Google Explains Why Others Are Doing SRE Wrong
Source – infoq.com
Stephen Thorne, customer reliability engineer at Google, recently spoke at the DevOps Enterprise Summit London on what Site Reliability Engineering (SRE) is and how many organizations fail to understand its basic premises and benefits. Confusing service level objectives (SLOs), which focus on early failure detection, with service level agreements (SLAs), which often serve as financial compensation for past incidents; not enforcing error budgets; and not dedicating at least 50% of SRE teams’ effort to improving systems and tools (instead letting them continue to drown in toil, aka “firefighting” in production) are some key misunderstandings that Thorne has seen in other organizations.
SLOs are fundamental in detecting issues early, ideally before the effects become visible to customers, added Thorne. A good SLO is aligned with outcomes for the customer (service availability or response time, for example) and thus reflects whether the system (behavior) is meeting user needs. Resource usage (like CPU utilization or network throughput) should be monitored, but not used as an SLO per se. Thorne put it simply as “if the customer is happy, then the SLO is being met”. Typical SLOs at Google include:
- uptime of 99.9% a month (i.e. at most about 43 minutes of downtime a month)
- 99.99% of HTTP requests in a month succeed with a 200 OK
- 50% of HTTP requests returned in under 300ms
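The arithmetic behind these example SLOs can be sketched in a few lines of Python. This is only an illustration, not Google's tooling: the function names are hypothetical, and the downtime figure assumes a 30-day month, which is what yields the roughly 43 minutes quoted above.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month (43,200 minutes)


def downtime_budget_minutes(availability_target, period_minutes=MINUTES_PER_MONTH):
    """Return the downtime (in minutes) that an availability SLO tolerates."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return (1.0 - availability_target) * period_minutes


def success_ratio(status_codes):
    """Fraction of requests that returned 200 OK (1.0 if there were none)."""
    if not status_codes:
        return 1.0
    return sum(1 for code in status_codes if code == 200) / len(status_codes)


def fraction_faster_than(latencies_ms, threshold_ms):
    """Fraction of requests that completed in under threshold_ms."""
    if not latencies_ms:
        return 1.0
    return sum(1 for ms in latencies_ms if ms < threshold_ms) / len(latencies_ms)


if __name__ == "__main__":
    # 99.9% uptime over a 30-day month leaves about 43 minutes of downtime.
    print(f"Downtime budget: {downtime_budget_minutes(0.999):.1f} minutes")

    # Hypothetical sample data, for illustration only.
    codes = [200, 200, 500, 200]
    latencies = [100, 200, 400, 500]
    print("Success SLO (99.99%) met:", success_ratio(codes) >= 0.9999)
    print("Latency SLO (50% under 300ms) met:",
          fraction_faster_than(latencies, 300) >= 0.5)
```

In practice these ratios would be computed by a monitoring system over a month of traffic rather than over in-memory lists.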
SLAs, on the other hand, typically come into play when customers are already dissatisfied with a service, so they do not help to proactively improve the system’s reliability. Further, SLAs can create the wrong incentives. For example, combining an SLA of 2 hours to fix an email issue with an SLA of 1 day to fix a serious production incident might lead a team to work on one (or more) email problems first, although the production issue should clearly be the priority.
Just defining SLOs is not enough, Thorne warned. Error budget policies enable meeting SLOs by setting clear rules for action (not monetary compensation) before a system gets close to an SLO’s threshold. This also minimizes confrontation between ops and dev when systems are failing to meet user needs. “The error budget is the gap between perfect reliability and our SLO,” said Thorne. For Google, a typical error budget policy is to disallow launching new features once an application has exhausted its error budget (for example, it is already over the 43-minute downtime budget for this month), or to dedicate a sprint to corrective actions stemming from previous post-mortem analysis.
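As a rough illustration of such a policy, the sketch below tracks downtime against the budget implied by an SLO and blocks feature launches once the budget is spent. The `ErrorBudget` class and `launch_decision` function are hypothetical names invented for this example, and the 30-day month is an assumption; a real policy would also spell out who enforces it and what corrective work follows.

```python
from dataclasses import dataclass

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


@dataclass
class ErrorBudget:
    """Hypothetical tracker for a monthly availability error budget."""
    slo_target: float                 # e.g. 0.999 for 99.9% uptime
    downtime_minutes: float = 0.0     # downtime observed so far this month
    period_minutes: int = MINUTES_PER_MONTH

    @property
    def budget_minutes(self) -> float:
        # The gap between perfect reliability (100%) and the SLO.
        return (1.0 - self.slo_target) * self.period_minutes

    @property
    def remaining_minutes(self) -> float:
        return self.budget_minutes - self.downtime_minutes

    @property
    def exhausted(self) -> bool:
        return self.remaining_minutes <= 0.0


def launch_decision(budget: ErrorBudget) -> str:
    """Apply the example policy: freeze feature launches once the budget is spent."""
    return "freeze" if budget.exhausted else "allow"


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, downtime_minutes=50.0)
    print(f"Remaining budget: {budget.remaining_minutes:.1f} minutes")
    print("Feature launches:", launch_decision(budget))
```

Keeping the decision in one small function makes the rule explicit and agreed in advance, which is the point of a policy that is meant to reduce confrontation between ops and dev.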
Thorne stressed, however, that what works for Google won’t work for every organization: “SRE needs SLOs with consequences that balance an acceptable level of failure with the necessary cost and speed of delivery”. Exact SLOs and policies must be adequate for the organization (not a copy/paste from Google) and focus on continuously improving customers’ experience, not on setting lofty goals or hard punishments that could be counterproductive. Thorne gave the example of one organization struggling to reduce the processing time of a recommendations system. It turned out that users would only see those recommendations when they came back to the site, on average 6 hours later. An adequate SLO was to process all recommendations within 6 hours, which meant the organization could save the cost of 3 half-time engineers previously working on the perceived “issue” of slow response time.
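A contextual SLO like the 6-hour one in this example is simple to express as a check. The following is a minimal sketch under assumed inputs: each job is represented as a hypothetical pair of request and completion timestamps, and the function names are invented for illustration.

```python
from datetime import datetime, timedelta

PROCESSING_SLO = timedelta(hours=6)  # the 6-hour target from the example above


def meets_processing_slo(requested_at, completed_at, slo=PROCESSING_SLO):
    """True if a recommendation job finished within the SLO window."""
    return completed_at - requested_at <= slo


def slo_compliance(jobs, slo=PROCESSING_SLO):
    """Fraction of (requested_at, completed_at) pairs that met the SLO."""
    if not jobs:
        return 1.0
    met = sum(1 for requested, completed in jobs
              if meets_processing_slo(requested, completed, slo))
    return met / len(jobs)


if __name__ == "__main__":
    start = datetime(2018, 7, 1, 9, 0)  # arbitrary sample timestamp
    jobs = [
        (start, start + timedelta(hours=5)),  # within the 6-hour window
        (start, start + timedelta(hours=7)),  # misses the window
    ]
    print(f"Jobs meeting the 6-hour SLO: {slo_compliance(jobs):.0%}")
```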
Empowering SRE teams to balance workload between everyday (often unplanned) ops work and planned work to reduce toil (aka “firefighting”) is the third key to SRE, said Thorne. At Google, this means at least 50% of SRE effort is spent on project work: early consulting on new systems’ architecture to identify resiliency anti-patterns (and avoid more toil later on), improving monitoring, automating repetitive tasks, or coordinating the implementation of postmortem corrective actions.
Thorne further referenced some clear anti-patterns in implementing SRE, such as simply rebranding the ops team as an SRE team, or hiring SREs without first putting in place the SRE principles and mechanisms (SLOs, error budget policies and balancing workload) needed for success.
These are 5 key steps to get on the right path to SRE, according to Thorne:
- define contextual, customer-focused SLOs
- define sensible error budget policies
- hire (internally or externally) SREs and empower them via leadership support
- allow SREs to fine-tune SLOs and enforce error budget policies
- assign responsibility for the reliability of mission-critical systems to SRE teams, and leave other systems under the responsibility of the corresponding development team
Google developed and expanded the site reliability engineering discipline internally for some years before condensing its lessons learned into the SRE book. Thorne mentioned that an accompanying SRE workbook will be coming out later this month.