Site Reliability Engineering (SRE) Comes of Age in 2022

The site reliability engineer (SRE) role is still gathering steam across organizations. In January 2022, LinkedIn ranked SRE 21st among the jobs with the highest global demand over the past five years. That’s pretty high for such a specific tech role. Looking ahead, the SRE practice appears set to keep gaining adoption as a method to support high availability, reliability and improved digital customer experiences. The SRE approach is also essential to meeting service-level agreements (SLAs) and internal service-level objectives (SLOs).

I recently met with Kurt Anderson, SRE architect, and Emily Arnott, content writer from Blameless, to get their take on the top trends and predictions for the SRE practice in 2022. According to the Blameless team, 2022 will likely see increased adoption of the SRE role throughout companies and across internal departments.

Just as manufacturing sets high standards for its quality assurance, digital applications now require the same high reliability to meet inflated user expectations. To address these new realities, SREs have their work cut out for them: they must work out how to unify disparate DevOps tools and drill further down to unearth specific user experience qualms, all without adding to alert fatigue.

Increasing SRE Adoption

As mentioned above, the SRE role has seen tremendous growth in recent years and doesn’t show signs of slowing. SRE as a practice originated at Google in 2003. Today, they define it simply as “what happens when you ask a software engineer to solve an operational problem.” This equates to maintaining high levels of reliability for digital services by creating error budgets, designing SLAs/SLOs and optimizing process automation over time.

To Arnott, we’re still seeing SRE adoption take off at an exponential rate as more and more companies embrace Google’s SRE doctrine. “During the pandemic, more and more companies were forced to adopt a digital strategy. Reliability has to be fundamental to that strategy.” Arnott foresees SRE becoming a fruitful role at startups, which may initially have been intimidated by the prospect.

“The practice of SRE is a scale issue,” Anderson explained. It can be difficult for time-pressed companies to enact an SRE program when their attention is already so divided. However, as new requirements force warp-speed innovation, we’re seeing greater DevOps adoption. And the SRE role, consequently, is becoming more of an imperative.

Most DevOps professionals would agree—we live in a multi-tool world. For example, multi-cluster management concerns are bringing a plethora of new tools to the fold. And computing environments are just as diverse, often combining multiple clouds and composed of hybrid estates that house various clusters of microservices. Even within the same organizations, different teams may opt for differing DevOps tooling. Anderson believes that SREs can bring “more unification to a fractured DevOps adoption across a single company.”

As companies adopt more software, there’s inevitably some degree of tool sprawl, Anderson said. For example, there may be competing methods to monitor the use of digital services. The platform team may be using Datadog, an app team may be using New Relic and another team may be using Prometheus. These tools don’t speak the same language, and each generates its own kind of logs, which creates silos.

For Arnott, the future involves accepting the innate complexity of the microservices era and using an agnostic layer that can receive data from anywhere to identify meaningful incidents.
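
One way to picture such an agnostic layer is a set of small adapters that translate each tool's alert format into one shared incident record, followed by a filter that keeps only the incidents worth a human's attention. The sketch below is illustrative only: the source names (`platform_monitor`, `app_monitor`, `metrics_monitor`), the payload shapes and the three-level severity scale are all hypothetical and do not reflect the real formats of Datadog, New Relic, Prometheus or any Blameless product.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

# Shared severity scale used after normalization (an assumption of this sketch).
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class Incident:
    """Tool-agnostic record that every adapter produces."""
    source: str
    service: str
    severity: str  # one of the keys in SEVERITY_RANK
    summary: str


def _from_platform_monitor(payload: Dict[str, Any]) -> Incident:
    # Hypothetical shape: {"svc": ..., "priority": "P1" | "P2" | "P3", "title": ...}
    severity = {"P1": "high", "P2": "medium", "P3": "low"}[payload["priority"]]
    return Incident("platform_monitor", payload["svc"], severity, payload["title"])


def _from_app_monitor(payload: Dict[str, Any]) -> Incident:
    # Hypothetical shape: {"application": ..., "level": "critical" | "warning" | "info", "message": ...}
    severity = {"critical": "high", "warning": "medium", "info": "low"}[payload["level"]]
    return Incident("app_monitor", payload["application"], severity, payload["message"])


def _from_metrics_monitor(payload: Dict[str, Any]) -> Incident:
    # Hypothetical shape: {"labels": {"service": ..., "severity": ...}, "description": ...}
    labels = payload["labels"]
    return Incident("metrics_monitor", labels["service"], labels["severity"], payload["description"])


# Registry mapping a source name to the adapter that understands its payloads.
ADAPTERS: Dict[str, Callable[[Dict[str, Any]], Incident]] = {
    "platform_monitor": _from_platform_monitor,
    "app_monitor": _from_app_monitor,
    "metrics_monitor": _from_metrics_monitor,
}


def normalize(source: str, payload: Dict[str, Any]) -> Incident:
    """Convert a tool-specific alert payload into the shared Incident record."""
    try:
        adapter = ADAPTERS[source]
    except KeyError:
        raise ValueError(f"no adapter registered for source {source!r}") from None
    return adapter(payload)


def meaningful(incidents: Iterable[Incident], min_severity: str = "medium") -> List[Incident]:
    """Keep only incidents at or above the given severity."""
    floor = SEVERITY_RANK[min_severity]
    return [i for i in incidents if SEVERITY_RANK[i.severity] >= floor]


if __name__ == "__main__":
    alerts = [
        normalize("platform_monitor", {"svc": "login", "priority": "P1", "title": "Login errors rising"}),
        normalize("app_monitor", {"application": "checkout", "level": "info", "message": "Deploy finished"}),
        normalize("metrics_monitor", {"labels": {"service": "search", "severity": "medium"}, "description": "Latency above target"}),
    ]
    for incident in meaningful(alerts):
        print(incident)
```

Supporting another tool would mean adding one adapter to the registry, while everything downstream keeps working against the single `Incident` shape.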

Observability Gets a Bigger UX Emphasis

The SRE role is definitely a trend that has been growing and will continue to grow in importance, Anderson said. And as more organizations turn digital, it’s necessary; with this shift comes enormous pressure to exceed user experience expectations. “Competition and people’s expectations are so high right now,” said Arnott. “If you are unreliable, it’s unacceptable these days.”

Whereas DevOps focuses more on getting applications into production, SRE takes a more UX-focused approach. Take Netflix, for example, which has invested heavily in a core SRE team that ensures robust and reliable user-facing services. “A key underpinning to keeping our customers happy and streaming is a strong focus on reliability,” according to the Netflix Technology Blog.

To cater to this new paradigm, teams will require a deeper understanding of observability based on particular cohorts of users, said Arnott. SREs require precise user experience contexts, such as an uptick of lags or increased error responses. But they also must discover the contributing factors behind these bugs. “SLOs and error budgets are like taking your temperature—they tell you if something is wrong. They don’t tell you why you have a fever.”
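
To illustrate why cohort-level observability matters, the sketch below computes error rates per user cohort and flags the cohorts that exceed a threshold. It is a minimal example under stated assumptions: the cohort names, the `(cohort, failed)` record shape and the 5% threshold are all hypothetical, chosen only to show how a healthy-looking aggregate can hide one struggling group of users.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def error_rate_by_cohort(requests: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of failed requests for each cohort.

    Each record is a (cohort, failed) pair; this shape is an assumption of the sketch.
    """
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for cohort, failed in requests:
        totals[cohort] += 1
        if failed:
            errors[cohort] += 1
    return {cohort: errors[cohort] / totals[cohort] for cohort in totals}


def cohorts_over_threshold(rates: Dict[str, float], threshold: float) -> List[str]:
    """List cohorts whose error rate is strictly above the threshold, sorted by name."""
    return sorted(cohort for cohort, rate in rates.items() if rate > threshold)


if __name__ == "__main__":
    # Hypothetical traffic: a large healthy cohort and a small unhealthy one.
    sample = (
        [("web", False)] * 97 + [("web", True)] * 3
        + [("mobile_eu", False)] * 8 + [("mobile_eu", True)] * 2
    )
    rates = error_rate_by_cohort(sample)
    overall = sum(1 for _, failed in sample if failed) / len(sample)
    print(f"overall error rate: {overall:.3f}")
    print("cohorts over 5%:", cohorts_over_threshold(rates, 0.05))
```

In this made-up sample the overall error rate stays below the threshold while the smaller cohort is well above it. As the quote above suggests, a number like this only signals that something is wrong for those users; finding the contributing factors still takes further investigation.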

SRE Takes Cues From Manufacturing

As reliability becomes more and more central to a company’s brand presence, it’s only natural that something that important would be elevated into the executive suite, Anderson explained. As a reference example, he pointed to the fact that manufacturing industries often appoint a chief reliability officer. In traditional manufacturing, product lines underwent statistical process control with rigorous testing to ensure quality. Analogous positions exist across industries, such as chief quality officers in financial services.

The takeaway is that as reliability becomes more of a defining competitive metric, tech companies may require a similarly high-level C-suite position to align on that idea and to direct teams to implement SRE practices. Of course, the necessity for such a title will entirely depend on organizational size and makeup.

Greater Reliance on SLOs to Avoid Broken SLAs

A service-level agreement (SLA) is typically a formal contract built around the reliability of a digital service with an external partner, which, if breached, may result in financial penalties or other forms of restitution. This agreement could exist between a software provider and an end user but is more strictly enforced in a B2B context.

On the other hand, a service-level objective (SLO) is more of an internal target. SLOs are often set at a stricter threshold than SLAs. For example, perhaps an SLA ensures a user login fails no more than 10 out of one million attempts, but the SLO is set at two out of one million.

Setting internal SLOs helps keep engineers abreast of modest customer experience issues before they become a major incident that results in fines. “Some companies treat SLOs as just another alert; they should be representative of user experience,” said Anderson. It makes sense that as SRE initiatives begin to mature, companies will slowly start to adopt even more competitive goals to deliver superior user experiences and avoid unsatisfied partners.
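
The login example can be worked through in a few lines of code. The sketch below uses the article's illustrative figures (an SLA of 10 failures per million attempts and an SLO of two per million); the class and function names, and the traffic numbers in the usage section, are hypothetical. It shows how a stricter SLO can signal trouble while the SLA is still intact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureTarget:
    """A reliability target expressed as allowed failures per one million attempts."""
    name: str
    max_failures_per_million: int

    def allowed_failures(self, attempts: int) -> float:
        """Scale the per-million allowance to the observed number of attempts."""
        return attempts * self.max_failures_per_million / 1_000_000


def classify(attempts: int, failures: int, slo: FailureTarget, sla: FailureTarget) -> str:
    """Return 'ok', 'slo_missed' (internal warning) or 'sla_breached' (contract at risk)."""
    if failures > sla.allowed_failures(attempts):
        return "sla_breached"
    if failures > slo.allowed_failures(attempts):
        return "slo_missed"
    return "ok"


def budget_remaining(attempts: int, failures: int, target: FailureTarget) -> float:
    """Fraction of the target's failure allowance still unused (negative once exceeded)."""
    allowed = target.allowed_failures(attempts)
    if allowed == 0:
        return 0.0
    return (allowed - failures) / allowed


if __name__ == "__main__":
    login_sla = FailureTarget("login SLA", max_failures_per_million=10)
    login_slo = FailureTarget("login SLO", max_failures_per_million=2)

    # Hypothetical traffic: 5,000,000 login attempts in the measurement window.
    attempts = 5_000_000
    for failures in (8, 30, 60):
        status = classify(attempts, failures, login_slo, login_sla)
        print(failures, "failures:", status)
```

With five million attempts, the SLO allows 10 failures and the SLA allows 50, so a window with 30 failures misses the internal objective without yet breaching the contract. That gap is the early warning the SLO is meant to provide.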

Final Thought: Don’t Play The Blame Game

A big part of the SRE doctrine is the concept of blameless postmortems. Because when you’re in the middle of an incident, the worst thing to do is point fingers. “Blaming triggers a defense mechanism and when you’re in a defensive mode, you can’t respond well,” Anderson said.

Instead, it’s better to keep an open mind when performing these retrospectives—accept that what happened happened for a reason, and seek to discover that reason to make positive improvements. “What can we change at a systemic level to become more resilient?” inquires Arnott. “What is the origin of that mistake? What information was lacking? What communication should have happened to prevent this?”

Typically, accidents don’t occur intentionally or due to incompetence but rather due to holes in the process. By not playing the blame game, companies are better positioned to fix issues as they progressively release new features. Then, hopefully, with the right SRE approach in place, teams can advance their digital services to meet the high stakes of today’s modern user experience expectations.
