Editor’s note: There’s more to ensuring a product’s reliability than following a bunch of prescriptive rules. Today, we hear from some Google SREs—Vartika Agarwal, Senior Technical Program Manager, Development; Tracy Ferrell, Senior SRE Manager; Mahesh Palekar, Director SRE; and Magi Agrama, Senior Technical Program Manager, SRE—about how to evaluate your team’s current reliability mindset, and what you want it to be.
Having a reliable software product can improve users’ trust in your organization, the effectiveness of your development processes, and the quality of your products overall. More than ever, product reliability is front and center, as outages negatively impact customers and their businesses. But in an effort to develop new features, many organizations limit their reliability efforts to what happens after an outage, and tactically solve for the immediate problems that sparked it. They often fail to realize that they can move quickly while still improving their product’s reliability.
At Google, we’ve given a lot of thought to product reliability, and several of its aspects, such as product or system design, are well understood. What people think about less is the culture and the mindset of the organization that creates a reliable product in the first place. We believe that the reliability of a product is a property of its system architecture, processes and culture, as well as the mindset of the product team or organization that built it. In other words, reliability should be woven into the fabric of an organization, not just the result of a strong design ethos.
In this blog post, we discuss lessons we’ve learned that are relevant to organizational or product leads who can influence the culture of the entire product team, including (but not limited to) engineering, product management, marketing, reliability engineering, and support organizations.
Goals
Reliability should be woven into the fabric of how an organization executes. At Google, we’ve developed a terminology to categorize and describe your organization’s reliability mindset, to help you understand how intentional your organization is in this respect. Our ultimate goal is to help you improve and adopt product reliability practices that will permeate the ethos of the organization.
By identifying these reliability phases, we do not mean to offer a prescriptive list of things to do that will improve your product’s reliability. Nor should they be read as a set of mandated principles that everyone should apply, or be used to publicly label a team, spurring competition between teams. Rather, leaders should consider these phases as a way to help them develop their team’s culture, on the road to sustainably building reliable products.
The organizational reliability continuum
Based on our observations here at Google, there are five basic stages of organizational reliability, drawn from the classic organizational model of absent, reactive, proactive, strategic and visionary. Each phase describes the mindset of an organization at a point in time, is characterized by a series of attributes, and is appropriate for different classes of workloads.
Absent: Reliability is a secondary consideration for the organization.
A feature launch is the key organizational metric and is the focus for incentives.
The majority of issues are found by users or testers. The organization is not aware of its long-term reliability risks.
Developer velocity is rarely exchanged for reliability.
This reliability phase may be appropriate for products and projects that are still under development.
Reactive: Responses to reliability issues and risks are tied to recent outages, with sporadic follow-through, and longer-term investments in fixing system issues are rare.
Teams have some reliability metrics defined and react when required.
They write postmortems for outages and create action items for tactical fixes.
Reasonable availability is maintained through heroic efforts by a few individuals or teams.
Developer productivity is throttled when outages temporarily shift priorities to reliability work. Feature development may be frozen for a short period of time.
This level is appropriate for products/projects in pre-launch or in a stable long-term maintenance phase.
Proactive: Potential reliability risks are identified and addressed through regular organizational processes.
Risks are regularly reviewed and prioritized.
Teams proactively manage dependencies and review their reliability metrics (SLOs).
New designs are assessed for known risks and failure modes early on. Graceful degradation is a basic requirement.
The business understands the need to continuously invest in reliability and maintain its balance with developer velocity.
Most services and products should be at this level, particularly if they have a large blast radius or are critical to the business.
Strategic: Organizations at this level manage classes of risk via systemic changes to architectures, products and processes.
Reliability is inherent and ingrained in how the organization designs, operates and develops software. Reliability is systemic.
Complexity is addressed holistically through product architecture. Dependencies are constantly reduced or improved.
The cross-functional organization can sustain reliability and developer velocity simultaneously.
Organizations widely celebrate quality and stability milestones.
This level is appropriate for services and products that need very high availability to meet business-critical needs.
Visionary: The organization has reached the highest order of reliability and is able to drive broader reliability efforts within and outside the company (e.g., writing papers, sharing knowledge), based on its best practices and experiences.
Reliability knowledge exists broadly across all engineers and teams at a fairly advanced level and is carried forward as they move across organizations.
Systems are self-healing.
Architectural improvements for reliability positively impact productivity (release velocity) due to reduction of maintenance work/toil.
Very few services or products are at this level, and those that are tend to be industry leading.
Where should you be on the reliability spectrum?
It is very important to understand that your organization does not necessarily need to be at the strategic or visionary phase. There is a significant cost associated with moving from one phase to another, and a cost to remaining very high on this curve. In our experience, being proactive is a healthy level to target and is ideal for most products.
To illustrate this point, here is a simple graph of where various Google product teams are on the organizational reliability spectrum; as you can see, it produces a standard bell-curve distribution. While many of Google’s product teams have a reactive or proactive reliability culture, most can be described as proactive. You, as an organizational leader, must consciously decide which level to target based on your product requirements and client expectations.