Risk analysis basics
Before you can evaluate and prioritize your risks, though, you need to come up with a comprehensive list of things to watch out for. In this post, we’ll provide some guidelines for teams tasked with brainstorming all the potential risks to an application. Then, with that list in hand, we’ll show you how to actually analyze and prioritize the risks you’ve identified.
What risks do you want to consider?
When brainstorming risks, it’s important to map risks across several categories: risks related to your dependencies, monitoring, capacity, operations, and release process. For each of those, imagine what will happen if specific failures occur, for example, if a third party is down, or if you introduce an application or configuration bug. Then, when thinking about your measurements, ask yourself:
Are there any observability gaps?
Do you have alerts for this specific SLI?
Do you even currently collect those metrics?
Also be sure to map any monitoring and alerting dependencies. For example, what happens if a managed system that you use goes down?
Ideally, you want to identify the risks associated with each failure point for each critical component in a critical user journey, or CUJ. After identifying those risks, you’ll want to quantify them (a small calculation sketch follows these questions):
What percentage of users would be affected by the failure?
How often do you estimate that failure will occur?
How long would it take to detect the failure?
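One simple way to make these answers comparable across risks is to combine them, along with an estimate of repair time (more on that below), into an expected number of “bad minutes” per year. The rough Python sketch below illustrates the idea; the Risk class, its field names, and every number in it are illustrative assumptions, not a prescribed format.

    from dataclasses import dataclass

    @dataclass
    class Risk:
        name: str
        pct_users_affected: float    # fraction of users impacted (0.0 to 1.0)
        incidents_per_year: float    # how often you estimate the failure occurs
        detect_minutes: float        # estimated time to detect the failure
        repair_minutes: float        # estimated time to repair it

    def bad_minutes_per_year(risk: Risk) -> float:
        """Expected user-impacting bad minutes this risk contributes each year."""
        outage_minutes = risk.detect_minutes + risk.repair_minutes
        return risk.incidents_per_year * outage_minutes * risk.pct_users_affected

    # Hypothetical example: a bad release that hits 5% of users twice a year,
    # takes 60 minutes to detect, and takes 30 minutes to roll back.
    bad_release = Risk("bad release", 0.05, 2, 60, 30)
    print(f"{bad_release.name}: {bad_minutes_per_year(bad_release):.0f} bad minutes/year")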
It’s also helpful to gather information about any incidents in the past year that affected CUJs. Historical data provides more accurate estimates than gut feelings, and actual incidents are a good starting point. For example, you may want to consider incidents such as:
A configuration mishap that reduces capacity, causing overload and dropped requests
A new release that breaks a small set of requests, goes undetected for a day, and is quickly rolled back once detected
A cloud provider’s single-zone VM/network outage
A cloud provider’s regional VM/network outage
An operator accidentally deleting a database, requiring a restore from backup
Another aspect to think about is risk factors: global factors that affect the overall time to detection (TTD) and time to repair (TTR). These tend to be operational factors that increase the time needed to detect outages (for example, when using log-based metrics) or to alert the on-call engineers; a lack of playbooks, documentation, or automated procedures is another example. For instance, you might estimate (a sketch of how to apply these factors follows the list):
An additional 30 minutes (+30m) of estimated time to detection (ETTD) due to operational overload such as noisy alerting
A 10% greater frequency of a possible failure, due to lack of postmortems or action item follow-up
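In code, these global factors adjust the estimates for every risk, not just one. Here’s a small, self-contained sketch in the same spirit as the one above; the constants, function name, and numbers are all illustrative. Because the factors apply across the whole catalog, fixing them (for example, reducing alert noise) pays off for every risk at once.

    # Global risk factors from the examples above; names and numbers are illustrative.
    EXTRA_DETECT_MINUTES = 30     # operational overload such as noisy alerting
    FREQUENCY_MULTIPLIER = 1.10   # lack of postmortems or action-item follow-up

    def adjusted_bad_minutes(pct_users: float, incidents_per_year: float,
                             detect_minutes: float, repair_minutes: float) -> float:
        """Expected bad minutes per year for one risk, with global factors applied."""
        detect = detect_minutes + EXTRA_DETECT_MINUTES
        frequency = incidents_per_year * FREQUENCY_MULTIPLIER
        return frequency * (detect + repair_minutes) * pct_users

    # The hypothetical bad-release risk from the earlier sketch:
    # 5% of users, twice a year, 60 minutes to detect, 30 minutes to roll back.
    print(f"{adjusted_bad_minutes(0.05, 2, 60, 30):.0f} bad minutes/year")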
Brainstorming guidelines: Recommendations for the facilitator
Beyond the technical aspects of what to look for in a potential risk to your service, there are some best practices to consider when holding a brainstorming session with your team.
Start the discussion with a high-level block diagram of the service, its users, and its dependencies.
Get a diverse set of opinions in the room: people in different roles who intersect with the product differently than you do. Also, avoid having only one party speak.
Ask participants for the ways in which each element of the diagram could cause an error to be served to the user.
Group similar root causes together into a single risk category, such as “database outage”.
Try to avoid spending too long discussing failures where the estimated time between occurrences is longer than a couple of years, or where the impact is limited to a very small subset of users.
Creating your risk catalog
You don’t need to capture an endless list of risks; seven to 12 risks per Service Level Indicator (SLI) are sufficient. The important thing is that the list captures high-probability and critical risks.
Starting with real outages is best. Those can be as simple as the unavailability of a depended-on service or network.
Capture both infrastructure- and software-related issues.
Think about risks that can affect the SLI, the time to detect and time to repair, and the frequency (more on those metrics below).
Capture both the risks themselves (in the risk catalog) and the risk factors (global factors). For example, the risk of not having a playbook adds to your time to repair; not having alerts for the CUJ adds to your time to detect; and a log sync delay of x minutes increases your time to detect by the same amount. Then record these global factors and their associated impacts in a global impacts tab.
Here are a few examples of risks (a sketch of how you might catalog them follows the list):
A new release breaks a small set of requests, goes undetected for a day, and is quickly rolled back once detected.
A new release breaks a sizable subset of requests, and there is no automatic rollback.
A configuration mishap reduces capacity, or unnoticed growth in usage hits the capacity limit.
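When you record these, a simple tabular layout goes a long way: one row per risk with its estimates, plus a separate “global impacts” tab for the risk factors. The sketch below shows one possible shape for that catalog; the field names and every number are hypothetical placeholders, not recommended values.

    # One possible shape for the risk catalog; every value here is a placeholder.
    risk_catalog = [
        {"risk": "bad release breaks a small set of requests (undetected for a day)",
         "pct_users": 0.05, "incidents_per_year": 2,   "detect_min": 1440, "repair_min": 30},
        {"risk": "bad release breaks a sizable subset, no automatic rollback",
         "pct_users": 0.30, "incidents_per_year": 1,   "detect_min": 20,   "repair_min": 60},
        {"risk": "config mishap or unnoticed growth exhausts capacity",
         "pct_users": 0.50, "incidents_per_year": 0.5, "detect_min": 15,   "repair_min": 90},
    ]

    # Global factors live in their own "tab" and apply to every row above.
    global_impacts = [
        {"factor": "no playbook for the CUJ",   "extra_repair_min": 20},
        {"factor": "log sync delay",            "extra_detect_min": 5},
        {"factor": "no postmortem follow-up",   "frequency_multiplier": 1.1},
    ]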
Recommendation: Examining the data you get from implementing the SLI will give you a good indication of where you stand with respect to your targets. We recommend starting by creating one dashboard for each CUJ, ideally a dashboard that also includes the metrics you need to troubleshoot and debug problems in achieving your SLOs.
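Alongside the dashboard, it also helps to check numerically how much headroom you have. Here’s a minimal sketch, assuming you can already count good and total events for the SLI over your measurement window; the function name, counts, and target are made up for illustration.

    def sli_attainment(good_events: int, total_events: int) -> float:
        """Fraction of events in the window that met the SLI."""
        return good_events / total_events if total_events else 1.0

    slo_target = 0.999                                  # e.g., 99.9% of requests are good
    attained = sli_attainment(9_982_113, 9_995_002)     # hypothetical counts for one CUJ
    status = "meeting" if attained >= slo_target else "missing"
    print(f"attained {attained:.4%} vs. target {slo_target:.1%} ({status} the SLO)")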
Analyzing the risks
Now that you’ve generated a list of potential risks, it’s time to analyze them: to estimate their likelihood and impact, prioritize them, and potentially find ways to mitigate them. It’s time, in other words, to do a risk analysis.
Risk analysis provides a data-driven approach to prioritizing and addressing the risks that matter most, by estimating four key dimensions for each risk: the above-mentioned TTD and TTR, the time between failures (TBF), and the impact on users.
In Shrinking the impact of production incidents using SRE principles, we introduced a diagram of the production incident cycle. Blue represents when users are happy, and red represents when users are unhappy.
The time that your service is unreliable and your users are unhappy consists of the time to detect plus the time to repair, and it scales with the frequency of incidents (which can be translated to time between failures).
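Put together, those dimensions give you a rough expected cost for each risk, which you can stack up against the error budget your SLO allows. The self-contained sketch below is a simplification with entirely hypothetical risk names and numbers: it multiplies each risk’s detect-plus-repair time by its frequency and the fraction of users affected, then compares the total with the annual budget implied by a 99.9% SLO.

    MINUTES_PER_YEAR = 365.25 * 24 * 60

    def error_budget_minutes(slo: float) -> float:
        """Annual allowance of user-impacting bad minutes implied by an SLO."""
        return (1 - slo) * MINUTES_PER_YEAR

    def expected_bad_minutes(detect_min, repair_min, incidents_per_year, pct_users):
        """Expected bad minutes per year contributed by one risk."""
        return (detect_min + repair_min) * incidents_per_year * pct_users

    # Hypothetical catalog: (name, detect min, repair min, incidents/year, fraction of users)
    risks = [
        ("bad release, undetected for a day", 1440, 30, 2,   0.05),
        ("capacity exhaustion",                 15, 90, 0.5, 0.50),
        ("single-zone outage",                   5, 60, 1,   0.25),
    ]

    total = sum(expected_bad_minutes(d, r, f, p) for _, d, r, f, p in risks)
    budget = error_budget_minutes(0.999)   # a 99.9% availability SLO
    print(f"expected {total:.0f} bad minutes/year against a budget of {budget:.0f}")

If the expected total exceeds the budget, the risks contributing the most bad minutes are the natural candidates for mitigation.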