To minimize the impact of degraded service on your value stream, there should be as little downtime as possible. If it’s taking your team more than a day to restore services, you should consider utilizing feature flags so you can quickly disable a change without causing too much disruption. If you ship in small batches, it should also be easier to discover and resolve problems.
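As a rough illustration of the feature-flag idea, the sketch below guards a new code path behind a flag read from an environment variable. The `FEATURE_` prefix and the names `flag_enabled`, `new_checkout` and `legacy_checkout` are hypothetical and chosen only for this example. Many teams use a dedicated flag service instead.

```python
import os

def flag_enabled(name, default=False):
    """Return True if the (hypothetical) FEATURE_<NAME> environment variable is switched on."""
    raw = os.environ.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "on", "yes"}

def new_checkout(cart_total):
    # Placeholder for the newly shipped behavior.
    return f"new:{cart_total}"

def legacy_checkout(cart_total):
    # Placeholder for the known-good behavior.
    return f"legacy:{cart_total}"

def checkout(cart_total):
    # The flag is read on every call, so changing it takes effect
    # without shipping a new build.
    if flag_enabled("new_checkout"):
        return new_checkout(cart_total)
    return legacy_checkout(cart_total)
```

If the new path misbehaves, switching the flag off sends traffic back to the legacy path while the team works on a durable fix.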
Although mean time to discover (MTTD) is a separate metric from mean time to recovery (MTTR), the time it takes your team to detect an issue directly affects MTTR: the faster your team can spot an issue, the more quickly service can be restored.
As with lead time for changes, you don’t want to rush changes at the expense of a quality solution. Rather than deploy a quick fix, make sure that the change you’re shipping is durable and comprehensive. Track MTTR over time to see how your team is improving, and aim for steady, stable improvement.
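To make tracking MTTR over time concrete, here is a minimal sketch that computes MTTD and MTTR from a list of incidents and groups MTTR by month. The incident records are invented sample data. Measuring recovery from the moment the incident started, rather than from the moment it was detected, is an assumption made for this example, so adjust it to match how your team defines the metric.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, detected, resolved).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 20), datetime(2024, 1, 3, 11, 0)),
    (datetime(2024, 1, 17, 14, 0), datetime(2024, 1, 17, 14, 5), datetime(2024, 1, 17, 14, 50)),
    (datetime(2024, 2, 8, 22, 0), datetime(2024, 2, 8, 22, 45), datetime(2024, 2, 9, 1, 0)),
]

def mean_hours(durations):
    """Average a list of timedeltas and express the result in hours."""
    return mean(d.total_seconds() for d in durations) / 3600

def mttd_hours(incidents):
    # Time from the start of the incident until someone noticed it.
    return mean_hours([detected - started for started, detected, _ in incidents])

def mttr_hours(incidents):
    # Time from the start of the incident until service was restored.
    return mean_hours([resolved - started for started, _, resolved in incidents])

def mttr_by_month(incidents):
    """Group incidents by the month they started in and compute MTTR per month."""
    buckets = defaultdict(list)
    for started, _, resolved in incidents:
        buckets[started.strftime("%Y-%m")].append(resolved - started)
    return {month: mean_hours(durations) for month, durations in sorted(buckets.items())}

print(f"MTTD: {mttd_hours(incidents):.2f} h, MTTR: {mttr_hours(incidents):.2f} h")
print(mttr_by_month(incidents))
```

Because detection time is part of every recovery interval here, a lower MTTD lowers MTTR, which is the relationship described above.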
Change Failure Rate
Change failure rate (CFR) is the percentage of releases that result in downtime, degraded service or rollbacks, which can tell you how effective a team is at implementing changes. As you can see, there is not much distinction between performance benchmarks for CFR:
Elite performers: 0-15%
High, medium and low performers: 16-30%
Change failure rate is a particularly valuable metric because it can prevent a team from being misled by the total number of failures they encounter. Teams that aren’t implementing many changes will see fewer failures, but that doesn’t necessarily mean they’re more successful with the changes they do deploy. Those following CI/CD practices may see a higher number of failures, but if CFR is low, these teams will have an edge because of the speed of their deployments and their overall success rate.
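The difference between counting failures and measuring a failure rate can be shown with a small sketch. The two teams and their numbers are invented for illustration, and `change_failure_rate` and `benchmark_band` are hypothetical helper names. The bands follow the ranges quoted above, and treating anything above 15% as belonging to the second band is an assumption, since the quoted ranges leave a gap between 15% and 16%.

```python
def change_failure_rate(deployments, failures):
    """Percentage of deployments that caused downtime, degraded service or a rollback."""
    if deployments <= 0:
        raise ValueError("need at least one deployment to compute a rate")
    return 100.0 * failures / deployments

def benchmark_band(cfr_percent):
    # Bands as quoted in the article: 0-15% elite, 16-30% for everyone else.
    if cfr_percent <= 15:
        return "elite"
    if cfr_percent <= 30:
        return "high, medium or low"
    return "outside the listed bands"

# Hypothetical teams: one ships rarely, one ships continuously.
teams = {
    "monthly releases": {"deployments": 10, "failures": 3},
    "CI/CD": {"deployments": 200, "failures": 12},
}

for name, counts in teams.items():
    rate = change_failure_rate(**counts)
    print(f"{name}: {counts['failures']} failures, CFR {rate:.1f}% ({benchmark_band(rate)})")
```

In this made-up data the CI/CD team records four times as many failures, yet its CFR is 6% compared with 30% for the team that ships rarely.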
This rate can also have significant implications for the value stream: It can indicate how much time is spent remedying problems instead of developing new projects. Because high, medium and low performers all fall within the same range, it’s best to set goals based on the team and the particular business rather than compare to other organizations.
Putting It All Together With DORA Metrics
As with any data, DORA metrics need context, and one should consider the story that all four of these metrics tell together. Lead time for changes and deployment frequency provide insight into the velocity of a team and how quickly they respond to the ever-changing needs of users. On the other hand, mean time to recovery and change failure rate indicate the stability of a service and how responsive the team is to service outages or failures.