Change Failure Rate: How DevOps Teams can Improve CFR

March 28, 2023

Change is inevitable in development. Whether it’s updating a feature, fixing a bug, or adding new functionality, changes are necessary to keep up with user needs and competition. But with constant change comes risk, which has led DevOps teams to get serious about the metrics they use to track their progress. One of the most important of these metrics is the change failure rate (CFR), which measures the percentage of deployed changes that result in failure.

What Is Change Failure Rate?

CFR is the percentage of changes made to an application or system that result in a failure or a negative impact on its performance. It is a key metric for DevOps teams because it indicates the risk of deploying changes into production while reflecting the quality of those changes. Fewer failures, better quality. It is one of the four key DORA metrics; the other three are lead time for changes, deployment frequency, and mean time to restore (MTTR). CFR is calculated by dividing the number of failed changes by the total number of changes made during a specific time window, counting every kind of change: bug fixes, feature enhancements, infrastructure updates, and more.
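The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the Deployment record, its field names, and the sample data are assumptions made for the example, and what counts as a "failed" change is something each team has to define for itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    # Hypothetical record shape, assumed for this example.
    day: date
    failed: bool  # True if the change caused a failure in production


def change_failure_rate(deployments, start, end):
    """Return CFR as a percentage for deployments with start <= day <= end."""
    in_window = [d for d in deployments if start <= d.day <= end]
    if not in_window:
        return None  # no changes in the window, so the rate is undefined
    failed = sum(1 for d in in_window if d.failed)
    return 100.0 * failed / len(in_window)


# Made-up sample data: four changes in March, one of which failed.
deployments = [
    Deployment(date(2023, 3, 1), failed=False),
    Deployment(date(2023, 3, 8), failed=True),
    Deployment(date(2023, 3, 15), failed=False),
    Deployment(date(2023, 3, 22), failed=False),
]

march_cfr = change_failure_rate(deployments, date(2023, 3, 1), date(2023, 3, 31))
print(f"CFR for March: {march_cfr:.1f}%")  # prints "CFR for March: 25.0%"
```

One failed change out of four gives a CFR of 25%. Returning None for an empty window avoids reporting a misleading 0% for a period in which nothing was deployed.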

Why Does Change Failure Rate Matter? 

CFR is essential because it provides a way to measure the reliability of software development and delivery processes. The higher the CFR, the higher the risk of failures that lead to downtime, lost revenue, and damage to reputation. A lower CFR signals a stable and reliable system, which leads to a better user experience, less churn, and higher revenue.

According to the Accelerate State of DevOps Report by the DevOps Research and Assessment (DORA) team, organizations with a lower CFR have higher DevOps performance. The highest performing organizations have a CFR of less than 15% while underperforming organizations have a CFR of 46-60%. The report also finds that a lower CFR is associated with faster time to restoration, higher deployment frequency, and lower change lead time.

Large organizations with leading DevOps teams have put some interesting practices in place to help manage their CFR, with great results. For instance, Netflix uses chaos engineering, intentionally injecting failures into its systems to test resilience. The famous “Chaos Monkey” tool randomly shuts down servers in Netflix’s production environment to force teams to deal with the unexpected and develop systems that recover automatically. Which practices can your team employ?

Best Practices for Managing CFR

Teams can effectively manage CFR in their software delivery process with the help of these best practices:

  • Implement automated testing: Automated testing can help catch issues early in the development process, reducing the risk of deploying changes that introduce new bugs or errors.
  • Use continuous integration and continuous delivery (CI/CD): CI/CD can help ensure that changes are thoroughly tested and validated before they’re deployed to production, reducing the risk of failures.
  • Implement feature flags: Feature flags can enable teams to release changes in a controlled manner, allowing them to quickly turn off features if they’re causing issues.
  • Implement canary releases: Canary releases expose a change to a small subset of users or servers first, so teams can watch how it behaves before rolling it out to a broader user base, reducing the risk of widespread failures.
  • Use the right metrics and tools to track CFR: DevOps teams should use the right tools to track CFR and identify trends over time. This can help them identify areas of improvement and measure the effectiveness of their strategies. 
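To make the feature flag and canary ideas above concrete, here is a minimal Python sketch of a flag with a kill switch and a percentage rollout. It is an illustration under stated assumptions, not the API of any particular feature flag product: the flag dictionary, its keys, and the function names are all hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: dict, user_id: str) -> bool:
    """Return True if the flag is on for this user."""
    if not flag["enabled"]:  # kill switch: turns the feature off for everyone
        return False
    # Canary-style rollout: only users in the lowest buckets get the change.
    return rollout_bucket(user_id, flag["name"]) < flag["rollout_percent"]


# Hypothetical flag definition: expose the change to roughly 5% of users.
new_checkout = {"name": "new_checkout", "enabled": True, "rollout_percent": 5}

users = [f"user-{i}" for i in range(1000)]
exposed = [u for u in users if is_enabled(new_checkout, u)]
print(f"{len(exposed)} of {len(users)} users see the new code path")

# If the change starts causing failures, flip the kill switch.
new_checkout["enabled"] = False
assert not any(is_enabled(new_checkout, u) for u in users)
```

Hashing the user ID together with the flag name keeps each user in the same bucket between requests, so raising rollout_percent only ever adds users and nobody flips back and forth between the old and new behavior.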

Three Ways to Improve CFR Immediately

While the practices above may take time and effort to implement, there are some things you can start doing now to improve your CFR immediately. 

1. Understand change impact 

Seeing “pre-incidents” and understanding how a change impacts a service’s upstream and downstream dependencies, such as Kubernetes and AWS services, allow developers to easily understand their changes and failures. Imagine having to dig through days or weeks of logs, metrics, and traces when something goes wrong. If you proactively trace forward to infrastructure and related service changes when committing code, it’s much easier and faster to troubleshoot and resolve problems.
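One way to picture change impact is as a walk over a service dependency graph. The Python sketch below assumes you already have a map of which services depend on which; the service names and the depends_on structure are made up for illustration, and a real system would build this map from deployment or tracing data.

```python
from collections import deque

# Hypothetical map: each service -> the services it depends on.
depends_on = {
    "web": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": [],
    "inventory": [],
}


def impacted_by(changed, depends_on):
    """Return every service that directly or transitively depends on `changed`."""
    # Invert the map so we can walk from a service to the services that use it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = set()
    queue = deque([changed])
    while queue:  # breadth-first walk over the inverted graph
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(impacted_by("inventory", depends_on)))
# prints ['catalog', 'checkout', 'web']
```

Running a check like this when a change is committed gives a short list of services to watch, instead of a search through logs after something has already gone wrong.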

2. Improve your MTTR

Deployment Frequency and Lead Time for Changes measure velocity while Change Failure Rate and MTTR measure stability. Reducing the time it takes your team to recover from failures caused by changes and tracking remediation solutions will help your team gain deep insights into the root causes of the failures. The faster you recover, the less time your system is down.
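As a rough illustration of how MTTR can be measured, the Python sketch below averages the time between the start of a failure and the moment service was restored. The incident timestamps are made up, and the (started, restored) tuple layout is an assumption for the example.

```python
from datetime import datetime, timedelta

# Made-up incidents as (failure started, service restored) pairs.
incidents = [
    (datetime(2023, 3, 2, 10, 0), datetime(2023, 3, 2, 10, 45)),   # 45 minutes
    (datetime(2023, 3, 9, 14, 30), datetime(2023, 3, 9, 16, 0)),   # 90 minutes
    (datetime(2023, 3, 20, 8, 15), datetime(2023, 3, 20, 8, 30)),  # 15 minutes
]


def mean_time_to_restore(incidents):
    """Return the average restore time as a timedelta, or None if there are no incidents."""
    if not incidents:
        return None
    total = sum((restored - started for started, restored in incidents), timedelta())
    return total / len(incidents)


mttr = mean_time_to_restore(incidents)
print(f"MTTR: {mttr}")  # prints "MTTR: 0:50:00"
```

The three incidents total 150 minutes, so the mean is 50 minutes. Tracking this number next to CFR shows whether failures are becoming less frequent, less costly, or both.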

3. Track hotfix deployments

If your team constantly relies on hotfixes to remedy issues, a quick fix can unintentionally break other parts of your software. Hotfixes address symptoms, but they don’t focus on the underlying issue or root cause. Even worse, hotfixes often go undocumented. Knowing what changed, when it changed, and who made the change empowers your team to document its resolutions better, which makes troubleshooting the root cause of future bugs easier.
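A simple way to start tracking hotfixes is to record what changed, when, and who made the change, then report how many hotfixes lack a root-cause note. The Python sketch below assumes a hypothetical Change record; the field names and sample data are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Change:
    # Hypothetical record shape: what changed, who changed it, and when.
    sha: str
    author: str
    deployed_at: datetime
    is_hotfix: bool
    root_cause_note: Optional[str] = None  # filled in once the cause is documented


def hotfix_report(changes):
    """Return (hotfix share as a percentage, list of undocumented hotfixes)."""
    hotfixes = [c for c in changes if c.is_hotfix]
    undocumented = [c for c in hotfixes if not c.root_cause_note]
    share = 100.0 * len(hotfixes) / len(changes) if changes else 0.0
    return share, undocumented


# Made-up sample data.
changes = [
    Change("a1b2c3", "dana", datetime(2023, 3, 1, 9, 0), is_hotfix=False),
    Change("d4e5f6", "lee", datetime(2023, 3, 2, 17, 30), is_hotfix=True),
    Change("0a9b8c", "dana", datetime(2023, 3, 6, 11, 0), is_hotfix=False),
    Change("7d6e5f", "sam", datetime(2023, 3, 7, 22, 15), is_hotfix=True,
           root_cause_note="Missing null check in the cart service"),
]

share, undocumented = hotfix_report(changes)
print(f"Hotfix share: {share:.1f}%")  # prints "Hotfix share: 50.0%"
for change in undocumented:
    print(f"Undocumented hotfix {change.sha} by {change.author} at {change.deployed_at}")
```

A rising hotfix share, or a growing list of undocumented hotfixes, is a signal that symptoms are being patched without the root cause being addressed.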

 

Being able to manage CFR is critical to achieving DevOps success. How your team manages incidents that lead to failure is just as important. Proactively monitoring how broken code or configuration leads to failures can provide more holistic insights and can show where in the development and release cycle your team needs to improve. 

About Author
Mary Chen
Sr. Director, Product Marketing