Branching Out from MTTR: Tracking MTTD, MTBF, and MTTA

November 21, 2022

SLA credits and negative customer sentiment. Lost deals and renewals. Team morale in a time when work/life balance is top of mind. The cost of incidents, both direct and indirect, can be difficult to truly measure. I imagine the recent Taylor Swift ticket debacle will leave Ticketmaster learning this all too well.

Any organization delivering a SaaS service utilizing a complex system of cloud providers and technologies knows that eliminating incidents altogether is a pipe dream. That’s why it’s critical to focus on reducing the impact of incidents when they occur. I recently wrote an article highlighting Incident Management metrics such as MTTR as a great starting point for many teams. In this article, I’m going to cover three additional Incident Management metrics and how they can help you reduce incident costs.

Incident Management Metric: Mean-Time-To-Detect (MTTD)

MTTD measures how long it takes, on average, to detect that an incident is occurring over a given time window. Simply put, just because you don’t know a problem exists doesn’t mean it isn’t impacting your customers. The faster you know about the problem, the faster you can diagnose and resolve it.

For example, imagine your team had 4 incidents over a 1-week time window. Three of those incidents were each detected within 10 minutes of the problem occurring (30m total), but one took 4 hours to detect because it didn’t trigger an alert until there was a significant increase in system load. MTTD for that week would be calculated as 4h30m / 4, which works out to 67.5 minutes.
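The arithmetic in this example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `mean_time_to_detect` is hypothetical, and it assumes each of the three fast detections took the full 10 minutes.

```python
from datetime import timedelta


def mean_time_to_detect(detection_delays):
    """Average the gap between a problem starting and it being detected."""
    if not detection_delays:
        raise ValueError("at least one incident is needed to compute MTTD")
    # Summing timedeltas needs an explicit timedelta() start value.
    return sum(detection_delays, timedelta()) / len(detection_delays)


# Assumed inputs from the example: three 10-minute detections and one 4-hour detection.
delays = [timedelta(minutes=10)] * 3 + [timedelta(hours=4)]

mttd = mean_time_to_detect(delays)
mttd_minutes = mttd.total_seconds() / 60
print(f"MTTD: {mttd_minutes} minutes")  # MTTD: 67.5 minutes
```

Notice how a single slow detection drags the average well above the typical 10 minutes, which is why it is worth looking at the individual delays alongside the mean.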

To reduce MTTD, many teams turn to alerting best practices, giving engineers real-time visibility into inbound customer complaints, and dogfooding.

Incident Management Metric: Mean-Time-Between-Failures (MTBF)

MTBF is best described as the average amount of uptime between failures over a given time window. It’s one of the few Incident Management metrics where a higher average duration is desired as it means better service reliability. Your organization should define failure and consider service complexity when identifying an attainable uptime goal.

For example, suppose there were 3 degraded states and 0 complete outages observed over a 1-week time window (168h). The total amount of time between degradation and a return to full health across all 3 incidents was 5 hours. MTBF would be calculated as (168-5) / 3, which works out to approximately 54 hours. If we define failure as a complete outage instead of a degraded state, then our attainable goal would be closer to 168 hours.
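As a rough sketch, the same calculation can be written in Python. The function name `mean_time_between_failures` is hypothetical, and returning `None` when no failures occurred is an assumption made here, since the average is undefined without at least one failure.

```python
def mean_time_between_failures(window_hours, downtime_hours, failure_count):
    """Average uptime per failure over the window, in hours."""
    if failure_count == 0:
        # Assumption: with no failures in the window there is nothing to average.
        return None
    return (window_hours - downtime_hours) / failure_count


# Counting degraded states as failures: 3 failures, 5 hours unhealthy in a 168-hour week.
mtbf_degraded = mean_time_between_failures(168, 5, 3)
print(f"MTBF (degraded states count): {mtbf_degraded:.1f} hours")  # 54.3 hours

# Counting only complete outages as failures: none occurred this week.
mtbf_outages = mean_time_between_failures(168, 0, 0)
print(f"MTBF (outages only): {mtbf_outages}")  # None
```

The two calls show why the definition of failure matters: the same week produces a very different number depending on what your organization counts.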

The goodwill you build with customers can be entirely wiped out by one bad incident. Focusing on MTBF will help you build more customer goodwill before the next incident occurs.

Incident Management Metric: Mean-Time-To-Acknowledge (MTTA)

MTTA measures the average amount of time it takes for an alert to be acknowledged by the proper team member(s) and for work to begin on diagnosing the issue. Tracking this Incident Management metric can help identify areas of improvement related to alert fatigue, potential gaps in on-call coverage, or how alerts are surfaced to team members.

For example, suppose there were 6 incidents in the last week. Five of those incidents were each acknowledged in 4 minutes. Unfortunately, the on-call engineer missed the first page for the 6th incident, so it took 9 minutes in that case. MTTA would be calculated as ((5×4)+9) / 6, which works out to just under 5 minutes.
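In practice, MTTA is usually derived from a pair of timestamps per page rather than from pre-computed durations. The sketch below assumes that shape of data; the timestamps, the `ack_minutes` helper, and the variable names are all hypothetical and chosen only to reproduce the numbers in the example.

```python
from datetime import datetime, timedelta
from statistics import mean


def ack_minutes(alerted_at, acknowledged_at):
    """Minutes between an alert firing and someone acknowledging it."""
    return (acknowledged_at - alerted_at).total_seconds() / 60


# Hypothetical pages: five acknowledged after 4 minutes, one after 9 minutes.
start = datetime(2022, 11, 14, 9, 0)
pages = [
    (start + timedelta(hours=i), start + timedelta(hours=i, minutes=4))
    for i in range(5)
]
pages.append((start + timedelta(hours=5), start + timedelta(hours=5, minutes=9)))

delays = [ack_minutes(alerted, acked) for alerted, acked in pages]
mtta = mean(delays)
slowest = max(delays)

print(f"MTTA: {mtta:.2f} minutes")  # MTTA: 4.83 minutes
print(f"Slowest acknowledgement: {slowest:.0f} minutes")  # 9 minutes
```

Reporting the slowest acknowledgement next to the mean makes a missed page stand out instead of being smoothed away by the average.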

While MTTA requires an alert for a starting point, that doesn’t mean it only comes from a robust alerting system. In many cases, customers may alert you to a problem before an alert triggers. Therefore, evaluating how your Customer Success team engages with your customers and surfaces issues to the engineering team during an incident can also potentially reduce MTTA.

The Root Of The Problem

Incidents will continue to occur as long as the 4 C’s (changes to Code, Configs, Customer behavior, and/or Clouds) exist. Rather than attempting to eliminate incidents altogether, focus on Incident Management metrics to reduce the impact incidents have, which in turn lowers their direct and indirect costs to your organization.

A recent DEJ survey showed that 76% of all performance problems can eventually be traced back to changes in an environment. Yet 67% of responding organizations stated they didn’t have the ability to identify change(s) in their environment. At CtrlStack, we’re drastically reducing the time it takes to diagnose incidents by connecting cause and effect. If you’d like to learn how CtrlStack can help your team significantly reduce the direct and indirect costs of incidents, then sign up for our beta today!

About the Author
Jason Goocher
Founding CS Engineer