Why Incident Management Needs Modernization
Incidents hit organizations hard. Downtime or service unavailability can result in lost revenue, reputational damage, and customer churn. A study by Splunk found that the median time to resolve an incident (MTTR) was more than five hours, and that service downtime can cost organizations an average of $7.9 million a year. The hard business cost of service-impacting issues will continue to climb as cloud infrastructure grows more complex. Sadly, soft costs, such as engineering burnout and change anxiety, are also on the rise.
The biggest problem in incident management is that the overall process and tooling are still very manual — from learning that there’s a problem, to getting everyone communicating, to triaging. On top of that, we throw everything at DevOps and SREs during incident response: various tools, data, and runbooks or guides. As a result, SREs and DevOps engineers lose a lot of time to context switching, trying to connect the dots (Is Service A talking to Service B?), and translating the issue to a specific team or service owner.
Incident management is still very manual
The incident management process is usually drawn from industry best practices which are adopted by organizations to fit their needs. Defining a clear incident management workflow is critical to resolving incidents faster and reducing costs. A thoughtful process should align teams to each stage of an incident and improve Mean Time to Resolution (MTTR). But current tools and disconnected data make it hard to resolve incidents quickly.
At each stage of an incident response (IR), responders must juggle many toolsets and data sources, which may be incomplete or simply wrong. You may not have sufficient change data (events) or an up-to-date diagram of your overall architecture; even runbooks may be too sparse to troubleshoot efficiently. Common pain points that SREs and DevOps teams experience on a day-to-day basis are:
- Context switching and change control
- Lack of training on modern-day tech like Kubernetes
- Knowledge capture and discovery
- Communication (people don’t communicate during IR)
- Useless runbooks when triaging production incidents
During IR (especially when you’re on-call), these pains are amplified, causing stress, anxiety, and eventual burnout.
At each stage, responders lean on different tools and data. The typical stages of incident response are:
- Detect: Know the problem BEFORE your customers do.
- Assess: Determine how bad this problem really is (What is the blast radius?) and determine the priority for the issue.
- Mitigate: Sh*t happens, stop the bleeding. Do a hotfix or rollback and get as close to operational as possible.
- Diagnose and fix: Dig deeper to find the root cause of the problem and fix it.
- Postmortem: Document what went well and what could have gone better. It's a learning process and an opportunity to improve your process, not a blame game.
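The stages above form a simple ordered lifecycle. As a minimal sketch (all names here are illustrative, not any real tool's API), an incident can be modeled as a record that advances through the stages while logging what was done at each one:

```python
from dataclasses import dataclass, field

# The five IR stages described above, in order.
STAGES = ["detect", "assess", "mitigate", "diagnose", "postmortem"]

@dataclass
class Incident:
    title: str
    stage: str = STAGES[0]
    log: list = field(default_factory=list)

    def advance(self, note: str) -> None:
        """Record what happened at the current stage, then move to the next one."""
        self.log.append((self.stage, note))
        i = STAGES.index(self.stage)
        if i < len(STAGES) - 1:
            self.stage = STAGES[i + 1]

inc = Incident("CreateCluster failures in Europe")
inc.advance("Probe failures confirmed in several zones")
inc.advance("All cluster creation in Europe failing; blast radius is one continent")
print(inc.stage)  # mitigate
```

Keeping the per-stage log alongside the stage pointer is the point: it is exactly the timeline and context that otherwise has to be reconstructed by hand for the postmortem.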
Real-world example: Service fault at Google
To illustrate how incident response works in practice, let’s look at a postmortem shared by Google. This example shows what happens when a team of experts tries to debug a system with many interactions, and no single person can grasp all the details. Even with several teams of experts, the GKE CreateCluster outage took 6 hours and 40 minutes to fix.
Google Kubernetes Engine (GKE) is a Google-managed system that creates, hosts, and runs Kubernetes clusters for users. This hosted version operates the control plane, while users upload and manage workloads in the way that suits them best.
When a user first creates a new cluster, GKE fetches and initializes the Docker images their cluster requires. Ideally, these components are fetched and built internally so the team can validate them. But because Kubernetes is an open source system, new dependencies sometimes slip in through the cracks.
An on-call SRE for GKE declared an incident when verified CreateCluster probe failures were occurring across several zones; no new clusters were being successfully created.
Here’s the timeline of the incident:
- 7 a.m. (Assessed impact). On-call SRE confirmed users were affected by the outage.
- 9:10 a.m. So far, the incident responders knew the following:
- Cluster creation failed where nodes attempted to register with the master.
- The error message in cluster startup logs indicated the certificate signing module as the culprit.
- All cluster creation in Europe was failing; all other continents were fine.
- No other GCP services in Europe were seeing network or quota problems.
- 9:56 a.m. (Found possible cause). Two team members identified a rogue image from DockerHub.
- 10:59 a.m. (Bespoke mitigation). Several team members worked on rebuilding binaries to push a new configuration that would fetch images from a different location.
- 11:59 a.m. (Found root cause and fixed the issue). The on-call SRE disabled GCR caching and purged a corrupt image from the European storage layer.
Google’s diagnosis approach
The team had several documented escalation paths, which helped to quickly get domain experts on board and communicating. But logging was insufficient for diagnosis. In fact, the error message found in the logs at the start sent the team down the wrong path: corruption on DockerHub turned out not to be the real issue. Meanwhile, a handful of first responders pursued their own investigations without shared context or coordination.
A better approach: Incident diagnosis with a DevOps graph
In this case, the responders would have benefited from a platform that connects cause to effect to find the root cause quickly. Such a platform would present the DevOps graph as an architecture topology showing the relationships between entities — such as a Docker image running in a Kubernetes cluster that’s pulled from DockerHub or, in this case, Google Container Registry (GCR). You would also see where in the timeline the Docker configuration changed and what changes were made, with no need to manually construct the timeline. Seeing the configuration changes and the relationships between entities, in relation to the incident, all in one place would have expedited diagnosis and resolution.
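At its core, a DevOps graph is just entities plus relationships plus change events. The following is a hypothetical sketch of the idea, not CtrlStack's actual implementation: node names (`gke_cluster`, `docker_image`, `gcr_europe`) and the change record are invented for illustration. Walking from the failing entity along its dependencies surfaces any entity with a recent change — the likely cause.

```python
# Entities as nodes, "depends on" relationships as directed edges.
graph = {
    "gke_cluster": ["docker_image"],   # cluster runs this image
    "docker_image": ["gcr_europe"],    # image is pulled from this registry
    "gcr_europe": [],
}

# Change events the platform would have captured automatically, keyed by entity.
changes = {
    "gcr_europe": "cached image corrupted at 03:40",
}

def trace(start):
    """Walk from the failing entity toward its dependencies,
    collecting every recent change found along the way."""
    seen, stack, findings = set(), [start], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in changes:
            findings.append((node, changes[node]))
        stack.extend(graph.get(node, []))
    return findings

print(trace("gke_cluster"))  # [('gcr_europe', 'cached image corrupted at 03:40')]
```

With this structure, the question "what changed upstream of the failing service?" becomes a single graph traversal instead of hours of cross-team log spelunking.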
How can CtrlStack help?
Companies have many monitoring and observability tools and talented teams to keep systems reliable and operational. And yet, the median time to respond and resolve an incident is hours or even days. Our mission is to help teams reduce their cognitive overload and troubleshooting time by 50% by better preparing them with context and dashboards that focus on what matters most.
With a platform like CtrlStack, all the actions taken by team members in the system are automatically captured. This ensures that anyone joining mid-incident has the right context and can see what has been done, when, and by whom. Nothing is lost in translation. Most importantly, teams don’t have to spend more time writing an epic postmortem after having spent almost 7 hours resolving the issue. Major incidents requiring a coordinated response across multiple teams are stressful enough; the additional workload of postmortems only adds to that stress.
Want to learn more about CtrlStack and how we make it easy for you to troubleshoot production incidents? Check out this article and the accompanying demo video.
Feel free to reach out for a demo to learn more about how CtrlStack can help you reduce your MTTR. Our experts will walk you through the features that can uplevel your team’s skill set and close the knowledge gap.