Techstrong Group and CtrlStack: Minimizing the Troubleshooting Tax

January 13, 2023

In a panel discussion on the troubleshooting tax (how much time each engineer spends troubleshooting), CtrlStack’s Founder and CEO Dev Nag and Head of Customer Success Jason Goocher chatted with Mike Rothman about ways to minimize that tax and increase DevOps productivity.

The conversation centered around changes—code or configuration changes—happening in DevOps environments and how CtrlStack brings a new approach to observing cloud applications and infrastructure to reduce MTTR.

People do far more changes than they realize

Mike: In our recent survey, 24% of respondents said they’re making changes—code or configuration changes—daily. Are they creating problems?

Dev: There are two things to take away from the charts. The first is that these are self-reported numbers, meaning this is what people recall. They say to themselves, “I think I made two changes yesterday.” But if you ask them to think about all the times they changed a feature flag, scaled up the system, performed a Kubernetes action, provisioned something, or changed a DNS or network rule, they’d realize they made more changes than that.

People make far more changes than they realize, but they’re not sure what to categorize as a change. That’s the first problem. The second problem is that the trends are pushing change velocity up. We have more systems, more complexity, more dynamism, and more external third-party services we rely on, like Databricks, Confluent, or Snowflake. When those third-party systems go down, you go down too, because they’re part of your ecosystem. So when they make a change, you’re effectively making a system change as well. Those things are hard to quantify. So I think we both underreport changes and underestimate how quickly change velocity is increasing. As a result, we see the outcomes or the symptoms, but we don’t understand the causes.

Tip #1: Improve change visibility

Mike: What are the first two or three things that you would recommend folks do to streamline the path to this visibility?

Dev: The first thing to do is make sure you instrument the application, the infrastructure, and the services. Make sure all of those parts of your system are tagged and able to send data into a centralized data system. That gives you the group of telemetry known as MELT: metrics, events, logs, and traces. You need all of those data types, and others, working together, because each one projects the data differently and shows you a different aspect of the system. I think we’re pretty mature now as an industry on instrumentation. The hard part is connecting them. Which log goes with which metric, and how do they tie together? We have an operational issue.

The second thing I would ask is: for folks making changes, what does the change process look like? Is there organizational consistency in how changes are recorded?

The worst case is when folks make changes without telling anyone, and then a day later say on Slack, “Oh yeah, I did this yesterday.” There’s no way to track that in real time. Plus, no one knows how to interpret that sentence in Slack. It’s not tied to the exact entity, the exact time, or the exact type of change. It’s just too high level and too vague, so I think getting that precision and that kind of real-time process around change management is super important.
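
To make that precision concrete, here is a minimal Python sketch of a change record that carries the exact entity, time, and type of change. The `ChangeEvent` schema, its field names, and the `record_change` helper are hypothetical and chosen for illustration; they are not a CtrlStack format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


# Hypothetical schema: the field names are illustrative assumptions.
@dataclass(frozen=True)
class ChangeEvent:
    entity: str          # what was changed, e.g. a service or host name
    change_type: str     # e.g. "feature_flag", "scale", "dns", "deploy"
    timestamp: datetime  # when the change was applied (timezone-aware)
    author: str          # who made the change
    summary: str         # short human-readable description

    def to_json(self) -> str:
        """Serialize the event so it can be sent to a central store."""
        record = asdict(self)
        record["timestamp"] = self.timestamp.isoformat()
        return json.dumps(record, sort_keys=True)


def record_change(entity, change_type, author, summary, when=None):
    """Build a change event at the moment the change is made."""
    if when is None:
        when = datetime.now(timezone.utc)
    if when.tzinfo is None:
        # A naive timestamp is ambiguous across teams and regions.
        raise ValueError("timestamp must be timezone-aware")
    return ChangeEvent(entity, change_type, when, author, summary)


event = record_change(
    "checkout-service", "feature_flag", "alice", "enabled new_cart flag"
)
print(event.to_json())
```

Compared with a Slack message a day later, a record like this is tied to one entity and one timestamp, so it can be lined up against metrics and logs from the same system.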

Jason: You’ve got to have consistency in the way you instrument your application and build on it. When you are launching a product and want to get your platform out to end users, it’s very important to have a good source of truth across all of your teams and the different processes within your application. That way, it’s much easier to pinpoint problems when they happen.

Tip #2: Centralize production change

Mike: 15% of respondents in the survey said they’re centralizing production changes across all applications and infrastructure. 45% are moving in that direction, so they understand that’s where they want to be. As you scale up your environment and more teams embrace the DevOps approach, you’re going to have more changes. That means more issues and more time spent troubleshooting. What can we do to bring along the 40% that don’t yet understand the importance of centralizing telemetry data, including change events?

Jason: When you have many different applications and services, it is very hard to manage them across several different products and UIs. Expecting a developer or SRE to know where that information is captured and kept is unrealistic, especially with so much context switching. Connecting all those dots and building that model in your head is taxing when you don’t have a good understanding of where those paths lead in the first place.

Dev: We suggest trying to put as much burden on the system, and as little on the human operators, as possible to connect the dots.

Tip #3: Clean up the diagnosis phase of MTTR

Mike: 26% of folks in our survey said they have significant downtime—4-24 hours. How do we improve the outage scenario and improve MTTR?

Jason: There’s a lot of time wasted when it comes to MTTR. You’ve got the actual diagnosis part of it, but then there’s the time lost getting the pages, getting the right people involved, making sure you’re communicating in the right places, and knowing where to go to find the information.

Just cleaning up that process can shrink MTTR. One of the keys to shrinking this number is giving folks a good starting point for their troubleshooting journey.

Tip #4: Use software that actually mimics how you’d investigate an issue

Mike: What is unique about how the CtrlStack platform helps teams identify which change correlates with which specific outage, so folks have a good starting point?

Dev: If you look at what people do when they troubleshoot, they’re looking at two big buckets of possible cause: the when and the where.

The when asks: What happened recently? What happened around the same time? What’s a correlated event? The where asks: What else happened nearby? What was changed on that same machine, or maybe one level upstream or downstream?

CtrlStack is built around this detective work of troubleshooting. It reflects both the when and the where in the data itself. The when is pretty easy. A lot of events have timestamps, so you can correlate them easily. That has been around for a while. The hard part is the where. Tracking what’s connected to what in real time, and making that the foundation for a very efficient search through the entire application, is our unique capability.

We actually build the connectivity between all the data, between the metrics and the logs, saying here’s how they connect together and here’s how the machines connect. Going from a metric on one service pool to a metric on a different service pool that talks to the first one is incredibly hard in a traditional metrics database.

We built that whole model from scratch to let you search from a symptom back to the root cause through this sort of graph. We try to make that entire process much faster by getting the data in place first and then automating the investigation on top of it.
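
As a rough illustration of combining the when and the where, the Python sketch below filters recorded changes by a time window before the symptom and by hop distance in a dependency graph. Everything here is an assumption made for the example: the adjacency-dict graph, the `(entity, time, description)` tuples, the one-hour window, the two-hop limit, and the ranking rule. It is not CtrlStack’s implementation.

```python
from collections import deque
from datetime import datetime, timedelta


def nearby_entities(graph, start, max_hops):
    """Breadth-first search: map each entity within max_hops of start
    to its hop distance (the "where")."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def candidate_causes(changes, graph, symptom_entity, symptom_time,
                     window=timedelta(hours=1), max_hops=2):
    """Return changes that are both recent (the "when") and nearby
    (the "where"), ranked by fewest hops, then most recent."""
    hops = nearby_entities(graph, symptom_entity, max_hops)
    candidates = []
    for entity, changed_at, description in changes:
        age = symptom_time - changed_at
        # Keep only changes made before the symptom and inside the window.
        if entity in hops and timedelta(0) <= age <= window:
            candidates.append((hops[entity], age, entity, description))
    candidates.sort()
    return [(entity, description) for _, _, entity, description in candidates]


# Hypothetical topology: each key lists the entities it is connected to.
graph = {
    "web": ["api"],
    "api": ["web", "db", "cache"],
    "db": ["api"],
    "cache": ["api"],
    "billing": ["ledger"],
    "ledger": ["billing"],
}
symptom_time = datetime(2023, 1, 12, 12, 0)
changes = [
    ("db", datetime(2023, 1, 12, 11, 50), "schema migration"),
    ("api", datetime(2023, 1, 12, 11, 30), "feature flag flipped"),
    ("billing", datetime(2023, 1, 12, 11, 55), "deploy"),
    ("cache", datetime(2023, 1, 12, 9, 0), "scaled down"),
]
for entity, description in candidate_causes(changes, graph, "web", symptom_time):
    print(entity, description)
```

In this toy data the billing deploy is recent but unconnected to the symptom, and the cache change is connected but three hours old, so only the api and db changes remain as starting points.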

Tip #5: Start by looking at the data between the data

Mike: How do we start to really think about the tooling that we’ll need in order to get this level of visibility?

Dev: 99% of all data that’s collected, paid for, and stored is never queried or analyzed. So we’re paying a lot of money for a lot of data that doesn’t actually get used. The next level of better algorithms, in part with generative AI and other kinds of techniques, is coming online right now to tell you which data actually matters for your operations and which doesn’t.

I think there’s actually a missing category, which is the data between the data. This is something we have not really collected or monetized before as an industry, but tracking how these things connect together is the next level of value that makes data more efficient and more usable.

So who’s making the connections now? It’s actually us. It’s the people on your team who are making connections and paying the cognitive load. So we’re saying it’s just a little bit of extra data, the change data and the connections, that can make the whole process much cheaper and get your team back to what they should be working on.

Jason: If you don’t know how to use that data and don’t know when it’s relevant, it’s just useless. You don’t know when to go back and reference it, right? It’s like muscle memory.

You have to use things regularly to know where you should be using them in the future. If you’re not using your metrics enough and you don’t know how to use them to monitor the true health of your environment, you won’t go back and reference them.

To make things easier, you really need something that lets you capture everything but also tells you when you should be utilizing those different data sources.

This session has been edited for length and clarity. There’s more to the conversation. Catch the whole discussion here:
