Introducing CtrlStack: Connecting Cause and Effect for Cloud Apps
DevOps is all about cause and effect. And cause and effect in modern cloud applications have scaled far beyond our ability to connect and understand them. Today, I’m proud to announce the beta preview of CtrlStack — the first observability platform which connects cause to effect for cloud applications in order to shorten and prevent downtime.
OBSERVABILITY AT A CROSSROADS
Observability is at something of a crossroads. We have more data types, better scalability, and more observability vendors to choose from than ever before. I myself founded a leading observability company called Wavefront, which was adopted by companies like Microsoft, Workday, Snowflake, and Lyft. Wavefront was ultimately acquired by VMware, where it became the heart of their observability portfolio. But as I saw firsthand over the last decade, all of this data and choice didn’t actually lead to the outcomes we expected.
We thought managing cloud apps would get easier. We thought downtimes would get shorter. We thought knowledge transfer would get more efficient. But that’s not what happened. The architectural forces driving modern applications utterly swamped whatever advantages metrics, events, and traces could eke out.
In fact, the data shows that companies which adopt more DevOps techniques actually have a higher engineering burden than those which adopt fewer. Of the companies which deploy at least once a day, a stunning 73% say that at least half of their engineering resources are spent on troubleshooting/debugging, and not feature development or strategic velocity.
Why is this happening?
Because the observability solution space just doesn’t match the problem space. The problem space, meaning the apps that we’re running, consists of complex distributed systems with highly dynamic interactions, decentralized change management, and black-box 3rd-party dependencies. The solution space, on the other hand, is full of disconnected measurements which optimize for data depth within a component rather than modeling the relationships between components. This matched the problem space really well ten years ago; today, not so much.
WE SHAPE OUR TOOLS AND OUR TOOLS SHAPE US
The best way to understand what the solution space could be is to watch what operators actually do to make up for the shortcomings in these tools. We’re the adaptable element in any workflow, and how we fill the gaps in our current tools often leads the way to the next insight, the next set of tools.
We all do two really important things in operating cloud apps which observability doesn’t support well: we think about the when, and the where, to help model cause and effect.
When something goes wrong, we ask when it happened and what happened just before…especially any changes that we made. And we ask where it happened, and what else happened nearby in the system…especially any changes that we made. See a pattern? It turns out that the vast majority of operational incidents (about 75%) are caused by our own changes, and we’re abundantly aware of our own culpability. The fault is not in our stars, but in that last kubectl command.
Both of these processes — understanding the when and the where — are fractured in legacy observability tools. It’s hard to figure out what just happened, because that information is scattered across team chat channels, application and service logs, random bash histories, ticketing systems, internal documentation, and sometimes none of the above. Just piecing it together is a crime scene investigation.
And it’s hard to figure out what happened nearby within the system, because the dependency architecture can be incredibly complex and dynamic. It can take a long time to learn it for a single static app, let alone keep up with changes. (Traces are great for request traffic, but not for other types of network dependencies, and certainly not for the much larger universe of non-network dependencies.)
CtrlStack explicitly models both of these workflows with a unified change timeline and a dependency graph. We capture changes at many different levels of cloud applications with lightweight integrations, and then merge them into a single, real-time, filterable, high-performance timeline. At the same time, we build a real-time dependency graph of the entire system every few seconds and then situate every change within that graph, along with traditional observability data. The timeline, and the graph — the when, and the where.
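To make the idea of a unified change timeline concrete, here is a minimal Python sketch. It is not CtrlStack’s implementation; the `ChangeEvent` fields and the `merge_timelines` and `filter_timeline` helpers are hypothetical names chosen for illustration. It assumes each source (a CI system, a shell history, and so on) already yields its changes in time order.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """One change somewhere in the system (hypothetical schema)."""
    timestamp: float   # seconds since epoch
    source: str        # where the change was captured, e.g. "ci" or "kubectl"
    component: str     # the part of the system that was changed
    description: str


def merge_timelines(*streams):
    """Merge per-source event lists, each already sorted by time, into one timeline."""
    return list(heapq.merge(*streams, key=lambda event: event.timestamp))


def filter_timeline(timeline, component=None, start=None, end=None):
    """Keep events matching the optional component and time bounds."""
    kept = []
    for event in timeline:
        if component is not None and event.component != component:
            continue
        if start is not None and event.timestamp < start:
            continue
        if end is not None and event.timestamp > end:
            continue
        kept.append(event)
    return kept


# Illustrative data only.
deploys = [
    ChangeEvent(100.0, "ci", "checkout", "deploy v2"),
    ChangeEvent(300.0, "ci", "cart", "deploy v5"),
]
shell = [ChangeEvent(200.0, "kubectl", "checkout", "scale to 3 replicas")]

timeline = merge_timelines(deploys, shell)
checkout_changes = filter_timeline(timeline, component="checkout")
```

Merging sorted streams keeps the combined timeline ordered without re-sorting everything on each update, which is one plausible way to keep such a view cheap to maintain.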
With these two data structures, mimicking the (de facto) human process, the platform can do some incredible things. One of our key flows is right-clicking on any spike in any metric chart and seeing the most likely root causes of that behavior change in just a few seconds.
In one of our demos, we show a single metric chart with two metrics spiking at the exact same time because of two completely different root causes — and our automated troubleshooting can instantly differentiate them. Using the when, and the where.
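As a rough illustration of how the when and the where can separate two simultaneous spikes, the Python sketch below ranks recent changes by how close they are in time and in dependency hops to the component that spiked. The scoring formula, the ten-minute window, and the names `hop_distances` and `rank_candidates` are assumptions made for this example, not a description of CtrlStack’s actual algorithm.

```python
from collections import deque


def hop_distances(graph, start):
    """Breadth-first search: hops from `start` to everything it depends on."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in dist:
                dist[dep] = dist[node] + 1
                queue.append(dep)
    return dist


def rank_candidates(changes, graph, spike_component, spike_time, window=600.0):
    """Rank changes made shortly before the spike on the component or its dependencies.

    `changes` is a list of (timestamp, component, description) tuples.
    More recent and closer changes score higher (an assumed heuristic).
    """
    dist = hop_distances(graph, spike_component)
    scored = []
    for timestamp, component, description in changes:
        age = spike_time - timestamp
        if component not in dist or age < 0 or age > window:
            continue  # unrelated component, after the spike, or too old
        score = (1.0 - age / window) / (1 + dist[component])
        scored.append((score, description))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [description for _, description in scored]


# Illustrative data only: component -> components it depends on.
graph = {
    "checkout": ["payments-db"],
    "search": ["search-index"],
}
changes = [
    (940.0, "payments-db", "schema migration"),
    (950.0, "search-index", "reindex job started"),
]

# Two metrics spike at the same moment, on different components.
checkout_causes = rank_candidates(changes, graph, "checkout", 1000.0)
search_causes = rank_candidates(changes, graph, "search", 1000.0)
```

With this data, each spike is traced to a different change even though both happened at the same time, because only one of the two changes sits in each component’s dependency neighborhood.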
I skipped over an important capability there — did you catch it? CtrlStack doesn’t just build this unified timeline and dependency graph, and then force users to manually navigate it in the heat of battle; we’ve automated both of those searches underneath a single click, in a scripting language accessible to our users. Once you have a natural data representation for a given manual process, the process can often be turned into code. “Software eating the world” is really this interplay between representation and process, between data and code.
At CtrlStack, our vision is DevOps-as-Code; we believe that far more of manual DevOps can and should be turned into user-accessible code. We’re seeing the beginning of this trend in GitOps and even Infrastructure-as-Code. If the timeline and the graph — the when and the where — are the natural representation for troubleshooting, then code itself is the natural representation for operational knowledge: creation, sharing, discovery, execution. There are a number of barriers to DevOps-as-Code, but we’re systematically attacking each one. Techniques like Robotic Process Automation (RPA) and Generative AI (as you can try for yourself at PromptOps.com) will drive the next generation of DevOps-as-Code operations.
CtrlStack currently supports EKS-based and Kubernetes-based applications within AWS. As mentioned above, we’re in beta preview, and are working with teams that are medium-to-heavy users of observability. If you’d like to find out how CtrlStack could transform your incident response today by connecting cause and effect, contact me at [email protected] or sign up here.