Eric D. Schabell: O11y Guide: Cloud Native Observability Pitfalls

Friday, January 26, 2024

O11y Guide: Cloud Native Observability Pitfalls - Focusing on The Pillars

Are you looking at your organization's efforts to enter or expand into the cloud native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud native observability?

When you're moving so fast with agile practices across your DevOps, SRE, and platform engineering teams, it's no wonder this can seem a bit confusing.

Unfortunately, the choices being made have such a great impact on your business, your budgets, and the ultimate success of your cloud native initiatives that hasty decisions up front lead to big headaches very quickly down the road.

In the previous article, we looked at the problem of controlling cost in cloud native observability. In this article you'll find a discussion of the next pitfall, another common mistake organizations make. By sharing common pitfalls in this series, the hope is that we can learn from them.

After laying the groundwork in the previous article, it's time to tackle a pitfall where we need to stop focusing on The Pillars. I've spent some time in the past talking about Three Phases to Better Observability Outcomes and published an initial take on why Cloud Native Observability Needs Phases, but this article will be a deeper dive into the topic.

Focusing on The Pillars

For a few years now vendors have been marketing the idea that you need to focus on certain signals or pillars to achieve what you desire in the world of cloud native observability.

Focusing on tooling, not fixing the issue.

If you look more closely at this, they are pushing hard for you to concentrate on three pillars: metrics, logs, and tracing, with a few even sliding in events to make it sound all-encompassing. These are touted as things you can tangibly check a box on in your observability stack. What they end up doing is creating a focus on functionality and technology features while completely ignoring the problem at hand.

It's like we have a very nice and expensive car that we cherish and it's started to make funny sounds while emitting smoke when we are driving. We rush to our favorite garage and the mechanic listens to our issues, then proceeds to drag out their toolboxes to show off all the great tools they have to fix issues just like ours. While this is going on and on, we look out the window and see that our car is now not just smoking, but it's on fire!

When we ask our on-call engineers, who are the front line in the fight to keep our cloud native business thriving, they will describe a process they have to go through to achieve that in all the various areas that they consider worth monitoring in our business.

Meanwhile, the issues get worse.

When talking about the process and how it's important to our business goals, we hear the business talking in phrases like:

Better business outcomes...

Faster remediation of problems that occur...

Easier problem detection...

Greater revenue generation...

Happier customers...

Engineering teams focused on delivering business value

These are all in a language the business understands, and they describe the process that needs to be designed for rather than the features the tooling needs to have. When we bring this back to cloud native observability, we want a solution for our on-call engineers that walks them through the following three phases:

Knowing - we start by discovering something is happening as fast as possible, maybe even leading to a quick fix in this phase.
Triaging - if unable to fix immediately, then we start triaging based on specific targeted information that is directly related to the problem at hand which then quickly leads to fixing it.
Understanding - finally, possibly at a later time and at a slower investigative pace, we need to have a very deep understanding of the issues encountered to ensure they never happen again.
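
As a rough illustration of how these three phases could be modeled, here is a minimal Python sketch. The names `Phase`, `Incident`, and `next_phase` are hypothetical and do not come from any observability tool. The sketch only shows an incident moving from knowing to triaging to understanding, with a quick fix in the knowing phase allowed to skip triage:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    """The three phases described above (hypothetical model)."""
    KNOWING = "knowing"
    TRIAGING = "triaging"
    UNDERSTANDING = "understanding"


@dataclass
class Incident:
    summary: str
    phase: Phase = Phase.KNOWING
    remediated: bool = False


def next_phase(incident: Incident, fixed_now: bool) -> Optional[Phase]:
    """Advance an incident one step; return None once understanding is done."""
    if fixed_now:
        incident.remediated = True

    if incident.phase is Phase.KNOWING:
        # A quick fix while "knowing" skips triage entirely.
        incident.phase = (
            Phase.UNDERSTANDING if incident.remediated else Phase.TRIAGING
        )
    elif incident.phase is Phase.TRIAGING:
        # Stay in triage until the problem is actually fixed.
        if incident.remediated:
            incident.phase = Phase.UNDERSTANDING
    else:
        # Understanding is the last phase; nothing follows it.
        return None
    return incident.phase
```

The point of the sketch is that the workflow is keyed on where the engineer is in the process, not on which signal type the data came from.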

We don't want to be confronted with visualizations that have been designed around grouping information by categorized signals, or by the pillars. For example, here is something that was actually designed without much thought towards the process needed to solve any kind of issue, but it does capture the signals for you:

Good luck with this when you are on-call.

We really want to have clean, concise, and effective visualizations that present focused insights and put just enough information at our fingertips to make informed decisions quickly. We don't care if one metric, three labels, one span in a trace, and three log lines are the basis of the exact informational view we need to fix whatever made our beeper go off:

Sharply focused insights with just enough information to get you through the phases.
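
To make the idea concrete, here is a minimal Python sketch of a view organized around one problem rather than one signal type. Everything in it is an assumption for illustration: `FocusedView` is a hypothetical class, and the alert, metric, labels, span, and log lines are made-up values, not data from a real system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FocusedView:
    """Hypothetical on-call view built around a single problem."""
    alert: str
    metric_name: str
    metric_value: float
    labels: Dict[str, str] = field(default_factory=dict)
    slow_span: str = ""
    log_lines: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Mix signal types freely: the problem decides what is shown.
        lines = [
            f"ALERT: {self.alert}",
            f"{self.metric_name} = {self.metric_value}",
        ]
        lines += [f"  {key}={value}" for key, value in sorted(self.labels.items())]
        if self.slow_span:
            lines.append(f"slowest span: {self.slow_span}")
        lines += [f"log: {line}" for line in self.log_lines]
        return "\n".join(lines)


# Made-up example values: one metric, three labels, one span, three log lines.
view = FocusedView(
    alert="checkout error rate above threshold",
    metric_name="http_error_ratio",
    metric_value=0.07,
    labels={"service": "checkout", "region": "eu-west", "version": "v42"},
    slow_span="checkout -> payment-gateway",
    log_lines=[
        "payment-gateway timeout",
        "retry 1 failed",
        "retry 2 failed",
    ],
)
print(view.render())
```

Nothing in the view is grouped by metrics, logs, or traces; each item is included only because it helps answer the question the alert raised.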

The road to cloud native success has many pitfalls, and understanding how to avoid the pillars, focusing instead on solutions for the phases of observability, will save much wasted time and energy.

Underestimating impact of cardinality?
Coming up next

Another pitfall organizations struggle with in cloud native observability is underestimating cardinality issues. In the next article in this series, I'll share why this is a pitfall and how we can avoid it wreaking havoc on our cloud native observability efforts.

Below are the links to the other articles in this series: