
Tuesday, January 30, 2024

O11y Guide: Cloud Native Observability Pitfalls - Underestimating Cardinality

Are you looking at your organization's efforts to enter or expand into the cloud native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud native observability? 

When you're moving so fast with agile practices across your DevOps, SRE, and platform engineering teams, it's no wonder this can seem a bit confusing.

Unfortunately, the choices being made have a great impact on your business, your budgets, and the ultimate success of your cloud native initiatives, and hasty decisions made up front can lead to big headaches very quickly down the road.

In the previous article, we talked about how focusing on cloud native observability pillars has outlived its usefulness. Now it's time to move on to another common mistake organizations make: underestimating cardinality. By sharing common pitfalls in this series, the hope is that we can learn from them.

This article takes a look at how underestimating cardinality in our cloud native observability can very quickly ruin our day.

Cardinality, the silent killer

The biggest problem facing our developers, engineers, and observability teams is that they are flooded with cloud native data from their systems. Collecting all that we need versus exactly what we use to monitor our organization's applications and infrastructure is a constant struggle. This struggle leads to ever-increasing stress and frustration within teams trying to maintain some sense of order in the constant chaos.

Frustrations mounting?
The following quote is from an online forum where an SRE asked the community about dipping his toes into the tracing pond. Tracing is one form of gathering observability information across service calls, following the entire process end to end. He stated that he did not yet use tracing, but wanted to ask those who do why they did, and those who don't, why not. Simple question, right?

The first answer was dripping with the frustration and sarcasm you find in any organization that is struggling with cloud native observability.

"I don't yet collect spans/traces because I can hardly get our devs to care about basic metrics, let alone traces. This is a large enterprise with approx. 1000 developers." 

Next thing you know, a new deployment in our organization triggers an explosion of data: a small metric change sets off an exponential increase in cardinality. By collecting some unique value as a label, this change causes a huge spike in observability data that is now slowing down every query and dashboard. On-call support staff are being paged, and it plays out just like the 2023 Observability Report research showed: we spend on average 10 hours per week trying to triage and understand incidents, a quarter of a 40-hour work week!
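To make that concrete, here is a minimal sketch, using the Python prometheus_client library with hypothetical metric and label names, of how one small change, adding a user_id label to an existing counter, multiplies the number of time series the backend has to store and index.

```python
from prometheus_client import Counter

# Before: one time series per (method, status) pair, e.g. 5 x 5 = 25 series.
requests_total = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "status"],
)

# After: adding a unique value such as user_id as a label means every active
# user creates its own set of series, e.g. 5 x 5 x 100,000 = 2.5 million series.
requests_by_user_total = Counter(
    "http_requests_by_user_total",
    "Total HTTP requests per user",
    ["method", "status", "user_id"],
)

def handle_request(method: str, status: str, user_id: str) -> None:
    requests_total.labels(method=method, status=status).inc()
    # This single line is the cardinality bomb: one new series per unique user_id.
    requests_by_user_total.labels(method=method, status=status, user_id=user_id).inc()
```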

Cardinality spikes are another example of the cloud native complexity that is leaving our developers and engineers feeling like they are drowning. From the same observability report, 33% of those surveyed said the issues mentioned above spill over and disrupt their personal lives, and 39% reported feeling frequently stressed out.

The only way to get a handle on this is to have more insight into, and control over, how we deal with high volumes of cloud native observability data before it has the negative impact described above. We need a way to analyze, refine, and operate on our observability data live during ingestion, and to take immediate action when problems such as cardinality spikes arise.

When we have access to such a control plane, we can stop incoming cardinality spikes before they overload our systems, preventing that data from flooding our backend storage until we can sort out what is going on. This level of control gives on-call engineers powerful capabilities for taking immediate and decisive action, resulting in better outcomes.
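As an illustration of the idea, and not any particular vendor's implementation, the following toy Python sketch shows what ingestion-time control might look like: track how many unique series each metric has produced and, past a hypothetical threshold, aggregate away the offending high-cardinality label before the sample ever reaches storage.

```python
from collections import defaultdict

# Hypothetical per-metric limit; a real control plane would let operators tune this.
MAX_SERIES_PER_METRIC = 10_000

# Unique label sets (series) observed so far, keyed by metric name.
seen_series: dict = defaultdict(set)

def ingest(metric: str, labels: dict, high_cardinality_label: str = "user_id") -> dict:
    """Admit a sample, but once a metric crosses the series limit, strip the
    high-cardinality label so the backend only stores the aggregated series."""
    seen_series[metric].add(tuple(sorted(labels.items())))

    if len(seen_series[metric]) > MAX_SERIES_PER_METRIC and high_cardinality_label in labels:
        # Aggregate away the high-cardinality dimension before storage.
        labels = {k: v for k, v in labels.items() if k != high_cardinality_label}
    return labels

# Example: after a spike, samples for this metric land without the user_id label.
print(ingest("http_requests_by_user_total",
             {"method": "GET", "status": "200", "user_id": "42"}))
```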

Ignoring existing landscape?
The road to cloud native success has many pitfalls, and understanding how to avoid them, from focusing on the pillars to underestimating cardinality, will save much wasted time and energy.

Coming up next

Another pitfall organizations struggle with in cloud native observability is ignoring their existing landscape. In the next article in this series, I'll share why this is a pitfall and how we can avoid it wreaking havoc on our cloud native observability efforts.

Below are the links to the other articles in this series:

  1. Cloud Native Observability Pitfalls - Introduction
  2. Cloud Native Observability Pitfalls - Controlling Costs
  3. Cloud Native Observability Pitfalls - Focusing on The Pillars
  4. Cloud Native Observability Pitfalls - Underestimating Cardinality
  5. Cloud Native Observability Pitfalls - Ignoring Existing Landscape
  6. Cloud Native Observability Pitfalls - The Protocol Jungle
  7. Cloud Native Observability Pitfalls - Sneaky Sprawling Mess