Monday, January 22, 2024

O11y Guide: Cloud Native Observability Pitfalls - Controlling Costs

Are you looking at your organization's efforts to enter or expand into the cloud native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud native observability? 

When you're moving so fast with agile practices across your DevOps, SRE, and platform engineering teams, it's no wonder this can seem a bit confusing.

Unfortunately, the choices being made have a great impact on your business, your budgets, and the ultimate success of your cloud native initiatives, and hasty decisions up front lead to big headaches very quickly down the road.

The introduction to this series looked at the problem facing everyone with cloud native observability. This article digs into the first pitfall, another common mistake organizations make. By sharing these common pitfalls throughout the series, the hope is that we can learn from them.

After laying the groundwork in the previous article, it's time to tackle the first pitfall: controlling costs and the broken cost models we encounter with cloud native observability.

O11y costs broken

One of the biggest topics of the last year has been how broken the cost models are for cloud native observability. I previously wrote about why Cloud Native Observability Needs Phases, detailing how the second generation of observability tooling suffers from this broken model.

Are you able to understand o11y costs?
"The second generation consisted of application performance monitoring (APM) with the infrastructure using virtual machines and later cloud platforms. These second generation monitoring tools have been unable to keep up with the data volume and massive scale that cloud native architectures..."

These tools store all of our cloud native observability data and charge for it, and as our business finds success and data volumes scale, we end up with expensive observability tooling, degraded visualization performance, and slow data queries (rules, alerts, dashboards, etc.).

Organizations would not care how much data is being stored or what it costs if they had better outcomes, happier customers, higher levels of availability, faster remediation of issues, and above all, more revenue. Unfortunately, as pointed out on TheNewStack, “It’s remarkable how common this situation is, where an organization is paying more for their observability data, than they do for their production infrastructure.”

The issue quickly comes down to the answer to one question: "Do we need to store all our observability data?" The quick and dirty answer is, of course not! Yet there has been almost no incentive for tooling vendors to provide insights into which of the data we are ingesting is actually being used and which is not.

It turns out that when you take a good look at the data coming in and filter out at ingestion everything that is not touched by any user, not ad-hoc queried, not part of any dashboard, not part of any rule, and not used for any alerts, it makes quite a difference in data costs.

In one example, we designed a dashboard for a service status overview that initially ingested over 280K data points. After inspecting that flow and verifying that most of these data points were not used anywhere in the organization, the same ingestion flow was reduced to just 390 data points being stored. The exact cost reduction depends on your vendor pricing, but at this scale it's obviously a dramatic cost control tool.
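To make the idea concrete, here is a minimal Python sketch of usage-based filtering; the function names and data shapes are hypothetical and not any vendor's API. It builds the set of metric names actually referenced by dashboards, rules, and alerts, then drops everything else at ingestion.

```python
# Minimal sketch of usage-based filtering at ingestion.
# All names and data shapes here are hypothetical, for illustration only.

def used_metric_names(dashboards, rules, alerts):
    """Collect every metric name referenced by a dashboard, rule, or alert."""
    used = set()
    for dashboard in dashboards:
        used.update(dashboard["metrics"])
    for rule in rules:
        used.add(rule["metric"])
    for alert in alerts:
        used.add(alert["metric"])
    return used


def filter_at_ingestion(data_points, used):
    """Keep only data points whose metric is actually used downstream."""
    return [point for point in data_points if point["metric"] in used]


if __name__ == "__main__":
    dashboards = [{"name": "service-status", "metrics": {"http_requests_total", "up"}}]
    rules = [{"name": "error-rate", "metric": "http_requests_total"}]
    alerts = [{"name": "instance-down", "metric": "up"}]

    incoming = [
        {"metric": "http_requests_total", "value": 42},
        {"metric": "up", "value": 1},
        {"metric": "go_memstats_alloc_bytes", "value": 123456},  # never queried or alerted on
    ]

    used = used_metric_names(dashboards, rules, alerts)
    stored = filter_at_ingestion(incoming, used)
    print(f"ingested {len(incoming)} points, storing {len(stored)}")
```

The real savings come from doing this analysis continuously against live dashboards, rules, and alerts rather than against a hand-maintained list, which is exactly the kind of insight tooling vendors have had little incentive to provide.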

It's important to understand that we need to ingest what we can collect, but we really only want to store what we are actually going to use for queries, rules, alerts, and visualizations. Architecturally, we are assisted by having control plane functionality and tooling sitting between our data ingestion and data storage. Any data we are not storing can later be passed through to storage should a future project require it.
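As a rough sketch of that architectural idea, again with hypothetical class and method names rather than a specific product, a small control plane applies the keep/drop decisions between ingestion and storage, and can later be told to start storing a previously dropped metric when a new project needs it.

```python
# Rough sketch of a control plane sitting between ingestion and storage.
# The class and method names are hypothetical, for illustration only.

class ControlPlane:
    def __init__(self, used_metrics):
        # Metrics currently worth storing (used by queries, rules, alerts, dashboards).
        self.stored_metrics = set(used_metrics)

    def enable_passthrough(self, metric_name):
        """Start storing a previously dropped metric when a future project needs it."""
        self.stored_metrics.add(metric_name)

    def route(self, data_point, storage):
        """Ingest everything, but only persist what is actually used."""
        if data_point["metric"] in self.stored_metrics:
            storage.append(data_point)


storage = []
control_plane = ControlPlane(used_metrics={"http_requests_total", "up"})

control_plane.route({"metric": "go_memstats_alloc_bytes", "value": 1}, storage)  # dropped
control_plane.route({"metric": "up", "value": 1}, storage)                       # stored

# Later, a new project needs memory metrics, so we flip them back on.
control_plane.enable_passthrough("go_memstats_alloc_bytes")
control_plane.route({"metric": "go_memstats_alloc_bytes", "value": 2}, storage)  # now stored
```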

Finally, without standards and ownership of the cost controlling processes in an organization, there is little hope of controlling costs. To this end, the FinOps role has become critical to many organizations, and the field formed a community in 2019 known as the FinOps Foundation. It's very important that cloud native observability vendors join these efforts moving forward, and this should be a point of interest when evaluating new tooling. Today, 90% of the Fortune 50 have FinOps teams.

Big pitfall in your path?
The road to cloud native success has many pitfalls, and understanding how to avoid the pillars, focusing instead on solutions for the phases of observability, will save much wasted time and energy.

Coming up next

Another pitfall is when organizations focus on The Pillars in their observability solutions. In the next article in this series, I'll share why this is a pitfall and how we can avoid it wreaking havoc on our cloud native observability efforts.

Below are the links to the other articles in this series:

  1. Cloud Native Observability Pitfalls - Introduction
  2. Cloud Native Observability Pitfalls - Controlling Costs
  3. Cloud Native Observability Pitfalls - Focusing on The Pillars
  4. Cloud Native Observability Pitfalls - Underestimating Cardinality
  5. Cloud Native Observability Pitfalls - Ignoring Existing Landscape
  6. Cloud Native Observability Pitfalls - The Protocol Jungle
  7. Cloud Native Observability Pitfalls - Sneaky Sprawling Mess