Thursday, January 18, 2024

O11y Guide: Cloud Native Observability Pitfalls - Introduction

Are you looking at your organization's efforts to enter or expand into the cloud native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud native observability? 

When you're moving so fast with agile practices across your DevOps, SRE, and platform engineering teams, it's no wonder this can seem a bit confusing. 

Unfortunately, the choices being made have a great impact on your business, your budgets, and the ultimate success of your cloud native initiatives, and hasty decisions up front lead to big headaches very quickly down the road.

In this article, you'll find an introduction to the problem facing everyone with cloud native observability. It is the first article in a series covering common pitfalls that I'm seeing organizations make. The idea is that by sharing these pitfalls, we all can learn from them.

Let's get started with a little background before sharing the first cloud native observability pitfalls.

The cloud native delusion

The promise of cloud native?
It all started when we were promised the wonderful world of cloud native. This was Kubernetes, containers, and a lot of automation to allow us to design our applications and services without worrying about the underlying infrastructure. Containers meant we could isolate our code, and Kubernetes meant we only had to configure how the containers and infrastructure should run and respond to load changes. We point that configuration at our Kubernetes cluster in the wild, be that self-hosted or on a public cloud platform, and it automatically spins up and adjusts as our business grows.

Anyone starting in the cloud native world vividly remembers that moment with container tooling on your development machine: coding, building containers, and seamlessly running those containers to your heart's content. After some testing, you'd push your work into your organization's build pipelines, and off it went towards production environments.

On the other side of the coin, there was Application Performance Monitoring (APM), which was able to monitor your well-defined infrastructure and applications from rack-mounted machines all the way to the world of Virtual Machines (VMs). With cloud native came the new world of observability, which had to mature and deal with the more dynamic, expanding, and vast monitoring needs of today's cloud native environments. We started to collect logs, then decided we needed to label things with metrics and collect those, then came events like code changes or feature flags, and finally we tried to capture how service calls were bouncing around different paths through our cloud native architectures with traces. There were tools and vendors giving us the ability to capture, store, and visualize all these signals, and we thought nothing of the data we were capturing as it grew in our cloud native environments.

That's the theory anyway.

What they didn't tell us was that along with all of this automation and cloud magic comes an almost uncontrollable flood of data into your organization. If you look around, you can find all manner of examples of the amount of cloud data that a single, simple test application generates over time. One such example is the following experiment:

  • A single Hello World application was deployed to a four node Kubernetes cluster. Load was generated using the script that comes with the app.
  • Additional scripting was created to scrape the Prometheus endpoints and record the size of the data payloads (a minimal sketch of this idea follows the list).
  • Another script accepted Jaeger tracing spans and End User Metric (EUM) beacons, recording the size of the data payloads.
  • Fluentd collected all the logs and concatenated them all into one flat file. Using the timestamps from the log file, one hour was extracted into a new file, which was then measured.
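
For a sense of how such measurement scripting might look, here is a minimal sketch in Python. This is not the code from the original experiment; the endpoint URL, log file path, and timestamp format are assumptions made purely for illustration.

  # measure_telemetry.py - illustrative sketch only; the endpoint URL, log path,
  # and timestamp format are assumptions, not the original experiment's code.
  import datetime

  import requests  # third-party HTTP client (pip install requests)

  PROMETHEUS_ENDPOINT = "http://localhost:9090/metrics"  # hypothetical scrape target
  LOG_FILE = "fluentd-concatenated.log"                   # hypothetical Fluentd flat file

  def scrape_payload_size(url: str) -> int:
      """Scrape a Prometheus-style endpoint and return the payload size in bytes."""
      response = requests.get(url, timeout=10)
      response.raise_for_status()
      return len(response.content)

  def extract_one_hour_bytes(log_path: str, start: datetime.datetime) -> int:
      """Total bytes of log lines whose leading ISO-8601 timestamp falls within
      one hour of the given start time (assumes each line starts with a timestamp)."""
      end = start + datetime.timedelta(hours=1)
      total = 0
      with open(log_path, "r", encoding="utf-8") as log_file:
          for line in log_file:
              try:
                  stamp = datetime.datetime.fromisoformat(line[:19])
              except ValueError:
                  continue  # skip lines without a parsable timestamp prefix
              if start <= stamp < end:
                  total += len(line.encode("utf-8"))
      return total

  if __name__ == "__main__":
      print(f"metrics payload: {scrape_payload_size(PROMETHEUS_ENDPOINT)} bytes")
      window = datetime.datetime(2024, 1, 18, 10, 0, 0)
      print(f"one hour of logs: {extract_one_hour_bytes(LOG_FILE, window)} bytes")

Run something like this on a schedule and the recorded payload sizes add up to exactly the kind of totals described below.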

Shocked by data volumes?
The short explanation is that the total data volume collected in this simple example was in excess of 450GB, almost half a terabyte of data. 

If that's not shocking enough, then look at this article that shares how "Most companies default to 13 months... " when setting data retention rates. This means you have cloud native environments creating and destroying containers multiple times a day and saving all the data generated in those processes for 13 months. Common sense already tells us that there must be a lot of data in our cloud native environments that we neither need nor use.
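
A quick back-of-envelope calculation shows why that retention default matters. The daily telemetry volume below is a made-up figure used only for illustration; the 13-month retention period is the only number taken from the quote above.

  # retention_math.py - back-of-envelope illustration; the daily volume is a
  # hypothetical assumption, only the 13-month retention comes from the quote.
  DAILY_TELEMETRY_GB = 50        # assumed telemetry generated per day
  RETENTION_DAYS = 13 * 30       # roughly 13 months of retention

  total_gb = DAILY_TELEMETRY_GB * RETENTION_DAYS
  print(f"Retained telemetry: {total_gb} GB (~{total_gb / 1024:.1f} TB)")
  # At 50 GB/day kept for ~390 days, that's 19,500 GB, roughly 19 TB of
  # observability data sitting in storage for a single environment.

Swap in your own daily volume and the point stands: the retention window, not the day-to-day collection, is what quietly multiplies the storage bill.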

Remember that moving data through the cloud pipeline and storing that data are the biggest costs we have in our cloud native environments. Seeing all of this, do we dare to flip the switch on a new data collection? Can we afford to?

Now think about cloud native at scale, meaning we are becoming very successful and our cloud native architecture swells in volume, as it's automatically designed to. While we can limit the amount of scaling, this is generally done only for non-production environments, as we don't ever want our core business to be limited in scale. This leads to thousands, if not millions, of containers in our cloud infrastructure, with data flooding into our organization and into our observability platform. 

Flooded by cloud native data?
In the beginning, our monitoring was well defined and capable of providing us with all the insights we needed to manage our cloud infrastructure. When we have to handle surging business and cloud native environments at scale, we initially combat the flood of data by adding more observability tooling, which quickly snowballs into a vast array of cloud infrastructure and tooling, not unlike the view across a container shipping port.

This is a common practice and leads to organizations paying more for their cloud native observability infrastructure (tooling, data, and storage) than for their actual cloud business infrastructure. Our organization is drowning at the bottom of the data deluge that is cloud native at scale. We quickly lose our ability to create visual insights (aka dashboards) so that our on-call engineers can respond in a timely fashion to any sort of failure.

While all of this doom and gloom sounds like the end of the world is nigh, there is hope. We can create a cloud native world for our organization that is structured, scales, and provides insights to our on-call engineers that keep our business running smoothly at scale. Data visualizations can be achieved with just enough information, the right information, and the right links to activities that remediate issues quickly and effectively. We need the ability to control what data we collect, what data we store, and how we visualize it all in an orderly fashion. 

The hope is that we can learn from others who have gone down these roads before, had their businesses massively scale, encountered these issues, and tackled their problems. The following sections outline insights gained from organizations that have tackled cloud native observability at scale and are creating both order and sustainability in their cloud native environments.

Observability costs out of control?

The road to cloud native success has many pitfalls and the more you know about them, the easier they are to tackle or avoid altogether.

Coming up next

A big pitfall is that organizations struggle with controlling costs for cloud native observability. In the next article in this series, I'll share why this is a pitfall, what you should be thinking about instead, and how the business should be central to your cloud native observability efforts.

Below are the links to the other articles in this series:

  1. Cloud Native Observability Pitfalls - Introduction
  2. Cloud Native Observability Pitfalls - Controlling Costs
  3. Cloud Native Observability Pitfalls - Focusing on The Pillars
  4. Cloud Native Observability Pitfalls - Underestimating Cardinality
  5. Cloud Native Observability Pitfalls - Ignoring Existing Landscape
  6. Cloud Native Observability Pitfalls - The Protocol Jungle
  7. Cloud Native Observability Pitfalls - Sneaky Sprawling Mess