Monday, August 22, 2022

Cloud Data - Observability is the forgotten data


The daily hype is all around you.

From private to public cloud, multi-cloud, and even hybrid cloud, you're overrun with information telling you this is the path to your digital future. To complicate matters, while you contemplate these choices you are still expected to keep up with your daily tasks of enhancing customer experiences and delivering applications in an agile way.

Wrapped up in all this delivery and architectural infrastructure, there's a multitude of decisions around data to be considered when engaging with any cloud experience. There are regulatory and compliance pressures that force you to evaluate how you collect, process, and store your observability data. Understanding the pitfalls around the collection, maintenance, and storage of your cloud data can mean the difference between failure and success within your cloud strategy.

This series is based on a talk given previously in Dublin, Ireland and was brainstormed with my good friend Roel Hodzelmans. The reactions from the audience inspired me to share the concepts in this series.

The first article in this series provided an introduction to cloud and data, and what that means in a cloud-native architecture beyond just storage. This second article discusses the forgotten data that is often overlooked when planning cloud-native architectural solutions.

Observability is the forgotten data

When you look at observability you might be thinking about data generated from logs, traces, metrics, and even events across your landscape. What you probably do not realize is that many of your applications and platforms have standard installation settings that generate large amounts of observability data by default. If you are not accounting for all that data being generated when you are heading into the cloud, you are going to have a hard time meeting your budget constraints for deploying and running your production solutions.

Martin Mao stated earlier this year that the growth of observability data is out of control, and talked about how organizations wouldn't mind paying for that data if it led to better outcomes, such as happier customers, higher availability, faster remediation, or more revenue.

“Paying more for logging/metrics/tracing doesn’t equate to a positive user experience. Consider how much data can be generated and shipped. $$$. You still need good people to turn data into action. It’s remarkable how common this situation is, where an organization is paying more for their observability data (typically metrics, logs, traces, and sometimes events), than they do for their production infrastructure.”

Let's take a look at a simple experiment presented in an article on the hidden cost of data observability, where a simple hello world application was deployed on a four-node Kubernetes cluster on GKE (see the article for details of the setup). Scripts were used to simulate load on the application, and 30 days of observability data were collected in the following categories:

  • Tracing - one trace per second over 30 days totalled 2.5M traces for a total data size of 161GB.
  • End user metrics - each back-end call generated a user interaction, so over 30 days that's 2.5M EUM traces for a total data size of 1GB.
  • Logs - mileage may vary depending on the configuration of your logging, but here it was a 30-day total data size of 3.4GB.
  • Metrics - collected using Prometheus configured for a 10-second sample rate across the cluster, for a 30-day total data size of 285GB.

Granted, this might not be a perfect example for your research, but it is simple and gives easy-to-follow results: just over 450GB of data in 30 days for a single, simple application.
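To make the arithmetic behind that total easy to check, here is a minimal Python sketch. The figures are the 30-day sizes quoted in the list above; the dictionary keys and helper names are purely illustrative and do not come from the original experiment.

```python
# 30-day data volumes in GB, as quoted from the experiment above.
# The key names are illustrative labels, not names used by the experiment.
volumes_gb = {
    "tracing": 161.0,
    "end_user_metrics": 1.0,
    "logs": 3.4,
    "metrics": 285.0,
}


def total_gb(volumes):
    """Sum the per-category data sizes."""
    return sum(volumes.values())


def events_at_one_per_second(days):
    """Number of events produced at a steady rate of one per second."""
    return days * 24 * 60 * 60


total = total_gb(volumes_gb)
print(f"30-day total: {total:.1f} GB")

# Share of the total per category, largest first.
for name, gb in sorted(volumes_gb.items(), key=lambda item: item[1], reverse=True):
    print(f"  {name}: {gb / total:.1%}")

# Sanity check on the trace count for one trace per second over 30 days.
print(f"traces in 30 days: {events_at_one_per_second(30):,}")
```

Running it shows that metrics and tracing account for almost all of the volume, while logs and end user metrics barely register in this particular setup.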

If you take into account that the average retention period for audits and compliance is 13 months, you have to ask yourself how much data you have to collect, transport, and store across your cloud architecture(s). In modern cloud-native architectures you can be deploying multiple times a day, with a container sometimes only around for a few minutes or hours. Does the observability data generated there really need the default retention of 13 months? Setting a retention period for each data type may help reduce the volume of data you have to keep.

Also consider the various environments that are set up and torn down weekly or bi-weekly, such as test or lab environments. These certainly don't need extensive observability data retention, if any at all.
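As a rough illustration of what per-type retention could mean for stored volume, the sketch below projects the experiment's 30-day figures forward. It rests on several assumptions that are not in the original experiment: data is generated at a constant rate, 13 months is approximated as 13 periods of 30 days, and the tiered retention values are hypothetical numbers chosen only for the example.

```python
EXPERIMENT_DAYS = 30

# 30-day data volumes in GB, as quoted from the experiment above.
volumes_gb = {
    "tracing": 161.0,
    "end_user_metrics": 1.0,
    "logs": 3.4,
    "metrics": 285.0,
}

# Assumption: 13 months approximated as 13 periods of 30 days.
THIRTEEN_MONTHS_DAYS = 13 * 30


def steady_state_gb(volume_gb, retention_days, period_days=EXPERIMENT_DAYS):
    """Data held once the retention window is full.

    Assumes data is generated at a constant daily rate.
    """
    return volume_gb / period_days * retention_days


def total_retained_gb(volumes, retention_days_by_type):
    """Total stored volume given a retention period per data type."""
    return sum(
        steady_state_gb(gb, retention_days_by_type[name])
        for name, gb in volumes.items()
    )


# Every data type kept for the full 13 months.
uniform = {name: THIRTEEN_MONTHS_DAYS for name in volumes_gb}

# Hypothetical tiered policy: only logs are kept for the full period.
tiered = {
    "tracing": 7,
    "end_user_metrics": 30,
    "logs": THIRTEEN_MONTHS_DAYS,
    "metrics": 90,
}

print(f"uniform 13 months: {total_retained_gb(volumes_gb, uniform):,.1f} GB")
print(f"tiered retention:  {total_retained_gb(volumes_gb, tiered):,.1f} GB")
```

The exact numbers matter less than the shape of the result: under these assumptions, shortening retention for the two largest data types removes most of the stored volume, even with logs kept for the full period.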

As Martin noted, paying for more data is one thing, but people are the core of any successful use case.

“Paying more for logging/metrics/tracing doesn’t equate to a positive user experience. Consider how much data can be generated and shipped. $$$. You still need good people to turn data into action.”

Who owns these decisions?

Once you realize that there is a lot of unexpected cloud data coming out of your architecture, the issue remains of who owns these decisions in your organization. The observability data explosion can cause a lot of issues and costs, but the question to answer is:

Do you dare to flip the switch on a new data collection in your architecture?

The next article in this series will take a look at what the industry is going to be doing in the near future to ensure that organizations have a financial owner for these decisions.