Tuesday, September 20, 2022

O11y Guide - Cloud Native Observability Needs Phases

This is the third article in the series covering my journey into the world of cloud native observability. If you missed any of the previous articles, head on back to the introduction for a quick update.

After laying out the groundwork for this series in the initial article, I spent some time in the second article sharing who the observability players are. I also discussed the teams that these players are on in this world of cloud native o11y. 

In this third article it's time to dive a bit into the impression I'm getting from the message being pushed for cloud native o11y solutions. 

Being a developer from my early days in IT, it's very interesting to explore the complexities of cloud native o11y. Monitoring applications goes way beyond just writing and deploying code, especially in the cloud native world.

When exploring the world of o11y, there are two very distinct lines of discussion. One is the same as is often found in the developer world, where it's all about technology. This is a very developer-centric, bottom-up approach to any technical problem. The other is attempting to deliver on promises of agility, cost, customer satisfaction, and productivity. To simplify, it's either talking about technology features or focusing on business outcomes.

Talking technology pillars

The monitoring landscape has evolved over the years through three generations, guided by both technology and need. The first generation was based on the data center and keeping tabs on monolithic applications. Simply put, it was keeping tabs on what was running or not. 

The second generation consisted of application performance monitoring (APM), with the infrastructure using virtual machines and later cloud platforms. These second generation monitoring tools have been unable to keep up with the data volume and massive scale of cloud native architectures, so there arose a need for a new third generation for real cloud native observability at scale.

Any time there is a transition between generations in technology, there will be one side urging you to stick with what has worked so far. This often comes in the form of throwing a lot of technological messaging at you, with a very real focus on the features you need to succeed. I call this talking about bits and bytes above business outcomes.

With the transition to cloud native o11y it's been no different, with many trying to focus the discussion around three pillars used to tackle these challenges: metrics, tracing, and logs. When you focus on these three technology aspects, the discussion struggles to address the sheer volume of data that cloud native generates. It also ignores the complex integrations needed to monitor across massively scaled infrastructure in the cloud native world. Vendors just focus on three simple items in the technology realm and hope you buy into monitoring them with their products.

The three pillars are data types you do need to handle in your o11y platform, but they do not provide the path to delivering on your organization's cloud native promises. Your cloud native o11y needs are much better served by an approach that is designed for better business outcomes, fulfills your promises to customers, and focuses on three phases of o11y.

Phases to better outcomes

We all want to have better business outcomes for our organization's solutions, such as faster remediation of problems, easier problem detection, greater revenue generation, happier customers, and engineering teams that can remain focused on delivering more business value.

The problem with the three pillars is that you are talking about technology aspects and not about solutions. It's like talking about the tools in a mechanic's toolbox used to make your convertible run again, instead of focusing on the blue smoke coming out of the exhaust and the rising engine temperature, and using that data to quickly remediate the problem by replacing the seals to prevent oil leaking into the engine.

The phases you go through start with knowing, as fast as possible, that a problem is happening, which might even lead to fixing it immediately. If not, then you start triaging based on specific information related to the problem, which quickly leads to fixing it. Finally, you want to have a very deep understanding of the issue you just encountered to ensure it never happens again.

None of these phases requires you to focus on data types or specific technology details. They do need you to have an o11y platform in place that can provide sharply focused insights and put enough information at your fingertips for you to make informed decisions quickly.

Open o11y

Now that we are going to stop talking about pillars of o11y and start exploring how we can use the phases discussed above, let's look at what the open source world has to offer. It's more than just technology or projects; it's going to be important to understand the power and freedom that open standards provide.

Next up, I plan to dig into these open standards and see where open source might be able to take us in our cloud native o11y journey.