Eric D. Schabell: O11y Guide - Who are the Cloud Native Observability Players?

Wednesday, September 14, 2022

This is a continuation of the series taking you on my journey into the world of cloud native observability. It's a world that is altering the way developers work in their daily jobs, creating new teams, and giving rise to new roles that attempt to keep control of the complexity these large-scale cloud native architectures deliver.

The first article in this series covered how developers have to deal with more than just code in a cloud native world. It shared a look at cloud native observability (o11y) and touched on what the three pillars are versus the three phases of observability.

This second article takes you out onto the playing field where you need to understand who the players are and what teams they form. It's no longer a world full of developers and operations teams as the cloud native environments have pushed right on through those traditional walls.

Let's dive right in, shall we?

The basic introduction started from the point where developers worked in a world without clouds and then made the transition to cloud native development. What does this mean for them, and what are some of the challenges they are having to embrace?

The playing field

Over time, the traditional developer and operations teams transitioned to different ways of working in the cloud native world. Developers moved into DevOps teams, where development and operations activities merge and attempts are made at process agility. Organizations have gone from DevOps to platform engineering, and then moved to a more mature structure called CloudOps with a clear focus on cloud infrastructure. Beyond this, we're seeing a role emerge today known as the Site Reliability Engineer (SRE), who is part of a team focused on a broader spectrum of modern resource reliability and not just the organization's cloud infrastructure. Finally, at the larger scale of cloud native operations, there is a new kid on the block known as the central observability team.

Let's look at each one, shall we?

DevOps teams

DevOps is a first step on the road to cloud native operations and bridges both development and operations teams. The following definition shows that these teams have a specific mandate:

"DevOps is primarily the automation and optimization of the application development lifecycle, including post-launch fixes and updates. It uses continuous development, integration, testing, and deployment of cloud, computer, and downloadable applications. It also focuses on IT operations as they relate to application performance and availability."

By bringing operations and development closer together to focus on processes and automation, these teams push for agility, reliability, and speed in support of business goals within their organization. DevOps remains focused on application development and delivery, often because the organization runs more than just cloud native infrastructure.

Platform engineering teams

The next team to appear on the scene is one that takes the lessons learned from the DevOps experience and owns the engineering self-service experience as defined here:

“Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era.”

The idea is that if the experience is more self-service, with pre-defined infrastructure for deploying engineering projects, then deploying code becomes less time consuming for developers.

CloudOps teams

This definition puts CloudOps at the center of a business operational focus.

"...CloudOps provides organizations with proper (cloud) resource management. In an organization, CloudOps uses DevOps principles and IT operations applied to a cloud-based architecture to speed up the business processes."

This is a shift towards operations focusing more specifically on the cloud native infrastructure than on the other infrastructures available in an organization. Once the dependency on past infrastructure choices has been reduced, these teams are scaled up to improve the development architecture (infrastructure in the cloud). They focus on simplifying cloud provisioning and application deployment to the cloud, and are big users of observability platforms for both applications and infrastructure in the cloud.

Site reliability teams

Oscar Wilde once said, "With age comes wisdom, but sometimes age comes alone." As organizations become more active in a cloud native world and scale up to full CloudOps teams alongside their DevOps teams, another role is emerging to fill a gap left behind. That role is the SRE, and it doesn't focus only on the cloud native infrastructure.

"Instead, an SRE is an all-purpose role that aims to manage reliability for any type of environment."

SREs have to use both IT operations and development strategies to ensure that there is a focus on one thing, and one thing only: reliability. It's a full time job avoiding downtime and optimizing the performance of all applications and supporting infrastructure, regardless of whether they run in the cloud native world or not. Together with CloudOps teams, they are very active players in cloud native observability and the platforms used to assist them. They have a vested interest in cloud or multi-cloud security, costs, deployment automation, and all things that help observability at scale.

Central observability teams

The newest evolution was predicted by Martin Mao back in December of 2021:

“This team is responsible for defining observability standards and practices, delivering key data to engineering teams and managing the tooling and storage of observability data, among other things.”

This team has become more the norm than the exception over this last year as organizations investing in cloud native at scale ramp up their observability practices. Their main focus is to define standards and practices that can be used by everyone, thus centralizing observability in their organization.
The following are four functions that the central observability team should own:

Define: Define monitoring standards and practices
Deliver: Provide monitoring data to engineering teams. It must be in a format they are familiar with (e.g., Prometheus).
Measure: Ensure reliability and stability of monitoring solutions.
Manage: Manage tooling and storage of metrics data. Make it simple: if it takes a ninja, people won’t use it.
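
To make the Define and Deliver functions a little more concrete, here is a minimal sketch in Python of what a central observability team might automate. It is only an illustration under stated assumptions: the required labels (team, service) and the function names check_metric and render_sample are hypothetical and not taken from any particular tool. The metric name pattern and the sample line layout follow the Prometheus text exposition format.

```python
import re

# Pattern Prometheus documents for valid metric names.
METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

# Hypothetical team standard: every metric must carry these labels.
REQUIRED_LABELS = {"team", "service"}


def check_metric(name, labels):
    """Define: return a list of standard violations (empty means compliant)."""
    problems = []
    if not METRIC_NAME.match(name):
        problems.append(f"invalid metric name: {name}")
    for label in sorted(REQUIRED_LABELS - set(labels)):
        problems.append(f"missing required label: {label}")
    return problems


def render_sample(name, labels, value):
    """Deliver: render one sample as a Prometheus text exposition line.

    Assumes simple label values; escaping of quotes, backslashes,
    and newlines is left out of this sketch.
    """
    pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


labels = {"team": "payments", "service": "checkout"}
if not check_metric("http_requests_total", labels):
    print(render_sample("http_requests_total", labels, 42))
```

A check like this could run in a build pipeline so that teams find out about a non-compliant metric before it ever reaches shared storage, which fits the advice to keep things simple enough that nobody needs a ninja.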

This has been a quick, down and dirty look at the teams on the field. Now let’s move on to the game.

The observability game

This takes us from the basic introduction, through a tour of the o11y playing field, to meeting the players on the teams involved in cloud native o11y.

Next up, I want to dive deeper into the types of observability data and why at scale you might want to start thinking about the phases of cloud native o11y instead.