Eric D. Schabell: Ops Happiness - The Quest for Operations Intelligence

Monday, February 6, 2017

In the pursuit of Ops Happiness...

(Written with guest author: Miguel Perez Colino, Senior Product Manager, Integrated Solutions Business Unit, Red Hat)

We have very high expectations for any Cloud Native or mode 2 applications deployed on Red Hat hybrid cloud solutions.

When running Red Hat technologies in production, we want our new workloads to be running on top of certified products. They should be architected and deployed with help from certified professionals, proactively maintained with the help of world-class support services, and backed by the option to enable organizational resources with training and certifications.

No matter how much support is put into place, the customer needs to be able to operate their hybrid clouds.

From log aggregation to IT intelligence

Let’s imagine we are delivering a solution for a customer that is building their Digital Foundations to modernize their application development. It’s based on a private Infrastructure as a Service (IaaS) and leverages a container application platform. We can use the Reference Architecture to help deploy our Red Hat OpenShift Container Platform on Red Hat OpenStack Platform, and review compatibility beforehand using the Cloud Deployment Planner. Now, as Alessandro Perilli explains in "How to manage the cloud journey?", complexity can grow with scale, so it's better to tackle it from the very beginning.

Log aggregation?

Now we add management capabilities, starting with the Cloud Management Platform, Red Hat CloudForms, which can handle both the IaaS and the container application platform. It manages all the deployed components and provides insights into the microservice applications built with containers, orchestrated by Kubernetes, running on instances within a tenant on the capabilities provided by the hardware platform.

With this we have covered Day 0, when we plan, and Day 1, when we deploy, and we now face Day 2, when we operate the full platform. What would our customers do on Day 2 if they faced a physical issue, such as a network cable or network card failing?

The first step would be to investigate why a particular application is behaving incorrectly. We would review the metrics, the logs and the changes to the application in question and the configuration of the application server only to realize that the root of the issue is somewhere else. Then we would go to the container application platform to get the logs and config changes for it, plus the logs and metrics of the operating system underneath ... all of them, from all the Virtual Machines (VMs) to find out that the root of the issue is somewhere else.

Finally, we would go to the IaaS deployment to get all the logs, metrics and configuration changes performed as well as the ones from the operating system ... all of them, from all the physical machines to realize that the root of the issue is yet again, somewhere else ... in an improperly patched cable or failed piece of physical hardware. Even with the great tools that we have in our management portfolio, finding a root cause for an issue like this requires increasing the situational awareness of our IT. With the number of layers and pieces required for mode 2 deployments and applications, this goes from a "nice to have" to an "absolute need".

You may be asking yourself, how can this be made simpler and easier to trace?

The obvious answer is to start performing log aggregation. We are already working on several fronts. Tushar Katarki presented his thoughts at RHTE APAC 2016, and work focusing on log aggregation is in progress. As Tushar puts it, it’s already improving the lives of users by reducing the operational burden and improving the efficiency of keeping the platform running.
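To make the idea concrete, here is a minimal sketch of what aggregation means at its simplest: taking log streams from each layer and merging them into one time-ordered timeline. This is only an illustration, not how any Red Hat product implements it. The `LogRecord` type, the layer names, the hosts and the sample messages are all made up for the example, and each stream is assumed to be already sorted by time.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class LogRecord(NamedTuple):
    timestamp: datetime
    layer: str      # hypothetical layer label, e.g. "application" or "iaas"
    host: str
    message: str


def aggregate(*streams: Iterable[LogRecord]) -> Iterator[LogRecord]:
    """Merge per-layer streams (each already time-sorted) into one ordered stream."""
    return heapq.merge(*streams, key=lambda record: record.timestamp)


def ts(text: str) -> datetime:
    """Parse an ISO 8601 timestamp string."""
    return datetime.fromisoformat(text)


# Made-up sample records, for illustration only.
app_logs = [
    LogRecord(ts("2017-02-06T10:00:05"), "application", "app-1", "request timeout"),
    LogRecord(ts("2017-02-06T10:00:09"), "application", "app-1", "request timeout"),
]
platform_logs = [
    LogRecord(ts("2017-02-06T10:00:03"), "container-platform", "node-2", "pod unreachable"),
]
iaas_logs = [
    LogRecord(ts("2017-02-06T10:00:01"), "iaas", "compute-7", "link down on eth1"),
]

timeline = list(aggregate(app_logs, platform_logs, iaas_logs))
for record in timeline:
    print(record.timestamp.isoformat(), record.layer, record.host, record.message)
```

Read top to bottom, the merged timeline shows the lowest-layer event first, which is the kind of view that saves walking through each layer by hand.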

To see how we might improve situational awareness based on our current log aggregation process, we first need to fully understand that process. We can do this by focusing on the data and its journey from the moment it is generated to the moment it is consumed. I talked this over with Javier Roman Espinar, an Architect with a background in High Performance Computing and Big Data deployments, who explained that this can be considered a Big Data issue and should be analyzed from that perspective.

We can use his article "Big Data Enterprise Architecture: Overview" as a starting point to analyze the issue, and we will find that logs (or other data) pass through several stages before they become useful. These are the stages that the data goes through:

And the mapping to our current solution with Elasticsearch + Fluentd + Kibana:

... but what other information relevant to the customer can be processed this way? Peter Portante, who has been working intensively on data aggregation, has already answered this question. He explains that we need Logs, but also Metrics (or Telemetry, as he describes it) and Configuration to perform a full correlation of all data.
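As a rough sketch of what correlating these three kinds of data could look like, the example below takes one log entry as an anchor and pulls in the metrics and configuration changes recorded close to it in time. It is an assumption-laden illustration rather than the approach Peter describes: the `Event` type, the `correlate` function, the one-minute window and every sample event are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    kind: str      # "log", "metric" or "config"
    source: str    # hypothetical host or component name
    detail: str


def correlate(events: List[Event], anchor: Event, window: timedelta) -> List[Event]:
    """Return events of other kinds within `window` of the anchor, oldest first."""
    nearby = [
        e for e in events
        if e is not anchor
        and e.kind != anchor.kind
        and abs(e.timestamp - anchor.timestamp) <= window
    ]
    return sorted(nearby, key=lambda e: e.timestamp)


# Made-up sample events, for illustration only.
t0 = datetime(2017, 2, 6, 10, 0, 0)
events = [
    Event(t0, "config", "compute-7", "network bonding configuration changed"),
    Event(t0 + timedelta(seconds=20), "metric", "compute-7", "packet loss 35%"),
    Event(t0 + timedelta(seconds=30), "log", "app-1", "request timeout"),
    Event(t0 + timedelta(hours=1), "metric", "compute-3", "cpu 40%"),
]

anchor = events[2]  # the application log entry we are investigating
related = correlate(events, anchor, timedelta(minutes=1))
for e in related:
    print(e.timestamp.isoformat(), e.kind, e.source, e.detail)
```

A fixed time window is the simplest possible correlation rule. It ignores topology (which host runs which container), so a real solution would need more context than timestamps alone.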

So what’s next?

Next week in part two, the series continues with a deeper look at how we are performing a fuller correlation of all the available data.