If you haven't come across it yet, KubeCon + CloudNativeCon is the flagship conference of the Cloud Native Computing Foundation (CNCF), gathering adopters and technologists from leading open source and cloud native communities. The event brings together the entire cloud native ecosystem for education, collaboration, and networking.
Each year the conference hosts several co-located events, and we're excited to be presenting at one of them: Cloud Native AI & Kubeflow Day, which focuses on the intersection of cloud native technologies and artificial intelligence workloads.
Below are the details of our session as of this writing.
On Monday, 23 March in Amsterdam, Ryan and I will take the stage to demonstrate real-world troubleshooting of LLM-powered applications in the following session:
From Hallucinations to Hardware: Diagnosing LLM Failures
Generative AI apps can hallucinate—or fail—at the worst possible times. In this live demo session, we'll interact with an LLM-powered application designed to surface both entertaining hallucinations and real-world GPU performance issues. Using open source tools like Prometheus, OTel, NVIDIA DCGM, and OpenInference, we'll troubleshoot problems in real time and trace them from user experience down to infrastructure. See how observability gives engineers and SREs the visibility they need to keep AI systems reliable.
Time: 15:20-15:45
This is a hands-on, live demonstration session where we'll be working with a real LLM-powered application. We've intentionally designed it to showcase both the amusing side of AI hallucinations and the serious infrastructure challenges that can impact production systems. The session will walk through the complete troubleshooting journey, from identifying user-facing issues all the way down to GPU performance bottlenecks.
We'll be using a stack of open source observability tools: Prometheus for metrics collection, OpenTelemetry for distributed tracing, NVIDIA DCGM for GPU telemetry, and OpenInference for LLM-specific observability. Together they cover every layer from the user's request down to the GPU, which is the visibility needed to diagnose and resolve issues in modern AI infrastructure.
For engineers and SREs working with AI systems, this session will demonstrate practical approaches to maintaining reliability in production environments. The live demo format means you'll see real troubleshooting workflows in action rather than theoretical examples.
Be sure to check the Cloud Native AI & Kubeflow Day schedule for the room and for any changes to the time listed above. We hope to see you in Amsterdam!