Eric D. Schabell: PromCon EU 2023 - Observability Recap in Berlin

Monday, October 2, 2023

PromCon EU 2023 - Observability Recap in Berlin

As previously mentioned, last week I was on site at the PromCon EU 2023 event for two days in Berlin, Germany. 

This is a community organized event focused on the technology and implementations around the open source Prometheus project, including for example PromQL and PromLens.

Below you'll find an overview covering insights into the talks given, often with a short recap if you don't want to browse the details. Along with the talks, the discussions and chats during the breaks were invaluable, as that is where you can connect with core maintainers of various aspects of the Prometheus project.

Be sure to keep an eye on the event video playlist, as all sessions were recorded and will appear here.

Let's dive right in and see what the event had to offer this year in Berlin.

This overview will be my impressions of each day of the event, but not all the sessions will be covered. Let's start with a short overview of the insights taken after sessions, chats, and the social event:

  • OpenTelemetry interoperability in all flavors is the hot topic of the year.
  • Native Histograms were a big topic the last two years. This year they showed a lot of promise here and there, but were not a big topic in the talks.
  • Perses dashboard and visualization project presented their Alpha release as a truly open source project under the Apache 2.0 license.
  • By my count there were ~150 attendees, and all talks and lightning talks were live streamed and will also be made available on their YouTube channel post event.

Day 1

The day started with a lovely walk through the center of Berlin and to the venue located on the Spree river. The event opened and jumped right into the following series of talks (insights provided inline):

What's New in Prometheus and Its Ecosystem

Summary - The history of Prometheus was presented (born at SoundCloud in Berlin), new team members, the basic architecture, and a final overview of new and upcoming features.

What’s new in the last 6 months:

  • Native Histograms - efficiency and more details.
  • Documentation note on “...native histograms (added as an experimental feature in Prometheus v2.40). Once native histograms are closer to becoming a stable feature, this document will be thoroughly updated.”
  • stringlabels - storing labels differently for significant memory reduction.
  • keep_firing_for field added to alerting rules - how long an alert will continue firing after the condition that triggered it has cleared.
  • scrape_config_files - split Prometheus scrape configs into multiple files, avoiding the need for one big config file.
  • OTLP receiver (v2.47) - experimental support for receiving OTLP metrics.
  • SNMP Exporter (v0.24) - breaking changes - new configuration format. Splits connection settings from metrics details, making them simpler to change. Also added the ability to query multiple modules in one scrape.
  • MySQLd Exporter (v0.15) - multi-targets support, use single exporter to monitor multiple MySQL-alike servers.
  • Java client (v1.0.0) - client_java with OpenTelemetry metrics and tracing support, Native Histograms.
  • Alertmanager - new receivers: MS Teams, Discord, Webex.
  • Windows Exporter - now an official exporter, was delayed due to licensing but in final stages now.
  • Every Tuesday the Prometheus team meets for a Bug Scrub at 11:00 UTC.
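
To make the keep_firing_for entry in the list above more concrete, here is a minimal Python sketch of the idea. It is a simplified model and not Prometheus code: the function name, the boolean evaluation list, and the boundary handling are assumptions for illustration, and the separate `for` clause of alerting rules is ignored.

```python
def firing_states(condition, keep_firing_for, interval=1):
    """Return whether the alert fires at each rule evaluation.

    condition: list of booleans, one per evaluation (True = condition holds).
    keep_firing_for: seconds to keep firing after the condition clears.
    interval: seconds between evaluations (assumed constant).
    """
    states = []
    firing = False
    cleared_at = None  # time of the first evaluation where the condition was gone
    for i, active in enumerate(condition):
        now = i * interval
        if active:
            firing = True
            cleared_at = None
        elif firing:
            if cleared_at is None:
                cleared_at = now
            # Stop firing once the grace period has fully elapsed.
            if now - cleared_at >= keep_firing_for:
                firing = False
                cleared_at = None
        states.append(firing)
    return states


flapping = [True, False, False, False, True, False]
with_grace = firing_states(flapping, keep_firing_for=2)
without_grace = firing_states(flapping, keep_firing_for=0)
```

With a grace period, the short gaps in a flapping condition no longer resolve and re-fire the alert on every evaluation, which is the kind of situation this field is meant to smooth over.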

What’s coming:

  • New Alertmanager UI
  • Metadata Improvements
  • Exemplar Improvements
  • Remote Write v2

Perses: The CNCF candidate for observability visualization

Summary - announcement of the Alpha launch of the Perses dashboard and visualization project: GitOps compatible, purpose-built for observability data, and a truly open source alternative under the Apache 2.0 license.

Perses was born from the CNCF landscape missing visualization tooling projects:

  • Perses - an exploration toward a standard dashboard format.
  • Chronosphere, Red Hat and Amadeus are listed as founding members.
  • GitOps friendly, static validation, Kubernetes support, and you can use Perses binary in your development environment.
  • Chronosphere supported its development and Red Hat is integrating the Perses package into the OpenShift Console. 

  • There is exploration of its usage with Prometheus / PromLens.

  • Currently it only displays metrics, but Red Hat has ongoing work to integrate tracing with OpenTelemetry.
  • Logs are on the future wishlist.
  • Feature details presented for the development of dashboards.
  • Includes Grafana migration tooling.

I was chatting with core maintainer Augustin Husson after the talk and they are interested in submitting Perses as an applicant for the CNCF Sandbox status.

Towards making Prometheus OpenTelemetry native

Summary - OpenTelemetry protocol (OTLP) support in Prometheus for metrics ingestion is experimental.

Details on the effort:

  • OTLP ingestion is there experimentally.

  • The experience with target_info is a big pain point at the moment

  • Takes about half the bandwidth of remote write, 30-40% more CPU due to gzip
  • New Arrow based OTLP protocol promises half the bandwidth again at half the CPU cost, may inspire Prometheus remote write 2.0
  • Thinking about using collector remote config to solve "split configuration" between Prometheus server and OpenTelemetry clients.

Planet scale monitoring: Handling billions of active series with Prometheus and Thanos

Summary - Shopify states they are running “highly scalable globally distributed and highly dynamic” cloud infrastructure, so they are on “Planet Scale” with Prometheus.

Details on the effort: 

  • Huge Ruby shop, latency sensitive, large scaling events around the retail cycle and flash sales
  • HPA struggles with scaling up quickly enough
  • using statsd to get around Ruby/Python/PHP specific limitations on shared counters
  • backend is Thanos based but have added a lot on top of it (custom work)
  • have a custom operator to scale Prometheus agents by scraping the targets and seeing how many time series they have (including redistribution)
  • have a router layer on top of Thanos to decouple ingestion and storage, sounds like they're evolving into a Mimir-like setup
  • split the query layer into two deployments: one for short term queries and one for longer term queries
  • team and service centric UI for alerting, integrated with SLO tracking
  • native histograms solved cardinality challenges and combined with Thanos' distributed querier to make very high cardinality queries work, as they stated, "this changed the game for us."
  • when migrating from previous observability vendor, they decided not to convert dashboards, instead worked with developers to build new cleaner ones

  • developers are not scoping queries well, so most fan out to all regional stores, but performance on empty responses is satisfactory so it's not a big issue
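
The custom operator above reportedly scales Prometheus agents based on how many time series each target exposes. As a rough illustration of that kind of redistribution, here is a Python sketch that spreads targets across agents with a greedy least-loaded strategy. This is not Shopify's operator: the function name, the input shape, and the choice of a greedy heuristic are all assumptions.

```python
import heapq


def assign_targets(series_per_target, num_agents):
    """Greedily assign scrape targets to agents, balancing series counts.

    series_per_target: dict of target name -> number of active series.
    Returns a dict of agent index -> list of target names.
    """
    # Min-heap of (current series load, agent index).
    heap = [(0, agent) for agent in range(num_agents)]
    heapq.heapify(heap)
    assignment = {agent: [] for agent in range(num_agents)}

    # Place the largest targets first; each goes to the least-loaded agent.
    by_size = sorted(series_per_target.items(), key=lambda kv: kv[1], reverse=True)
    for target, series in by_size:
        load, agent = heapq.heappop(heap)
        assignment[agent].append(target)
        heapq.heappush(heap, (load + series, agent))
    return assignment


example = assign_targets({"a": 500, "b": 300, "c": 200, "d": 100}, num_agents=2)
```

In this example the two agents end up with 600 and 500 series. A real operator would also need to decide when to add or remove agents and how to limit churn when targets move.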

Lightning talks

Summary - Always fun to end the day with a quick series of talks collected ad hoc from the attendees. Below is a list of the ones I thought were interesting, with a short summary should you want to find them in the recordings:

  • AlertManager UI

    • Alertmanager will get a new UI in React. Elm didn't get traction as a common language. Considering alternatives to Bootstrap.

  • Implementing integrals with Prometheus and Grafana

    • integrals in PromQL: inverse of rates, Pure-PromQL version of the delta counter we do. Using sum_over_time and Grafana variables to simplify getting all the right factors.

  • Metrics have a DX Problem
    • looking at how to do developer focused metrics from the IDE using the autometrics-dev project on GitHub. A framework for instrumenting by function, with IDE integration to explore prod metrics. Interesting idea to integrate this deeply.
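
The integrals lightning talk mentioned using sum_over_time to get the inverse of a rate. One plausible reading is the rectangle rule: sum the samples in a range and multiply by the scrape interval. The Python sketch below shows only that arithmetic; the function name and the power readings are made up for illustration and do not come from the talk.

```python
def integrate(samples, scrape_interval_seconds):
    """Approximate the time integral of evenly spaced gauge samples.

    Mirrors the idea of sum_over_time(metric[range]) * scrape_interval:
    each sample is assumed to hold for one full scrape interval.
    """
    return sum(samples) * scrape_interval_seconds


# Hypothetical power readings in watts, scraped every 15 seconds.
power_watts = [100, 100, 200, 200]
energy_joules = integrate(power_watts, 15)
energy_watt_hours = energy_joules / 3600
```

The result is only as accurate as the assumption of a constant scrape interval, which is presumably where the Grafana variables mentioned in the talk help to get the factors right.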

Day 2

After the morning walk through the center of Berlin, day two provided us with some interesting material (insights provided inline):

Taming the Tsunami: low latency ingestion of push-based metrics in Prometheus

Summary - overview of the metrics story at Shopify, with close to 1k teams running it.

  • originally forwarding metrics "from observability vendor agent"
  • issues because that was multiplying the cardinality across exporter instances, same with sidecar model
  • built a statsd protocol aware load balancer
  • running as a sidecar also had ownership issues, stated as, "we would be on call for every application"
  • daemonset deployment meant resource usage and hot-spotting concerns, also cardinality but at a lower level
  • didn't want per instance metrics because of cardinality and metrics are more domain level
  • roughly one exporter per 50-100 nodes
  • load balancer sanitizes label values and drops labels
  • do pre aggregation on short time scales to deal with "hot loop instrumentation", resulted in roughly 20x reduction in bandwidth use
  • compensating for lack of per instance metrics by looking at infrastructure metrics (KSM, cAdvisor)
  • "we have close to a thousand teams right now"
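
The pre-aggregation step in the list above can be pictured with a small Python sketch: counter increments that arrive within a short window are summed per metric name before being forwarded. This is an illustration under assumptions, not Shopify's load balancer. It handles only plain `name:value|c` statsd counter lines, ignores sample rates and tags, and passes every other line through untouched.

```python
from collections import defaultdict


def pre_aggregate(lines):
    """Collapse statsd counter lines from one short window into one line per name."""
    totals = defaultdict(float)
    passthrough = []
    for line in lines:
        name, _, rest = line.partition(":")
        value, _, metric_type = rest.partition("|")
        if metric_type == "c":
            # Plain counter: add to the running total for this window.
            totals[name] += float(value)
        else:
            # Timers, gauges, tagged or sampled lines are forwarded as-is.
            passthrough.append(line)
    aggregated = [f"{name}:{total:g}|c" for name, total in totals.items()]
    return aggregated + passthrough


window = [
    "checkout.hits:1|c",
    "checkout.hits:1|c",
    "cart.adds:2|c",
    "checkout.hits:1|c",
    "latency:12|ms",
]
forwarded = pre_aggregate(window)
```

Here five incoming lines become three outgoing ones. Code that increments a counter inside a hot loop produces far more duplicates per window, which is where a large bandwidth reduction would come from.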

Prometheus Java Client 1.0.0

Summary - v1.0.0 was released last week, and this talk was an overview of some of the updates, featuring native histograms and OpenTelemetry support.

  • rewrote the underlying model, so there are breaking changes, with a migration module for Prom simpleclient metrics.
  • almost as simple as updating the imports in your Java app to use, going to update my workshop Java example for instrumentation to the new API.
  • exposes native + classic histograms by default, scraper's choice
  • a lot more configuration available as Java properties
  • callback metrics (this is great for writing exporters)
  • OTel push support (on a configurable interval)
  • allows standard OTel names (with dots), automatically replaces dots with underscores for Prometheus format
  • integrates with OTel tracing client to make exemplars work - picks exemplars from tracing context, extends tracing context to mark that trace to not get sampled away
  • despite supporting OTel, this is still a performance minded client library
  • all metric types support concurrent updates
  • dropped pushgateway support for now, but will port it forward
  • once JMX exporter is updated, as a side effect you can update
  • not aiming to become a full OTel library, only future proofing your instrumentation, more lightweight & perf focused
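
The dots-to-underscores bullet above is easy to picture with a tiny Python sketch. It is not the Java client's implementation: the function name is hypothetical, and the rule of replacing every character outside `[a-zA-Z0-9_:]` is an assumption that is broader than just handling dots.

```python
import re


def to_prometheus_name(otel_name):
    """Replace characters that are not valid in a classic Prometheus metric name."""
    return re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)


converted = to_prometheus_name("http.server.duration")
```

The instrumentation keeps the dotted OTel-style name in code, and only the exposed Prometheus text format sees the underscore version.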

Lightning talks

Summary - Again a list of lightning talks I thought were interesting from the final day and a short summary should you want to find them in the recordings:

  • Tracking object storage costs
    • trying to measure Object Storage costs as they are the number 2 cost in their cloud bills. Built a Prometheus Price Exporter!
    • object storage cost is ~half of Grafana's cloud bill, varies by customer (can be as low as 2%)
    • trick for extending sparse metrics with zeroes: or on() vector(0)
    • they have a prices exporter in the works, promised to open source it

  • Prom operator - what’s next?
    • tour of some more features coming in the Prometheus Operator: shard autoscaling, scrape classes, support for Kubernetes events, and prometheus-agent deployment as a DaemonSet
  • Prometheus adoption stats
    • 868k users in 2023 (up from 774k last year), based on Grafana instances which have at least one Prometheus data source enabled
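
The `or on() vector(0)` trick from the object storage talk is typically appended to an aggregated query, for example `sum(rate(some_metric[5m])) or on() vector(0)`, where `some_metric` is a placeholder name. The Python sketch below models what happens at a single evaluation timestamp. It is a simplified illustration of the PromQL set semantics, not an evaluator, and the dictionary representation is an assumption.

```python
def or_on_vector_zero(left):
    """Model `left or on() vector(0)` at one evaluation timestamp.

    left: dict mapping a label tuple to a sample value.
    With on() nothing is used for matching, so any sample on the left
    suppresses the zero; the zero only appears when the left side is empty.
    """
    if left:
        return dict(left)
    return {(): 0.0}


gap = or_on_vector_zero({})
present = or_on_vector_zero({(("bucket", "a"),): 5.0})
```

Note that the filled-in sample carries no labels, so the trick fits best when the left side is already aggregated down to a single series.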

Final impressions of this event left me for the second straight year with the feeling that the attendees are both passionate and knowledgeable about the metrics monitoring tooling around the Prometheus ecosystem. This event did not really have getting started sessions and most of this assumes you are coming for in depth dives into the various elements of the Prometheus project, almost giving you glimpses into the research progress behind features being improved in the coming versions of Prometheus. 

It remains well worth your time if you are active in the monitoring world. Even if you are not using open source or Prometheus, you will gain insights into the status of features in the monitoring world.