Eric D. Schabell: Telemetry Pipelines Workshop - Understanding Backpressure with Fluent Bit

Monday, May 6, 2024

Telemetry Pipelines Workshop - Understanding Backpressure with Fluent Bit

Are you ready to get started with cloud native observability using telemetry pipelines?

This article is part of a series exploring a workshop guiding you through the open source project Fluent Bit, what it is, a basic installation, and setting up a first telemetry pipeline project. Learn how to manage your cloud native data from source to destination using the telemetry pipeline phases covering collection, aggregation, transformation, and forwarding from any source to any destination. 

In the previous article in this series, we explored using the filtering phase to modify events, even based on conditions in those events. In this article we dig into what backpressure is, how it manifests in our telemetry pipelines, and take first steps to mitigate it with Fluent Bit.

You can find more details in the accompanying workshop lab.

Let's get started with this use case.

Before we get started, it's important to review the phases of a telemetry pipeline. Each incoming event goes from input to parser to filter to buffer to routing before it is sent to its final output destination(s).

For clarity in this article, we'll split up the configuration into files that are imported into a main Fluent Bit configuration file we'll name workshop-fb.conf.

The backpressure problem

The purpose of our telemetry pipelines is to collect events, parse, optionally filter, optionally buffer, route, and deliver them to predefined destinations. Fluent Bit is set up by default to put events into memory, but what happens if that memory is not able to hold the flow of events coming into the pipeline?

This problem is known as backpressure, and it leads to high memory consumption in the Fluent Bit service. Other causes include network failures, latency, or unresponsive third-party services, all of which result in delays or a failure to process data fast enough while we continue to receive new incoming data. In high-load environments with backpressure, there's a risk of memory usage growing until the hosting operating system terminates the Fluent Bit process. This is known as an Out of Memory (OOM) error.

Let’s configure an example pipeline and make it run in a constrained environment, causing backpressure and ending with the container failing with an OOM error.

In this example, we are going to cause a catastrophic failure of our Fluent Bit pipeline. All examples in this lab are shown using containers (Podman), and it is assumed you are familiar with container tooling such as Podman or Docker.

We begin the configuration of our telemetry pipeline in the INPUT phase with a simple dummy plugin that generates a large number of entries to flood our pipeline, as follows in our configuration file inputs.conf:

# This entry generates a large amount of success messages for the workshop.
[INPUT]
  Name   dummy
  Tag    big.data
  Copies 15000
  Dummy  {"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah blah blah blah blah blah"}
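
To get a feel for how much data this input produces, a rough back-of-the-envelope estimate can help. The sketch below is not part of the workshop. It assumes the dummy plugin emits one batch of Copies records per second (its default rate), and it uses the length of the JSON text as a stand-in for the record size, which only approximates what Fluent Bit actually holds in memory:

```shell
#!/usr/bin/env bash
# Rough estimate of the raw JSON volume generated by the dummy input above.
# Assumption: one batch of $copies records per second (dummy plugin default rate).
# Assumption: JSON text length approximates the in-memory record size.
record='{"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah blah blah blah blah blah"}'
copies=15000

record_bytes=${#record}                    # characters in the JSON text (ASCII, so one byte each)
batch_bytes=$(( record_bytes * copies ))   # bytes offered to the pipeline per batch
batch_kb=$(( batch_bytes / 1024 ))         # same figure in KB, rounded down

echo "record: ${record_bytes} bytes, batch: ${batch_bytes} bytes (~${batch_kb} KB)"
```

Even as an approximation, this shows that each batch is large compared to the few megabytes of memory we work with later in this article.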

Now ensure the output configuration file outputs.conf has the following configuration:

# This entry directs all tags (it matches any we encounter)
# to print to standard output, which is our console.
[OUTPUT]
  Name  stdout
  Match *

With our inputs and outputs configured, we can now bring them together in a single main configuration file. Using a file called workshop-fb.conf in our favorite editor, ensure the following configuration is created, for now just importing two files:

# Fluent Bit main configuration file.
#
# Imports section.
@INCLUDE inputs.conf
@INCLUDE outputs.conf

Let's now test our configuration by running it using a container image. The first thing we need is a file called Buildfile, which is used to build a new container image and insert our configuration files. Note that this file needs to be in the same directory as our configuration files, otherwise adjust the file path names:

FROM cr.fluentbit.io/fluent/fluent-bit:3.0.4

COPY ./workshop-fb.conf /fluent-bit/etc/fluent-bit.conf
COPY ./inputs.conf /fluent-bit/etc/inputs.conf
COPY ./outputs.conf /fluent-bit/etc/outputs.conf

Now we'll build a new container image, naming it with a version tag, using the Buildfile as follows and assuming you are in the same directory:

$ podman build -t workshop-fb:v6 -f Buildfile

STEP 1/4: FROM cr.fluentbit.io/fluent/fluent-bit:3.0.4
STEP 2/4: COPY ./workshop-fb.conf /fluent-bit/etc/fluent-bit.conf
--> a379e7611210
STEP 3/4: COPY ./inputs.conf /fluent-bit/etc/inputs.conf
--> f39b10d3d6d0
STEP 4/4: COPY ./outputs.conf /fluent-bit/etc/outputs.conf
COMMIT workshop-fb:v6
--> e74b2f228729
Successfully tagged localhost/workshop-fb:v6
e74b2f22872958a79c0e056efce66a811c93f43da641a2efaa30cacceb94a195

Now we'll run our new container image as follows:

$ podman run workshop-fb:v6

The console output should look something like this, noting that we've cut out the ASCII logo at start up. This runs until we exit with CTRL+C, but before we do that we need to get some information about the memory settings so we can create an OOM experience.

...
[2024/04/16 10:14:32] [ info] [input:dummy:dummy.0] initializing
[2024/04/16 10:14:32] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2024/04/16 10:14:32] [ info] [sp] stream processor started
[2024/04/16 10:14:32] [ info] [output:stdout:stdout.0] worker #0 started
[0] big.data: [[1713262473.231406588, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[1] big.data: [[1713262473.232578175, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[2] big.data: [[1713262473.232581509, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[3] big.data: [[1713262473.232583009, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[4] big.data: [[1713262473.232584217, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[5] big.data: [[1713262473.232585425, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[6] big.data: [[1713262473.232586550, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[7] big.data: [[1713262473.232587967, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[8] big.data: [[1713262473.232589134, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[9] big.data: [[1713262473.232590425, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
...

If we leave this pipeline running, we can explore the container stats using the following commands:

# Determine the running container id.
$ podman container list

# Use the container id you found above.
$ podman stats CONTAINER_ID

ID              MEM USAGE /  LIMIT
a9a25abc042a    8.851MB   /  3.798GB
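
By default, podman stats keeps refreshing until interrupted. For a single reading, a small wrapper such as the sketch below may be convenient. The function name mem_snapshot is hypothetical and not part of the workshop, and the sketch assumes podman is installed and the container is still running:

```shell
# Hypothetical helper: print one memory usage snapshot for a container, then return.
# --no-stream takes a single sample instead of refreshing continuously.
# --format selects only the container id and memory usage columns.
mem_snapshot() {
  podman stats --no-stream --format '{{.ID}}  {{.MemUsage}}' "$1"
}

# Usage (replace CONTAINER_ID with the id found via 'podman container list'):
#   mem_snapshot CONTAINER_ID
```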

Now we have the information needed to run a backpressure simulation by running our pipeline in a container configured with constrained memory. In this case we need to give it a limit of around 8.5MB. We'll then see the pipeline run for a bit before failing due to overloading (OOM), using the following command:

$ podman run --memory 8.5MB workshop-fb:v6

The console output from our running container shows that the pipeline ran for a bit, in our case below to event number 1124 before it hit the OOM limits of our container environment (8.5MB). 

...
[1120] big.data: [[1713263931.234450786, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[1121] big.data: [[1713263931.234453828, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[1122] big.data: [[1713263931.234454411, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[1123] big.data: [[1713263931.234454953, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[1124] big.data: [[1713263931.234455536, {}],
...

We can validate that the simulation worked by inspecting the container. Below we locate the container id of our last container that failed and then inspect it for an OOM failure to validate that our backpressure simulation worked. The following commands show that the kernel killed our container due to an OOM error:

# List containers, even those not running, to find container id.
$ podman container list -a

# Use the container id you found above.
$ podman inspect CONTAINER_ID | grep OOM

"OOMKilled": true,

What we've seen is that when a channel floods with too many events to process, our pipeline instance fails. From that point onwards we are now unable to collect, process, or deliver any more events.
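
For scripted checks, the OOM inspection shown earlier in this section can be wrapped in a small helper. This is a minimal sketch, not part of the workshop: the function name was_oom_killed is hypothetical, and it assumes podman is installed and that the container id refers to an existing (possibly stopped) container:

```shell
# Hypothetical helper: succeeds (exit status 0) only if the given container
# was terminated by the kernel's OOM killer.
# --format extracts the single State.OOMKilled field instead of grepping the full JSON.
was_oom_killed() {
  [ "$(podman inspect --format '{{.State.OOMKilled}}' "$1")" = "true" ]
}

# Usage (replace CONTAINER_ID with the id found via 'podman container list -a'):
#   was_oom_killed CONTAINER_ID && echo "container was OOM killed"
```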

First try fixing backpressure

Our first try at fixing this problem is to ensure that our input plugin is not flooded with more events than it can handle. We can prevent this backpressure scenario from happening by setting memory limits on the input plugin. This means setting the configuration property mem_buf_limit, which limits the amount of memory the input plugin can use to buffer events. Let's give it a try.

The configuration of our telemetry pipeline in the INPUT phase needs a slight adjustment by adding mem_buf_limit as shown, set to 2MB to ensure we hit that limit on ingesting events:

# This entry generates a large amount of success messages for the workshop.
[INPUT]
  Name   dummy
  Tag    big.data
  Copies 15000
  Dummy  {"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah blah blah blah blah blah"}
  Mem_Buf_Limit 2MB

Now we'll build a new container image with the new version tag v7, using the same Buildfile and the same podman build command as before (only the tag changes, to workshop-fb:v7), and assuming you are in the same directory. Then we run the new container image:

$ podman run workshop-fb:v7

The console output should look something like this, noting that we've cut out the ASCII logo at start up. This runs until we exit with CTRL+C, but before we do that we see that after a certain amount of time the input plugin is paused once its memory buffer goes over the limit, and it resumes once the buffer has been cleared. The pause is visible in the output below:

...
[747] big.data: [[1713273456.230644421, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[748] big.data: [[1713273456.230645505, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[749] big.data: [[1713273456.230646588, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[2024/04/16 13:17:37] [ warn] [input] dummy.0 paused (mem buf over limit)
[2024/04/16 13:17:37] [ info] [input] pausing dummy.0
[750] big.data: [[1713273456.230647630, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[751] big.data: [[1713273456.230648671, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
[752] big.data: [[1713273456.230649713, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah blah"}]
...

As you can imagine, this memory buffer limit is not quite the solution to the backpressure issues we are dealing with. While it does prevent the pipeline container from failing completely due to high memory usage, because it pauses ingesting new records, it also potentially loses data during those pauses while the input plugin clears its buffers. Once the buffers are cleared, the ingestion of new records resumes. In the next article, we'll see how to achieve both data safety and memory safety by configuring a better buffering solution with Fluent Bit.
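
To make the pause and resume behavior described above more concrete, here is a toy model of a memory-limited input buffer. It is only a sketch for illustration and is not how Fluent Bit is implemented: the incoming and drain values are hypothetical, the limit approximates 2MB as 2,000,000 bytes, and the model checks the limit only after adding a whole batch, so the buffer can overshoot it:

```shell
#!/usr/bin/env bash
# Toy model of an input with a memory buffer limit. NOT Fluent Bit's implementation.
limit=2000000      # bytes, roughly mirrors Mem_Buf_Limit 2MB
incoming=1575000   # hypothetical bytes offered to the input per tick
drain=1000000      # hypothetical bytes flushed to the output per tick
buffered=0         # bytes currently held in the buffer
paused=0           # 1 while the input is paused
skipped=0          # bytes offered while paused (whether they are lost depends on the source)

for tick in 1 2 3 4 5 6; do
  # Resume once the buffer has drained below the limit.
  if [ "$paused" -eq 1 ] && [ "$buffered" -lt "$limit" ]; then
    paused=0
  fi
  # Ingest only while not paused; otherwise count what was not ingested.
  if [ "$paused" -eq 0 ]; then
    buffered=$(( buffered + incoming ))
  else
    skipped=$(( skipped + incoming ))
  fi
  # Pause as soon as the buffer reaches the limit.
  if [ "$buffered" -ge "$limit" ]; then
    paused=1
  fi
  echo "tick=$tick buffered=$buffered paused=$paused skipped=$skipped"
  # The output flushes part of the buffer before the next tick.
  buffered=$(( buffered > drain ? buffered - drain : 0 ))
done
```

In this model, memory stays bounded because ingestion stops while the input is paused, but anything offered during a pause is never ingested. That is the trade-off between memory safety and data safety described above.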

This completes our use cases for this article. Be sure to explore this hands-on experience with the accompanying workshop lab.

What's next?

This article walked us through how backpressure affects our telemetry pipelines and how a possible first solution is not quite the answer we needed. In the next article, we'll explore a data and memory safe solution to the problem provided by Fluent Bit.

Stay tuned for more hands-on material to help you with your cloud native observability journey.
