Telemetry rework: Moving towards a Prometheus-first model #3663

timwoj · 2024-03-27T17:43:00Z

I've been working for a few months on a big rework of the telemetry/metrics system inside of Zeek. It's been through a few iterations, and I'm at a point where I have some unanswerable questions so I'm bringing them up in a ticket. The current design looks like this:

Everything is based on the prometheus-cpp library underneath. I originally tried to use opentelemetry but ran into a bunch of shortcomings with their library and documentation and decided to move away from it.
All of the existing metrics code moves out of broker/CAF into a framework in Zeek itself. That includes all of the script-land code and options.
The built-in aggregation of metrics to the manager node no longer exists. Each node now presents an endpoint for requesting metrics data for Prometheus. There is now a service discovery endpoint on the manager node. Prometheus can be configured to use that information to automatically request data from all of the nodes, instead of having to manually configure each one. A new metrics_port field is added to the cluster node record that allows setting the port used by the node for metrics output.
A lot of the internal data we used to store about various metrics is now removed in favor of using whatever prometheus-cpp provides to us via their classes. The reasoning behind this is that we want to be able to pass the prometheus-cpp registry to external projects directly (broker/CAF being the main one) without them having to know anything about Zeek's framework itself. This lets those projects add metrics and Zeek will treat them the same as if we're add them ourselves. Our classes basically become thin wrappers around those objects, mostly providing the OpaqueVal interface to script-land plus a few other niceties.

Comments/questions above:

(2) This is a breaking change. All of the script-land options that were in the Broker module were removed, including the option change handlers, etc.
(3) This will add a management burden to users. Primarily, they need to be able to plumb whatever ports they configured for each node so that Prometheus can talk to it.
(4) This is where the majority of my questions lie:

prometheus-cpp doesn’t keep prefix and names separately. It just stores one full, concatenated name with prefix/name/unit/total all together. This means the prefix field in the RecordVal can’t be filled in, and the name field will contain more than just the name itself. Based on the documentation from Prometheus, a metric should contain a prefix but not must contain one. I can make the assumption that all metrics in Zeek will have a prefix, scrape the first word in the name from before the first underscore, and use that as a prefix value, but that could break down if some external library does not include a prefix on one of their metrics. See the telemetry log policy for why this is a problem. It does some filtering based on the prefix values returned from Zeek.
prometheus-cpp doesn’t keep unit information separate, so we can’t add that to the RecordVal. It's not possible to scrape this out of the full name because units may contain underscores, so it's impossible to know what set of words in a name are the unit. This page shows some samples of units, plus general naming rules for the other fields.
prometheus-cpp doesn’t keep is_sum information separate, but we can interpret a metric name ending with _total to mean the same thing.
prometheus-cpp only stores doubles, so I have to do a little bit of finagling with external metrics to set the right type in the RecordVal.
prometheus-cpp doesn’t store label keys in the family, so I have to store them separately because we have other code in Zeek that validates whether the labels on a instrument match the labels on a family. For external metrics, we use the label names from the first metric in a family, making an assumption that the set of labels will always be the same for each.

The text was updated successfully, but these errors were encountered:

timwoj · 2024-03-27T19:42:40Z

Tagging @Neverlord since you might have thoughts about all of this.

timwoj · 2024-04-09T17:58:51Z

This work is mostly done at this point, absent a few things:

Answers to the questions above
Reworking broker to use prometheus-cpp (the linked issue above)
Looking into whether the commented-out unit tests in telemetry/Manager.cc are still needed or relevant

timwoj · 2024-04-09T20:59:04Z

Another question: We have an option called Telemetry::metric_export_prefixes that could be used to prevent certain metrics from being exported to the aggregation to master. The prefix bit notwithstanding, does it make sense to keep this option now that we're just sending everything from all of the nodes to Prometheus?

timwoj · 2024-04-22T19:08:31Z

@ckreibich and I discussed this at length in a call. Here are the decisions made:

Using the word before the first underscore as the prefix is acceptable.
Fix the scripts and unit tests that use the Telemetry::MetricOpts$unit field to not print them. There's already a comment that it won't get filled in by Telemetry::collect_metrics or Telemetry::collect_histogram_metrics, so there's no point in printing out a blank field. We don't remove it entirely, because it's used in the construction of new metrics. The reasoning for it not being filled in is that it would break with third-party (e.g. broker/caf) metrics, and it doesn't seem useful for consumers at all. The alternative would be to make it only available for Zeek metrics and fill it with something like <unknown> otherwise, but that doesn't make it any more useful. If people want to know the units, they can look at the full metric name.
For label names, add our own validation for third-party metrics and throw an error if it fails.
Remove the metric_export_prefixes option entirely. We're moving to just pushing everything to Prometheus and letting filtering happen at that level. Without our own aggregation, there's no use for this option anymore.
Open a PR ahead of the broker/caf changes.

timwoj self-assigned this Mar 27, 2024

timwoj mentioned this issue Apr 8, 2024

Convert Broker to use prometheus-cpp registry from Zeek zeek/broker#400

Open

timwoj linked a pull request Apr 25, 2024 that will close this issue

Telemetry Rework #3701

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Telemetry rework: Moving towards a Prometheus-first model #3663

Telemetry rework: Moving towards a Prometheus-first model #3663

timwoj commented Mar 27, 2024 •

edited

timwoj commented Mar 27, 2024

timwoj commented Apr 9, 2024

timwoj commented Apr 9, 2024

timwoj commented Apr 22, 2024 •

edited

Telemetry rework: Moving towards a Prometheus-first model #3663

Telemetry rework: Moving towards a Prometheus-first model #3663

Comments

timwoj commented Mar 27, 2024 • edited

timwoj commented Mar 27, 2024

timwoj commented Apr 9, 2024

timwoj commented Apr 9, 2024

timwoj commented Apr 22, 2024 • edited

timwoj commented Mar 27, 2024 •

edited

timwoj commented Apr 22, 2024 •

edited