DOC-933 Document new consumer group lag metrics and configs #1014
Changes from 9 commits
6e33f3d
0e8daf2
2dd3fb3
4179b60
b4a7a58
53e4592
0f4c90a
76b9fb9
9828fce
e3d69e7
74c441b
24e87e5
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -209,13 +209,132 @@ Leaderless partitions can be caused by unresponsive brokers. When an alert on `r | |
|
||
Redpanda's Raft implementation exchanges periodic status RPCs between a broker and its peers. The xref:reference:public-metrics-reference.adoc#redpanda_node_status_rpcs_timed_out[`redpanda_node_status_rpcs_timed_out`] gauge increases when a status RPC times out for a peer, which indicates that a peer may be unresponsive and may lead to problems with partition replication that Raft manages. Monitor for non-zero values of this gauge, and correlate it with any logged errors or changes in partition replication. | ||
|
||
=== Consumers | ||
[[consumers]] | ||
=== Consumer group lag | ||
|
||
==== Consumer group lag | ||
Consumer group lag is an important performance indicator that measures the difference between the broker's latest (max) offset and the consumer group's last committed offset. The lag indicates how current the consumed data is relative to real-time production. A high or increasing lag means that consumers are processing messages slower than producers are generating them. A decreasing or stable lag implies that consumers are keeping pace with producers, ensuring real-time or near-real-time data consumption. | ||
|
||
When working with Kafka consumer groups, the consumer group lag—the difference between the broker's latest (max) offset and the group's last committed offset—is a performance indicator of how fresh the data being consumed is. While higher lag for archival consumers is expected, high lag for real-time consumers could indicate that the consumers are overloaded and thus may need their topics to be partitioned more, or to spread the load to more consumers. | ||
By monitoring consumer lag, you can identify performance bottlenecks and make informed decisions about scaling consumers, tuning configurations, and improving processing efficiency. | ||
|
||
To monitor consumer group lag, create a query with the xref:reference:public-metrics-reference.adoc#redpanda_kafka_max_offset[`redpanda_kafka_max_offset`] and xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_committed_offset[`redpanda_kafka_consumer_group_committed_offset`] gauges: | ||
A high maximum lag may indicate that a consumer is experiencing connectivity problems or cannot keep up with the incoming workload. | ||
|
||
A high or increasing total lag (lag sum) suggests that the consumer group lacks sufficient resources to process messages at the rate they are produced. In such cases, scaling the number of consumers within the group can help, but only up to the number of partitions available in the topic. If lag persists despite increasing consumers, repartitioning the topic may be necessary to distribute the workload more effectively and improve processing efficiency. | ||
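The scaling ceiling follows from how consumer groups divide work: within a group, each partition is read by at most one member, so members beyond the partition count have nothing to consume. The following is a minimal Python sketch of that limit. The `assign_partitions` helper is hypothetical and uses a plain round-robin scheme; it is not a Redpanda or Kafka client API, and real assignors are more sophisticated.

```python
def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Hypothetical round-robin assignment: each partition goes to exactly one consumer."""
    assignment: dict[str, list[int]] = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def idle_consumers(assignment: dict[str, list[int]]) -> list[str]:
    """Consumers that received no partitions and therefore do no work."""
    return [name for name, partitions in assignment.items() if not partitions]


# Four partitions shared by six consumers: the last two consumers sit idle.
assignment = assign_partitions(4, [f"consumer-{i}" for i in range(6)])
print(idle_consumers(assignment))
```

In this sketch, adding a seventh consumer would only lengthen the idle list, which is why persistent lag at that point calls for more partitions rather than more consumers.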
|
||
Redpanda provides the following methods for monitoring consumer group lag: | ||
|
||
- <<dedicated-gauges, Dedicated gauges>>: Redpanda can internally calculate consumer group lag and expose it as two dedicated gauges. These gauges simplify monitoring by providing direct lag metrics without the need for complex calculations. This method is recommended for environments where observability platforms may not support complex queries. However, enabling these gauges may add extra processing overhead to the broker. | ||
- <<offset-based-calculation, Offset-based calculation>>: You can use your observability platform to calculate consumer group lag. Use this method if your observability platform supports functions, such as `max()`, and you prefer to avoid additional processing overhead on the broker. | ||
|
||
==== Dedicated gauges | ||
|
||
Redpanda can internally calculate consumer group lag and expose it as two dedicated gauges. | ||
|
||
- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_max[`redpanda_kafka_consumer_group_lag_max`]: | ||
A gauge that reports the maximum lag observed among all partitions for a consumer group. This metric helps pinpoint the partition with the greatest delay, indicating potential bottlenecks. | ||
|
||
- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_sum[`redpanda_kafka_consumer_group_lag_sum`]: | ||
A gauge that aggregates the lag across all partitions, providing an overall view of data consumption delay for the consumer group. | ||
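To make the relationship between the two gauges concrete, here is a small Python sketch that derives a maximum and a sum from per-partition offsets, following the definition of lag given earlier. It illustrates the arithmetic only and is not Redpanda's internal implementation. The function name, the clamping of negative values to zero, and the skipping of partitions without a committed offset are assumptions made for this example.

```python
def summarize_group_lag(
    max_offsets: dict[int, int],
    committed_offsets: dict[int, int],
) -> tuple[int, int]:
    """Return (lag_max, lag_sum) for one consumer group on one topic.

    Both arguments map partition ID -> offset. Partitions without a committed
    offset are skipped, which is a simplifying assumption of this sketch.
    """
    lags = [
        max(max_offsets[partition] - committed, 0)  # clamp at zero (assumption)
        for partition, committed in committed_offsets.items()
        if partition in max_offsets
    ]
    if not lags:
        return 0, 0
    return max(lags), sum(lags)


# Partition 1 is far behind, so it dominates both values.
lag_max, lag_sum = summarize_group_lag(
    max_offsets={0: 1000, 1: 1500, 2: 900},
    committed_offsets={0: 990, 1: 1100, 2: 900},
)
print(lag_max, lag_sum)
```

A large gap between the maximum and the average per-partition lag points at a single slow partition, while a large sum with a modest maximum suggests the whole group is under-provisioned.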
|
||
To enable these dedicated gauges, you must enable consumer group metrics in your cluster properties. Add the following settings to your Redpanda configuration: | ||
|
||
- xref:reference:properties/cluster-properties.adoc#enable_consumer_group_metrics[`enable_consumer_group_metrics`]: A list of properties to enable for consumer group metrics. You must add the `consumer_lag` property to enable consumer group lag metrics. | ||
- xref:reference:properties/cluster-properties.adoc#consumer_group_lag_collection_interval_sec[`consumer_group_lag_collection_interval_sec`] (optional): The interval in seconds for collecting consumer group lag metrics. The default is 60 seconds. | ||
|
||
+ | ||
Set this value equal to the scrape interval of your metrics collection system. Aligning these intervals ensures synchronized data collection, reducing the likelihood of missing or misaligned lag measurements. | ||
|
||
For example: | ||
|
||
ifndef::env-kubernetes[] | ||
[,bash] | ||
---- | ||
rpk cluster config set enable_consumer_group_metrics '["group", "partition", "consumer_lag"]' | ||
---- | ||
endif::[] | ||
|
||
ifdef::env-kubernetes[] | ||
[tabs] | ||
====== | ||
Helm + Operator:: | ||
+ | ||
-- | ||
.`redpanda-cluster.yaml` | ||
[,yaml] | ||
---- | ||
apiVersion: cluster.redpanda.com/v1alpha2 | ||
kind: Redpanda | ||
metadata: | ||
name: redpanda | ||
spec: | ||
chartRef: {} | ||
clusterSpec: | ||
config: | ||
cluster: | ||
enable_consumer_group_metrics: | ||
- group | ||
- partition | ||
- consumer_lag | ||
---- | ||
|
||
```bash | ||
kubectl apply -f redpanda-cluster.yaml --namespace <namespace> | ||
``` | ||
|
||
-- | ||
Helm:: | ||
+ | ||
-- | ||
[tabs] | ||
==== | ||
--values:: | ||
+ | ||
.`enable-consumer-metrics.yaml` | ||
[,yaml] | ||
---- | ||
config: | ||
cluster: | ||
enable_consumer_group_metrics: | ||
- group | ||
- partition | ||
- consumer_lag | ||
---- | ||
+ | ||
```bash | ||
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \ | ||
--values enable-consumer-metrics.yaml --reuse-values | ||
``` | ||
|
||
--set:: | ||
+ | ||
[,bash] | ||
---- | ||
helm upgrade --install redpanda redpanda/redpanda \ | ||
--namespace <namespace> \ | ||
--create-namespace \ | ||
--set "config.cluster.enable_consumer_group_metrics[0]=group" \ | ||
--set "config.cluster.enable_consumer_group_metrics[1]=partition" \ | ||
--set "config.cluster.enable_consumer_group_metrics[2]=consumer_lag" | ||
---- | ||
|
||
==== | ||
-- | ||
====== | ||
endif::[] | ||
|
||
Enabling `consumer_lag` may add extra processing overhead to the broker, especially in environments with a high number of consumer groups or partitions. | ||
The lower the value of `consumer_group_lag_collection_interval_sec`, the higher the frequency of metric collection, which could result in higher resource utilization. Monitor the broker's resource usage after enabling these properties to ensure that the broker can handle the additional load. | ||
Review comment: This feels a little overstated. The overhead should be pretty minimal, and not likely to be any more than using an external tool like Burrow.
Reply: In that case, should we recommend that users enable it? Why would users choose not to? Is this not enabled by default for some reason?
||
|
||
When these properties are enabled, Redpanda computes and exposes the `redpanda_kafka_consumer_group_lag_max` and `redpanda_kafka_consumer_group_lag_sum` gauges in the public metrics endpoint. | ||
|
||
==== Offset-based calculation | ||
|
||
If your environment is sensitive to the performance overhead of the <<dedicated-gauges, dedicated gauges>>, use the offset-based calculation method to calculate consumer group lag. This method requires your observability platform to support functions like `max()`. | ||
|
||
Redpanda provides two metrics to calculate consumer group lag: | ||
|
||
- xref:reference:public-metrics-reference.adoc#redpanda_kafka_max_offset[`redpanda_kafka_max_offset`]: The broker's latest offset for a partition. | ||
|
||
- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_committed_offset[`redpanda_kafka_consumer_group_committed_offset`]: The last committed offset for a consumer group on that partition. | ||
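The role of `max()` in this method can be sketched in Python. The example assumes that more than one series may report a latest offset for the same partition (for instance, one per broker hosting a replica), so the highest value is taken before subtracting the committed offset. The label keys `group`, `topic`, and `partition` are placeholders chosen for this sketch and are not the actual label names exposed by Redpanda; check the metrics reference for those.

```python
from collections import defaultdict

# A sample is (labels, value), loosely modeled on a scraped metric series.
Sample = tuple[dict[str, str], float]


def partition_lag(
    max_offset_samples: list[Sample],
    committed_samples: list[Sample],
) -> dict[tuple[str, str, str], float]:
    """Return lag keyed by (group, topic, partition)."""
    # Step 1: max() across all series that describe the same partition.
    latest: dict[tuple[str, str], float] = defaultdict(float)
    for labels, value in max_offset_samples:
        key = (labels["topic"], labels["partition"])
        latest[key] = max(latest[key], value)

    # Step 2: subtract each group's committed offset from that maximum.
    lag: dict[tuple[str, str, str], float] = {}
    for labels, value in committed_samples:
        key = (labels["topic"], labels["partition"])
        if key in latest:
            lag[(labels["group"], *key)] = latest[key] - value
    return lag


max_offsets = [
    ({"topic": "orders", "partition": "0"}, 100.0),
    ({"topic": "orders", "partition": "0"}, 120.0),  # second series, same partition
    ({"topic": "orders", "partition": "1"}, 50.0),
]
committed = [
    ({"group": "billing", "topic": "orders", "partition": "0"}, 90.0),
    ({"group": "billing", "topic": "orders", "partition": "1"}, 50.0),
]
print(partition_lag(max_offsets, committed))
```

An observability platform performs the same two steps in its query language: an aggregation over the latest-offset series, followed by a subtraction matched on topic and partition.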
|
||
|
||
For example, here's a typical query to compute consumer lag: | ||
|
||
[,promql] | ||
---- | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -414,6 +414,26 @@ This is an internal-only configuration and should be enabled only after consulti | |
|
||
--- | ||
|
||
=== consumer_group_lag_collection_interval_sec | ||
|
||
How often to run the collection loop when <<enable_consumer_group_metrics,`enable_consumer_group_metrics`>> contains `consumer_lag`. | ||
|
||
The lower this value, the higher the frequency of metric collection, which could result in higher resource utilization on the broker. Monitor the broker's resource usage after changing this value to ensure that the broker can handle the additional load. | ||
|
||
*Unit:* seconds | ||
|
||
*Requires restart:* No | ||
|
||
*Visibility:* `tunable` | ||
|
||
*Type:* integer | ||
|
||
*Accepted values:* [`-17179869184`, `17179869183`] | ||
Review comment: These bounds are wild; maybe I should make it bounded.
||
|
||
*Default:* `60` | ||
|
||
--- | ||
|
||
=== controller_backend_housekeeping_interval_ms | ||
|
||
Interval between iterations of controller backend housekeeping loop. | ||
|
@@ -812,6 +832,54 @@ Maximum amount of time the coordinator waits to snapshot after a command appears | |
|
||
--- | ||
|
||
=== datalake_scheduler_block_size_bytes | ||
|
||
Size, in bytes, of each memory block reserved for record translation, as tracked by the datalake scheduler. | ||
|
||
*Unit:* bytes | ||
|
||
*Requires restart:* Yes | ||
|
||
*Visibility:* `tunable` | ||
|
||
*Type:* integer | ||
|
||
*Default:* `4_mib` | ||
|
||
--- | ||
|
||
=== datalake_scheduler_max_concurrent_translations | ||
|
||
The maximum number of translations that the datalake scheduler will allow to run at a given time. If a translation is requested, but the number of running translations exceeds this value, the request will be put to sleep temporarily, polling until capacity becomes available. | ||
|
||
*Requires restart:* Yes | ||
|
||
*Visibility:* `tunable` | ||
|
||
*Type:* integer | ||
|
||
*Default:* `4` | ||
|
||
--- | ||
|
||
=== datalake_scheduler_time_slice_ms | ||
|
||
Time, in milliseconds, for a datalake translation as scheduled by the datalake scheduler. After a translation is scheduled, it will run until either the time specified has elapsed or all pending records on its source partition have been translated. | ||
|
||
*Unit:* milliseconds | ||
|
||
*Requires restart:* Yes | ||
|
||
*Visibility:* `tunable` | ||
|
||
*Type:* integer | ||
|
||
*Accepted values:* [`-17592186044416`, `17592186044415`] | ||
|
||
*Default:* `30000` | ||
|
||
--- | ||
|
||
=== debug_bundle_auto_removal_seconds | ||
|
||
If set, how long debug bundles are kept in the debug bundle storage directory after they are created. If not set, debug bundles are kept indefinitely. | ||
|
@@ -1063,6 +1131,8 @@ Enables cluster metadata uploads. Required for xref:manage:whole-cluster-restore | |
|
||
List of enabled consumer group metrics. Accepted values: `group`, `partition`, `consumer_lag`. | ||
|
||
Enabling `consumer_lag` may add extra processing overhead to the broker, especially in environments with a high number of consumer groups or partitions. See xref:manage:monitoring.adoc#consumers[Consumer group lag] for more information. | ||
|
||
|
||
*Requires restart:* No | ||
|
||
*Visibility:* `user` | ||
|
@@ -1712,6 +1782,20 @@ Default value for the `redpanda.iceberg.delete` topic property that determines i | |
|
||
--- | ||
|
||
=== iceberg_disable_automatic_snapshot_expiry | ||
|
||
Whether to disable automatic Iceberg snapshot expiry. This property may be useful if the Iceberg catalog expects to perform snapshot expiry on its own. | ||
|
||
*Requires restart:* No | ||
|
||
*Visibility:* `user` | ||
|
||
*Type:* boolean | ||
|
||
*Default:* `false` | ||
|
||
--- | ||
|
||
=== iceberg_disable_snapshot_tagging | ||
|
||
Whether to disable tagging of Iceberg snapshots. These tags are used to ensure that the snapshots that Redpanda writes are retained during snapshot removal, which in turn, helps Redpanda ensure exactly-once delivery of records. Disabling tags is therefore not recommended, but may be useful if the Iceberg catalog does not support tags. | ||
|