DOC-933 Document new consumer group lag metrics and configs #1014

Merged
merged 12 commits into from
Mar 19, 2025
Changes from 6 commits
16 changes: 16 additions & 0 deletions modules/get-started/pages/whats-new.adoc
Expand Up @@ -39,6 +39,22 @@ Redpanda now supports normalization of Protobuf schemas in the Schema Registry.

You can now configure Kafka clients to authenticate using xref:manage:security/authentication.adoc#enable-sasl[SASL/PLAIN] with a single account using the same username and password. Unlike SASL/SCRAM, which uses a challenge-response exchange with hashed credentials, SASL/PLAIN transmits plaintext passwords. You enable SASL/PLAIN by appending `PLAIN` to the list of SASL mechanisms.

== New metrics

The following metrics are new in this version:

=== Consumer lag gauges

Redpanda now exposes dedicated consumer lag gauges that are computed internally by the broker. These metrics eliminate the need for complex calculations based on offset metrics and work with observability platforms that lack advanced query support.

- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_max[`redpanda_kafka_consumer_group_lag_max`]:
A gauge that reports the maximum lag observed among all partitions for a consumer group. This metric shows how far behind the group's most delayed partition is, which can indicate a bottleneck.

- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_sum[`redpanda_kafka_consumer_group_lag_sum`]:
A gauge that aggregates the lag across all partitions, providing an overall view of data consumption delay for the consumer group.

See xref:manage:monitoring.adoc#consumers[Monitor consumer group lag] for more information.

== New cluster properties

The following cluster properties are new in this version:
Expand Down
121 changes: 117 additions & 4 deletions modules/manage/partials/monitor-health.adoc
Expand Up @@ -209,13 +209,126 @@ Leaderless partitions can be caused by unresponsive brokers. When an alert on `r

Redpanda's Raft implementation exchanges periodic status RPCs between a broker and its peers. The xref:reference:public-metrics-reference.adoc#redpanda_node_status_rpcs_timed_out[`redpanda_node_status_rpcs_timed_out`] gauge increases when a status RPC times out for a peer, which indicates that a peer may be unresponsive and may lead to problems with partition replication that Raft manages. Monitor for non-zero values of this gauge, and correlate it with any logged errors or changes in partition replication.

=== Consumers
[[consumers]]
=== Consumer group lag

==== Consumer group lag
Consumer group lag measures the difference between the broker's latest (max) offset and the consumer group's last committed offset. It indicates how fresh the consumed data is: a higher lag means consumers are falling further behind producers.

When working with Kafka consumer groups, the consumer group lag—the difference between the broker's latest (max) offset and the group's last committed offset—is a performance indicator of how fresh the data being consumed is. While higher lag for archival consumers is expected, high lag for real-time consumers could indicate that the consumers are overloaded and thus may need their topics to be partitioned more, or to spread the load to more consumers.
While a higher lag may be acceptable for archival consumers, a high lag for real-time consumers suggests that they might be overloaded. In such cases, you may need to repartition topics or increase the number of consumers to distribute the load more evenly.

To monitor consumer group lag, create a query with the xref:reference:public-metrics-reference.adoc#redpanda_kafka_max_offset[`redpanda_kafka_max_offset`] and xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_committed_offset[`redpanda_kafka_consumer_group_committed_offset`] gauges:
Redpanda provides the following methods for monitoring consumer group lag:

- <<dedicated-gauges, Dedicated gauges>>: Redpanda can internally calculate consumer group lag and expose it as two dedicated gauges. These gauges simplify monitoring by providing direct lag metrics without the need for complex calculations. This method is recommended for environments where observability platforms may not support complex queries. However, enabling these gauges may add extra processing overhead to the broker.
- <<offset-based-calculation, Offset-based calculation>>: You can use your observability platform to calculate consumer group lag. Use this method if your observability platform supports functions, such as `max()`, and you prefer to avoid additional processing overhead on the broker.

==== Dedicated gauges

Redpanda can internally calculate consumer group lag and expose it as two dedicated gauges.

- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_max[`redpanda_kafka_consumer_group_lag_max`]:
A gauge that reports the maximum lag observed among all partitions for a consumer group. This metric shows how far behind the group's most delayed partition is, which can indicate a bottleneck.

- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_lag_sum[`redpanda_kafka_consumer_group_lag_sum`]:
A gauge that aggregates the lag across all partitions, providing an overall view of data consumption delay for the consumer group.

To enable these dedicated gauges, you must enable consumer group metrics in your cluster properties. Add the following settings to your Redpanda configuration:

- xref:reference:properties/cluster-properties.adoc#enable_consumer_group_metrics[`enable_consumer_group_metrics`]: A list of consumer group metrics to enable. You must include the `consumer_lag` value to enable consumer group lag metrics.
- xref:reference:properties/cluster-properties.adoc#consumer_group_lag_collection_interval_sec[`consumer_group_lag_collection_interval_sec`] (optional): The interval in seconds for collecting consumer group lag metrics. The default is 60 seconds.

For example:

ifndef::env-kubernetes[]
[,bash]
----
rpk cluster config set enable_consumer_group_metrics '["group", "partition", "consumer_lag"]'
----
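
The optional collection interval can be set with the same `rpk cluster config set` command. The following is a minimal sketch rather than part of the Redpanda tooling: `set_lag_interval` is a hypothetical wrapper name, and the positive-integer check is an assumption added here because the property reference lists a much wider accepted range.

```bash
# Hypothetical helper: set consumer_group_lag_collection_interval_sec,
# rejecting anything that is not a positive whole number of seconds.
set_lag_interval() {
  local seconds="$1"
  case "$seconds" in
    ''|*[!0-9]*)
      echo "interval must be a positive integer (seconds)" >&2
      return 1
      ;;
  esac
  if [ "$seconds" -eq 0 ]; then
    echo "interval must be greater than zero" >&2
    return 1
  fi
  # Apply the value cluster-wide.
  rpk cluster config set consumer_group_lag_collection_interval_sec "$seconds"
}
```

For example, `set_lag_interval 30` would ask Redpanda to collect lag every 30 seconds instead of the default 60. A shorter interval gives fresher gauges at the cost of more frequent collection work on the broker.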
endif::[]

ifdef::env-kubernetes[]
[tabs]
======
Helm + Operator::
+
--
.`redpanda-cluster.yaml`
[,yaml]
----
apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec:
    config:
      cluster:
        enable_consumer_group_metrics:
          - group
          - partition
          - consumer_lag
----

```bash
kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
```

--
Helm::
+
--
[tabs]
====
--values::
+
.`enable-consumer-metrics.yaml`
[,yaml]
----
config:
  cluster:
    enable_consumer_group_metrics:
      - group
      - partition
      - consumer_lag
----
+
```bash
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --values enable-consumer-metrics.yaml --reuse-values
```

--set::
+
[,bash]
----
helm upgrade --install redpanda redpanda/redpanda \
  --namespace <namespace> \
  --create-namespace \
  --set config.cluster.enable_consumer_group_metrics[0]=group \
  --set config.cluster.enable_consumer_group_metrics[1]=partition \
  --set config.cluster.enable_consumer_group_metrics[2]=consumer_lag
----

====
--
======
endif::[]

Enabling `consumer_lag` may add extra processing overhead to the broker, especially in environments with a high number of consumer groups or partitions.
The increased frequency of metric collection could result in higher resource utilization. Monitor the broker's resource usage after enabling these properties to ensure that the broker can handle the additional load.

When `enable_consumer_group_metrics` includes `consumer_lag`, Redpanda computes and exposes the `redpanda_kafka_consumer_group_lag_max` and `redpanda_kafka_consumer_group_lag_sum` gauges on the public metrics endpoint.
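
To spot-check the gauges without an observability platform, you can filter the public metrics output directly. The following is a rough sketch, not an official tool: `lag_over_threshold` is a hypothetical function name, and it assumes each sample line ends with the metric value (no trailing timestamp).

```bash
# Hypothetical helper: read Prometheus text format on stdin and print
# "<group> <lag>" for every redpanda_kafka_consumer_group_lag_max sample
# whose value is greater than the given threshold.
lag_over_threshold() {
  local threshold="$1"
  awk -v t="$threshold" '
    /^redpanda_kafka_consumer_group_lag_max[{]/ {
      # Extract the value of the redpanda_group label.
      if (match($0, /redpanda_group="[^"]*"/)) {
        group = substr($0, RSTART + 16, RLENGTH - 17)
        # The last field is assumed to be the sample value.
        if ($NF + 0 > t) print group, $NF
      }
    }'
}
```

Assuming the Admin API listens on `localhost:9644`, a call might look like `curl -s http://localhost:9644/public_metrics | lag_over_threshold 1000`. Because the gauges carry a `shard` label, a group may appear on more than one line.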

==== Offset-based calculation

If your environment is sensitive to the performance overhead of the <<dedicated-gauges, dedicated gauges>>, use the offset-based calculation method to calculate consumer group lag. This method requires your observability platform to support functions like `max()`.

Redpanda provides two metrics to calculate consumer group lag:

- xref:reference:public-metrics-reference.adoc#redpanda_kafka_max_offset[`redpanda_kafka_max_offset`]: The broker's latest offset for a partition.
- xref:reference:public-metrics-reference.adoc#redpanda_kafka_consumer_group_committed_offset[`redpanda_kafka_consumer_group_committed_offset`]: The last committed offset for a consumer group on that partition.
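
To make the arithmetic concrete, the sketch below reproduces the calculation outside an observability platform: for each partition it subtracts the group's committed offset from the latest offset, then reports the sum and the maximum per group. It is an illustration only. `group_lag_from_offsets` is a hypothetical name, the script matches partitions on the `redpanda_topic` and `redpanda_partition` labels and ignores the namespace label, and it assumes each sample line ends with the value.

```bash
# Hypothetical helper: read Prometheus text format on stdin and print
# "<group> sum=<n> max=<n>" computed from the two offset gauges.
group_lag_from_offsets() {
  awk '
    # Return the value of label "name" in a sample line, or "" if absent.
    function label(line, name,    re) {
      re = name "=\"[^\"]*\""
      if (!match(line, re)) return ""
      return substr(line, RSTART + length(name) + 2, RLENGTH - length(name) - 3)
    }
    # Keep the highest latest offset seen for each topic/partition.
    /^redpanda_kafka_max_offset[{]/ {
      k = label($0, "redpanda_topic") SUBSEP label($0, "redpanda_partition")
      v = $NF + 0
      if (!(k in latest) || v > latest[k]) latest[k] = v
    }
    # Keep the highest committed offset seen for each group/topic/partition.
    /^redpanda_kafka_consumer_group_committed_offset[{]/ {
      k = label($0, "redpanda_group") SUBSEP label($0, "redpanda_topic") SUBSEP label($0, "redpanda_partition")
      v = $NF + 0
      if (!(k in committed) || v > committed[k]) committed[k] = v
    }
    END {
      for (k in committed) {
        split(k, part, SUBSEP)
        tp = part[2] SUBSEP part[3]
        if (!(tp in latest)) continue
        lag = latest[tp] - committed[k]
        if (lag < 0) lag = 0
        sum[part[1]] += lag
        if (lag > peak[part[1]]) peak[part[1]] = lag
      }
      for (g in sum) printf "%s sum=%d max=%d\n", g, sum[g], peak[g]
    }'
}
```

With a latest offset of 1500 and a committed offset of 1200 on one partition, and 800 and 750 on another, the group's lag is 300 and 50, so the sum is 350 and the maximum is 300. Taking the highest value per key mirrors the `max()` aggregation that the query-based method relies on.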

For example, here's a typical query to compute consumer lag:

[,promql]
----
Expand Down
83 changes: 83 additions & 0 deletions modules/reference/pages/properties/cluster-properties.adoc
Expand Up @@ -414,6 +414,24 @@ This is an internal-only configuration and should be enabled only after consulti

---

=== consumer_group_lag_collection_interval_sec

How often to run the collection loop when <<enable_consumer_group_metrics,`enable_consumer_group_metrics`>> contains `consumer_lag`.

*Unit:* seconds

*Requires restart:* No

*Visibility:* `tunable`

*Type:* integer

*Accepted values:* [`-17179869184`, `17179869183`]

Review comment (Member): These bounds are wild, maybe I should make it bounded.


*Default:* `60`

---

=== controller_backend_housekeeping_interval_ms

Interval between iterations of controller backend housekeeping loop.
Expand Down Expand Up @@ -812,6 +830,54 @@ Maximum amount of time the coordinator waits to snapshot after a command appears

---

=== datalake_scheduler_block_size_bytes

Size, in bytes, of each memory block reserved for record translation, as tracked by the datalake scheduler.

*Unit:* bytes

*Requires restart:* Yes

*Visibility:* `tunable`

*Type:* integer

*Default:* `4194304` (4 MiB)

---

=== datalake_scheduler_max_concurrent_translations

The maximum number of translations that the datalake scheduler will allow to run at a given time. If a translation is requested, but the number of running translations exceeds this value, the request will be put to sleep temporarily, polling until capacity becomes available.

*Requires restart:* Yes

*Visibility:* `tunable`

*Type:* integer

*Default:* `4`

---

=== datalake_scheduler_time_slice_ms

Time, in milliseconds, for a datalake translation as scheduled by the datalake scheduler. After a translation is scheduled, it will run until either the time specified has elapsed or all pending records on its source partition have been translated.

*Unit:* milliseconds

*Requires restart:* Yes

*Visibility:* `tunable`

*Type:* integer

*Accepted values:* [`-17592186044416`, `17592186044415`]

*Default:* `30000`

---

=== debug_bundle_auto_removal_seconds

If set, how long debug bundles are kept in the debug bundle storage directory after they are created. If not set, debug bundles are kept indefinitely.
Expand Down Expand Up @@ -1063,6 +1129,9 @@ Enables cluster metadata uploads. Required for xref:manage:whole-cluster-restore

List of enabled consumer group metrics. Accepted values: `group`, `partition`, `consumer_lag`.

Enabling `consumer_lag` may add extra processing overhead to the broker, especially in environments with a high number of consumer groups or partitions.
The increased frequency of metric collection could result in higher resource utilization. See xref:manage:monitoring.adoc#consumers[Consumer group lag] for more information.

*Requires restart:* No

*Visibility:* `user`
Expand Down Expand Up @@ -1712,6 +1781,20 @@ Default value for the `redpanda.iceberg.delete` topic property that determines i

---

=== iceberg_disable_automatic_snapshot_expiry

Whether to disable automatic Iceberg snapshot expiry. This property may be useful if the Iceberg catalog expects to perform snapshot expiry on its own.

*Requires restart:* No

*Visibility:* `user`

*Type:* boolean

*Default:* `false`

---

=== iceberg_disable_snapshot_tagging

Whether to disable tagging of Iceberg snapshots. These tags are used to ensure that the snapshots that Redpanda writes are retained during snapshot removal, which in turn, helps Redpanda ensure exactly-once delivery of records. Disabling tags is therefore not recommended, but may be useful if the Iceberg catalog does not support tags.
Expand Down
61 changes: 53 additions & 8 deletions modules/reference/pages/public-metrics-reference.adoc
Expand Up @@ -867,11 +867,10 @@ Committed offset for a consumer group, segmented by topic and partition.

*Labels*:

* `group`

* `topic`

* `partition`
- `redpanda_group`
- `redpanda_partition`
- `redpanda_topic`
- `shard`

---

Expand All @@ -883,7 +882,52 @@ Number of active consumers within a consumer group.

*Labels*:

* `group`
- `redpanda_group`
- `shard`

---

=== redpanda_kafka_consumer_group_lag_max

Maximum consumer group lag across topic partitions. This metric shows how far behind a group's most delayed partition is, which makes it useful for identifying the most delayed consumer group.

To enable this metric, you must enable consumer group metrics in your cluster properties. Add the following settings to your Redpanda configuration:

- xref:reference:properties/cluster-properties.adoc#enable_consumer_group_metrics[`enable_consumer_group_metrics`]: A list of consumer group metrics to enable. You must include the `consumer_lag` value to enable consumer group lag metrics.
- xref:reference:properties/cluster-properties.adoc#consumer_group_lag_collection_interval_sec[`consumer_group_lag_collection_interval_sec`] (optional): The interval in seconds for collecting consumer group lag metrics. The default is 60 seconds.

*Type*: gauge

*Labels*:

- `redpanda_group`
- `shard`

*Related topics*:

- xref:manage:monitoring.adoc#consumers[Consumer group lag]

---

=== redpanda_kafka_consumer_group_lag_sum

Sum of consumer group lag for all topic partitions. This metric is useful for tracking the total lag across all partitions.

To enable this metric, you must enable consumer group metrics in your cluster properties. Add the following settings to your Redpanda configuration:

- xref:reference:properties/cluster-properties.adoc#enable_consumer_group_metrics[`enable_consumer_group_metrics`]: A list of consumer group metrics to enable. You must include the `consumer_lag` value to enable consumer group lag metrics.
- xref:reference:properties/cluster-properties.adoc#consumer_group_lag_collection_interval_sec[`consumer_group_lag_collection_interval_sec`] (optional): The interval in seconds for collecting consumer group lag metrics. The default is 60 seconds.

*Type*: gauge

*Labels*:

- `redpanda_group`
- `shard`

*Related topics*:

- xref:manage:monitoring.adoc#consumers[Consumer group lag]

---

Expand All @@ -895,7 +939,8 @@ Number of topics being consumed by a consumer group.

*Labels*:

* `group`
- `redpanda_group`
- `shard`

== REST proxy metrics

Expand Down Expand Up @@ -1004,7 +1049,7 @@ Build information for Redpanda, including the revision and version details.

---

=== redpanda_application_fips_mode
=== redpanda_application_fips_mode

Indicates whether Redpanda is running in FIPS mode. Possible values:

Expand Down
11 changes: 10 additions & 1 deletion tests/docker-compose/bootstrap.yml
Expand Up @@ -41,4 +41,13 @@ cloud_storage_bucket: redpanda
cloud_storage_segment_max_upload_interval_sec: 60
# Continuous Data Balancing (enterprise feature) continuously monitors your node and rack availability and disk usage. This enables self-healing clusters that dynamically balance partitions, ensuring smooth operations and optimal cluster performance.
# https://docs.redpanda.com/current/manage/cluster-maintenance/continuous-data-balancing/
partition_autobalancing_mode: continuous
partition_autobalancing_mode: continuous
# Enable Redpanda to collect consumer group metrics.
# https://docs.redpanda.com/current/reference/properties/cluster-properties/#enable_consumer_group_metrics
enable_consumer_group_metrics:
- "group"
- "partition"
- "consumer_lag"
# Lower the interval for the quickstart
# https://docs.redpanda.com/current/reference/properties/cluster-properties/#consumer_group_lag_collection_interval_sec
consumer_group_lag_collection_interval_sec: 5
19 changes: 19 additions & 0 deletions tests/docker-compose/docker-compose.yml
Expand Up @@ -295,6 +295,25 @@ services:
      redpanda-0:
        condition: service_healthy
  ####################
  # rpk container to deploy a consumer group #
  # See https://docs.redpanda.com/current/reference/rpk/rpk-topic/rpk-topic-consume/
  ####################
  consumergroup:
    command:
      - topic
      - consume
      - transactions
      - --group
      - transactions-consumer
      - -X user=superuser
      - -X pass=secretpassword
      - -X brokers=redpanda-0:9092
    image: docker.redpanda.com/redpandadata/${REDPANDA_DOCKER_REPO:-redpanda}:${REDPANDA_VERSION:-latest}
    networks:
      - redpanda_network
    depends_on:
      - deploytransform
  ####################
  # rpk container to deploy the pre-built data transform #
  # See https://docs.redpanda.com/current/develop/data-transforms/deploy/
  ####################
Expand Down