[CORE-9523] Consumer Group Lag: Set empty shard label #25383

Merged

Conversation

BenPope
Member

@BenPope BenPope commented Mar 14, 2025

As currently implemented, the metrics carry the shard label, which can add to cardinality over time because the group coordinator can move between shards.

Override the shard label so that seastar does not set it, preventing a cardinality increase when the group coordinator moves shard.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@BenPope BenPope requested a review from a team March 14, 2025 13:14
@BenPope BenPope self-assigned this Mar 14, 2025
@BenPope BenPope requested review from aanthony-rp and removed request for a team March 14, 2025 13:14
@vbotbuildovich
Collaborator

vbotbuildovich commented Mar 14, 2025

CI test results

test results on build#63138
test_id test_kind job_url test_status passed
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_migrated_topic_data_integrity.transfer_leadership=True.params=.cancellation.dir.in.stage.preparing.use_alias.True ducktape https://buildkite.com/redpanda/redpanda/builds/63138#0195950d-d561-4310-848f-7ca77928078e FLAKY 1/2
rptest.tests.datalake.datalake_upgrade_test.DatalakeUpgradeTest.test_upload_through_upgrade.cloud_storage_type=CloudStorageType.S3.query_engine=QueryEngineType.SPARK ducktape https://buildkite.com/redpanda/redpanda/builds/63138#01959511-7b71-4eb4-9b2f-433e41a15b79 FLAKY 1/2
rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off.test_mode=TestMode.TIRED_STORAGE.cleanup_policy=compact.delete ducktape https://buildkite.com/redpanda/redpanda/builds/63138#01959511-7b6f-470a-ab23-61e854afd89b FLAKY 1/2
rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_relaxed_acks.write_caching=False ducktape https://buildkite.com/redpanda/redpanda/builds/63138#01959511-7b6f-4aa7-8fa1-eb2df340833e FLAKY 1/2
test results on build#63218
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/63218#0195a3a2-4ddc-46de-a45f-5b2f6a456aac FLAKY 1/2
rptest.tests.consumer_group_balancing_test.ConsumerGroupBalancingTest.test_coordinator_nodes_balance ducktape https://buildkite.com/redpanda/redpanda/builds/63218#0195a3a2-4ddb-4433-b438-f28230b8d23d FLAKY 1/2
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_migrated_topic_data_integrity.transfer_leadership=False.params=.cancellation.dir.in.stage.preparing.use_alias.True ducktape https://buildkite.com/redpanda/redpanda/builds/63218#0195a39e-7b35-4524-b67c-f8f073860e00 FLAKY 1/2
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_migrated_topic_data_integrity.transfer_leadership=True.params=.cancellation.dir.in.stage.preparing.use_alias.True ducktape https://buildkite.com/redpanda/redpanda/builds/63218#0195a39e-7b36-496c-b759-5dc5dd1cb728 FLAKY 1/3
rptest.tests.datalake.datalake_upgrade_test.DatalakeUpgradeTest.test_upload_through_upgrade.cloud_storage_type=CloudStorageType.S3.query_engine=QueryEngineType.SPARK ducktape https://buildkite.com/redpanda/redpanda/builds/63218#0195a3a2-4ddc-41c1-b4e8-2b357d899fe0 FLAKY 1/3
rptest.tests.datalake.transactions_test.DatalakeTransactionTests.test_with_transactions.cloud_storage_type=CloudStorageType.S3.compaction=False ducktape https://buildkite.com/redpanda/redpanda/builds/63218#0195a3a2-4ddb-4d08-90da-4a50a04916c1 FLAKY 1/2
test results on build#63365
test_id test_kind job_url test_status passed
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_migrated_topic_data_integrity.transfer_leadership=True.params=.cancellation.dir.in.stage.preparing.use_alias.False ducktape https://buildkite.com/redpanda/redpanda/builds/63365#0195afc4-f776-4204-880a-624b2bd6c3cc FLAKY 1/5
rptest.tests.datalake.custom_partitioning_test.DatalakeCustomPartitioningTest.test_spec_evolution.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP ducktape https://buildkite.com/redpanda/redpanda/builds/63365#0195afc4-f775-40d1-9b35-74e4204bda2d FLAKY 1/2
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.NESSIE ducktape https://buildkite.com/redpanda/redpanda/builds/63365#0195afc4-f775-40d1-9b35-74e4204bda2d FLAKY 1/2
rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_normalize.dataset_type=JSON ducktape https://buildkite.com/redpanda/redpanda/builds/63365#0195afb2-3709-428f-afdc-6117f84041a3 FLAKY 1/5
rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_normalize.dataset_type=JSON ducktape https://buildkite.com/redpanda/redpanda/builds/63365#0195afc4-f777-4690-a930-0b144eb51da4 FLAKY 1/2
rptest.tests.schema_registry_test.SchemaRegistryTest.test_normalize.dataset_type=JSON ducktape https://buildkite.com/redpanda/redpanda/builds/63365#0195afb2-370b-4af1-ba19-35ffb25fe372 FLAKY 1/3

@BenPope BenPope force-pushed the core-9523/consumer_group_lag_aggregate_shard branch from 8b6e49e to 362e600 Compare March 17, 2025 08:40
@BenPope BenPope requested a review from IoannisRP March 17, 2025 08:40
IoannisRP
IoannisRP previously approved these changes Mar 17, 2025
@StephanDollberg
Member

This is a bit of an anti-pattern because of the issue described in #23339.

I don't think cardinality itself is an issue anymore, as per previous discussions. If you really want to avoid the shard label changing, I think the best option is to explicitly add the shard label with a static value ("0" or ""), as that will prevent seastar from adding it automatically (with the varying shard).
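
To illustrate the suggestion above, here is a minimal toy model in Python. It is not seastar's API. It only assumes the behaviour described in this comment: a `shard` label is filled in automatically unless the caller supplies one. The `group` label, `register_series`, and `series_after_moves` are hypothetical names invented for the sketch.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# A time series is identified by its full set of (label, value) pairs.
LabelSet = FrozenSet[Tuple[str, str]]


def register_series(seen: Set[LabelSet], labels: Dict[str, str], current_shard: int) -> None:
    """Record one series, mimicking an implicit per-shard label (assumed behaviour)."""
    final = dict(labels)
    # Assumption from the review comment: the shard label is only added
    # automatically when the caller has not already supplied one.
    final.setdefault("shard", str(current_shard))
    seen.add(frozenset(final.items()))


def series_after_moves(explicit_shard: Optional[str], shards: List[int]) -> int:
    """Count distinct series produced as a group coordinator visits `shards`."""
    seen: Set[LabelSet] = set()
    for shard in shards:
        labels = {"group": "g1"}  # hypothetical label, for illustration only
        if explicit_shard is not None:
            labels["shard"] = explicit_shard
        register_series(seen, labels, shard)
    return len(seen)
```

In this model, a coordinator that visits shards 0, 3 and 5 produces three distinct series when the label is left implicit, and a single series when `shard` is pinned to `""`.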

BenPope added 2 commits March 19, 2025 16:52
As currently implemented, the metrics contain the shard label,
which can add to cardinality over time as the group coordinator
can change shard.

Reduce cardinality by explicitly setting shard label empty, to
prevent seastar from adding it.

Signed-off-by: Ben Pope <[email protected]>
A group which has transitioned to dead does not need metrics;
ensure metrics are only emitted when state is not dead by
checking the state in setup_metrics(), and set_state where
all transitions pass through.

This avoids making requests for partition offsets when not
necessary.

Signed-off-by: Ben Pope <[email protected]>
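
The guard described in the second commit message can be sketched roughly as follows. This is a hypothetical Python model, not the Redpanda implementation. Only the names `setup_metrics` and `set_state` come from the commit message. The `Group` class, the `metrics_registered` flag, and the subset of group states are assumptions made for illustration.

```python
from enum import Enum, auto


class GroupState(Enum):
    # A subset of consumer group states, enough for the sketch.
    EMPTY = auto()
    STABLE = auto()
    DEAD = auto()


class Group:
    def __init__(self) -> None:
        self.state = GroupState.EMPTY
        self.metrics_registered = False  # hypothetical stand-in for real metric handles
        self.setup_metrics()

    def setup_metrics(self) -> None:
        # Guard: a dead group never gets metrics registered.
        if self.state is GroupState.DEAD:
            return
        self.metrics_registered = True

    def set_state(self, new_state: GroupState) -> None:
        # Every transition passes through here, so this is the single
        # place where metrics are dropped once the group becomes dead.
        self.state = new_state
        if new_state is GroupState.DEAD:
            self.metrics_registered = False
```

Because every transition goes through `set_state`, a group that reaches the dead state stops exposing metrics, and a later call to `setup_metrics` cannot bring them back.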
@BenPope BenPope force-pushed the core-9523/consumer_group_lag_aggregate_shard branch from 362e600 to d64ca1b Compare March 19, 2025 16:53
@BenPope BenPope changed the title from "[CORE-9523] Consumer Group Lag: Aggregate by shard" to "[CORE-9523] Consumer Group Lag: Set empty shard label" Mar 19, 2025
@BenPope BenPope requested a review from IoannisRP March 19, 2025 16:54
@BenPope
Member Author

BenPope commented Mar 19, 2025

Changes in force-push

  • Address review comment, set shard label empty.

Contributor

@IoannisRP IoannisRP left a comment

lgtm

@BenPope BenPope merged commit 2ffeec6 into redpanda-data:dev Mar 21, 2025
23 checks passed
@vbotbuildovich
Collaborator

/backport v25.1.x

4 participants