@pgellert pgellert commented Jan 16, 2026

When fetches retry internally due to min_bytes not being satisfied, partitions with changed metadata were incorrectly excluded from the response. The issue was that update_fetch_partition() both checked whether a partition should be included AND mutated the session cache. On retry, the cache already reflected the response, so no change was detected.

Split into partition_has_changes() for the check in set() and update_session_partition() for the mutation in send_response().
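
To make the failure mode concrete, here is a minimal Python model of the retry problem. It is a sketch, not the C++ in fetch.cc: `SessionCache`, `PartitionMeta`, `check_and_update()` and `fetch_with_retry()` are hypothetical names made up for illustration, and the real signatures of partition_has_changes() and update_session_partition() may differ.

```python
from dataclasses import dataclass


@dataclass
class PartitionMeta:
    """Metadata tracked per partition in a fetch session (illustrative)."""
    high_watermark: int
    log_start_offset: int
    last_stable_offset: int


class SessionCache:
    """Toy stand-in for a fetch session's per-partition cache (hypothetical)."""

    def __init__(self):
        self._cached = {}

    def check_and_update(self, tp, meta):
        # Old shape: the check also mutates the cache as a side effect.
        changed = self._cached.get(tp) != meta
        self._cached[tp] = meta
        return changed

    def partition_has_changes(self, tp, meta):
        # New shape, check only: safe to call on every retry.
        return self._cached.get(tp) != meta

    def update_session_partition(self, tp, meta):
        # New shape, mutation only: called once, when the response is sent.
        self._cached[tp] = meta


def fetch_with_retry(cache, tp, meta, attempts, split):
    """Rebuild the response `attempts` times, as an internal min_bytes retry
    would, and return the partitions included in the final attempt."""
    included = []
    for _ in range(attempts):
        included = []
        if split:
            changed = cache.partition_has_changes(tp, meta)
        else:
            changed = cache.check_and_update(tp, meta)
        if changed:
            included.append(tp)
    if split:
        # Models send_response(): commit the state the client will see.
        cache.update_session_partition(tp, meta)
    return included


old = PartitionMeta(high_watermark=10, log_start_offset=0, last_stable_offset=10)
new = PartitionMeta(high_watermark=10, log_start_offset=5, last_stable_offset=10)

buggy = SessionCache()
buggy.update_session_partition("topic/0", old)
print("combined check+mutate:", fetch_with_retry(buggy, "topic/0", new, 2, split=False))

fixed = SessionCache()
fixed.update_session_partition("topic/0", old)
print("split check / mutate: ", fetch_with_retry(fixed, "topic/0", new, 2, split=True))
```

In this model, two attempts with the combined function drop the partition on the second pass because the first pass already overwrote the cache, while the split version still reports the change and only commits it after the final attempt.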

Also re-enables the source_high_watermark check in cluster linking tests.

Fixes https://redpandadata.atlassian.net/browse/CORE-14617
Fixes https://redpandadata.atlassian.net/browse/CORE-14653

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Bug Fixes

  • Fixed an issue where Kafka fetch session responses could omit partitions with changed metadata (high watermark, log start offset, last stable offset) when the fetch retried internally due to min_bytes not being satisfied.

@pgellert pgellert requested review from a team, ballard26, bharathv and joe-redpanda January 16, 2026 16:59
@pgellert pgellert self-assigned this Jan 16, 2026
@pgellert pgellert requested review from IoannisRP and Copilot and removed request for a team January 16, 2026 16:59

Copilot AI left a comment

Pull request overview

This PR fixes a bug (CORE-14617) where Kafka fetch sessions could incorrectly exclude partitions with changed metadata (high watermark, log start offset, last stable offset) when fetches retry internally due to min_bytes not being satisfied.

Changes:

  • Separated the logic for checking if a partition has changes from the logic for updating the session cache
  • Re-enabled source_high_watermark check in cluster linking tests that was previously disabled due to this bug
  • Added a regression test to verify partitions with changed metadata are included in fetch responses even during retries

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/v/kafka/server/handlers/fetch.cc | Split update_fetch_partition() into partition_has_changes() (check only) and update_session_cache() (mutation only); moved cache updates to send_response() to avoid premature updates during retries |
| src/v/kafka/server/tests/fetch_test.cc | Added regression test verifying partitions with changed log_start_offset are included in incremental fetch responses even when min_bytes triggers internal retry |
| tests/rptest/tests/cluster_linking_e2e_test.py | Re-enabled source_high_watermark validation check that was previously disabled due to this bug |

@michael-redpanda michael-redpanda left a comment

looks good - just request that you rerun that one re-enabled test a bunch

@joe-redpanda

Does this work if a fetch response gets lost on the wire + client retries?

vbotbuildovich commented Jan 16, 2026

CI test results

test results on build#79182
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| ClusterConfigLegacyDefaultTest | test_legacy_default_explicit_before_upgrade | {"wipe_cache": false} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigLegacyDefaultTest&test_method=test_legacy_default_explicit_before_upgrade |
| ClusterConfigLegacyDefaultTest | test_removal_of_legacy_default_overriden | {"wipe_cache": true} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b24-4685-b850-e0083fe297db | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigLegacyDefaultTest&test_method=test_removal_of_legacy_default_overriden |
| ClusterConfigLegacyDefaultTest | test_removal_of_legacy_default_overriden | {"wipe_cache": true} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7d3-4603-4415-9f0f-6de1dd251383 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigLegacyDefaultTest&test_method=test_removal_of_legacy_default_overriden |
| CompactionRecoveryUpgradeTest | test_index_recovery_after_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b24-4685-b850-e0083fe297db | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CompactionRecoveryUpgradeTest&test_method=test_index_recovery_after_upgrade |
| CompactionRecoveryUpgradeTest | test_index_recovery_after_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7d3-4603-4415-9f0f-6de1dd251383 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CompactionRecoveryUpgradeTest&test_method=test_index_recovery_after_upgrade |
| CertificateRevocationTest | test_rpc | null | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CertificateRevocationTest&test_method=test_rpc |
| DatalakeE2ETests | test_avro_schema | {"catalog_type": "nessie", "cloud_storage_type": 1, "query_engine": "trino"} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeE2ETests&test_method=test_avro_schema |
| DatalakeE2ETests | test_field_names_compat | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "spark"} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeE2ETests&test_method=test_field_names_compat |
| DatalakeE2ETests | test_iceberg_partition_key_file_location | {"catalog_type": "nessie", "cloud_storage_type": 1, "custom_partition_spec": null} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeE2ETests&test_method=test_iceberg_partition_key_file_location |
| DatalakeTableNameTest | test_topic_name_dot_replacement | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b24-4685-b850-e0083fe297db | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeTableNameTest&test_method=test_topic_name_dot_replacement |
| IcebergTogglingTest | test_iceberg_mode_toggling | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b24-4685-b850-e0083fe297db | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=IcebergTogglingTest&test_method=test_iceberg_mode_toggling |
| SchemaEvolutionE2ETests | test_reorder_columns | {"cloud_storage_type": 1, "qe_and_cat": ["trino", "rest_hadoop"]} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b24-4685-b850-e0083fe297db | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaEvolutionE2ETests&test_method=test_reorder_columns |
| SchemaScaleTest | schema_scale_test | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "spark", "use_partition_spec": false} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaScaleTest&test_method=schema_scale_test |
| DeleteRecordsTest | test_delete_records_bounds_checking | {"cloud_storage_enabled": true} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DeleteRecordsTest&test_method=test_delete_records_bounds_checking |
| DeleteRecordsTest | test_delete_records_segment_deletion | {"cloud_storage_enabled": true, "truncate_point": "one_below_high_watermark"} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DeleteRecordsTest&test_method=test_delete_records_segment_deletion |
| UpgradeMigratingLicenseVersion | test_license_upgrade | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=UpgradeMigratingLicenseVersion&test_method=test_license_upgrade |
| OffsetForLeaderEpochArchivalTest | test_querying_archive | null | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=OffsetForLeaderEpochArchivalTest&test_method=test_querying_archive |
| PandaProxyInvalidInputsTest | test_invalid_fetch_consumer_assignments | null | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b24-4685-b850-e0083fe297db | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PandaProxyInvalidInputsTest&test_method=test_invalid_fetch_consumer_assignments |
| NodeWiseRecoveryTest | test_node_wise_recovery | {"dead_node_count": 2} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_node_wise_recovery |
| PartitionMoveInterruption | test_cancelling_partition_move_x_core | {"replication_factor": 1, "unclean_abort": false} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionMoveInterruption&test_method=test_cancelling_partition_move_x_core |
| RaftAvailabilityTest | test_leader_restart | null | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7d3-4603-4415-9f0f-6de1dd251383 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RaftAvailabilityTest&test_method=test_leader_restart |
| RetentionPolicyTest | test_changing_topic_retention | {"acks": -1, "cloud_storage_type": 2, "property": "retention.bytes"} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b24-4685-b850-e0083fe297db | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RetentionPolicyTest&test_method=test_changing_topic_retention |
| SelfTestTest | test_self_test_mixed_node_controller_lower_version | null | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SelfTestTest&test_method=test_self_test_mixed_node_controller_lower_version |
| SIAdminApiTest | test_topic_recovery_on_leader | null | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIAdminApiTest&test_method=test_topic_recovery_on_leader |
| TieredStorageTest | test_tiered_storage | {"cloud_storage_type_and_url_style": [2, "virtual_host"], "test_case": {"name": "(TS_Read == True, SpilloverManifestUploaded == True)"}} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TieredStorageTest&test_method=test_tiered_storage |
| TieredStorageTest | test_tiered_storage | {"cloud_storage_type_and_url_style": [1, "path"], "test_case": {"name": "(TS_Read == True, SpilloverManifestUploaded == True)"}} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TieredStorageTest&test_method=test_tiered_storage |
| TieredStorageTest | test_tiered_storage | {"cloud_storage_type_and_url_style": [1, "virtual_host"], "test_case": {"name": "(TS_Read == True, SpilloverManifestUploaded == True)"}} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TieredStorageTest&test_method=test_tiered_storage |
| RapidTopicRecreateTest | test_topic_rapid_recreation | null | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RapidTopicRecreateTest&test_method=test_topic_rapid_recreation |
| TopicRecoveryTest | test_many_partitions | {"check_mode": "no_check", "cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/79182#019bc7cf-6b1e-41fe-87af-a6bb3cc0a195 | FAIL | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TopicRecoveryTest&test_method=test_many_partitions |
test results on build#79254
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| ShadowLinkingReplicationTests | test_replication_with_failures | null | integration | https://buildkite.com/redpanda/redpanda/builds/79254#019bd6c5-ffa7-43e0-b18e-57dae301927f | FLAKY | 29/30 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_with_failures |
| ShadowLinkingReplicationTests | test_replication_with_failures | null | integration | https://buildkite.com/redpanda/redpanda/builds/79254#019bd6c5-ffaf-4f91-8c1e-dd4dee3f41d6 | FLAKY | 29/30 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_with_failures |
test results on build#79259
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| MasterTestSuite | test_remote_partition_read_cached_index | | unit | https://buildkite.com/redpanda/redpanda/builds/79259#019bd6f4-cbcd-43ca-94c0-06d8a5f0c93f | FAIL | 0/1 | | |
test results on build#79312
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| ShadowLinkingMetricsTests | test_link_metrics | null | integration | https://buildkite.com/redpanda/redpanda/builds/79312#019bdb61-bfab-4ae0-aaca-18e08ce7d4b5 | FLAKY | 12/21 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.1308, p0=0.0025, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingMetricsTests&test_method=test_link_metrics |

@pgellert

/ci-repeat 10
skip-redpanda-build
skip-units
dt-repeat=20
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingReplicationTests.test_replication_basic

@pgellert

/ci-repeat 10
skip-redpanda-build
skip-units
dt-repeat=20
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingReplicationTests.test_replication_with_failures

@pgellert pgellert force-pushed the fix/fetch-session-partition-include branch from 830bda4 to e70ac4d Compare January 19, 2026 15:51
@pgellert

force-push: noop (only comment changes + variable renaming) to address review feedback

@pgellert

@michael-redpanda: just request that you rerun that one re-enabled test a bunch

Done, both tests that got this check reenabled are green:

@pgellert

@joe-redpanda: Does this work if a fetch response gets lost on the wire + client retries?

Good question, there are a couple of points here:

  1. TCP guarantees ordered delivery as long as the connection is established, so the response can't silently disappear. The response is either delivered successfully or the connection fails with an error.
  2. The fetch session epoch catches any desync in the lockstep-like state machine shared by the client and the broker. If a client retries with a stale epoch, it gets an invalid_fetch_session_epoch error code back. See KIP-227:
    https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=74687799#KIP227:IntroduceIncrementalFetchRequeststoIncreasePartitionScalability-FetchSessionEpoch

So the two expected scenarios are really:

  • Response is delivered -> client advances epoch and continues on
  • Connection dies -> client reconnects, establishes a new session with a full fetch, gets a new epoch to fetch with for later incremental fetches
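
For reference, the epoch handshake described above can be modeled in a few lines of Python. This is a simplified sketch of the KIP-227 idea, not Redpanda's or Kafka's implementation: `BrokerSession` and `Client` are hypothetical names, the epoch is assumed to start at 1 after the initial full fetch, and session IDs and epoch wraparound are left out.

```python
OK = "ok"
INVALID_FETCH_SESSION_EPOCH = "invalid_fetch_session_epoch"


class BrokerSession:
    """Broker-side view of one fetch session (hypothetical, simplified)."""

    def __init__(self):
        # Assumed starting point: the full fetch that created the session
        # has completed, so the first incremental fetch carries epoch 1.
        self.expected_epoch = 1

    def handle_incremental_fetch(self, client_epoch):
        # Any mismatch means the two sides have drifted out of lockstep.
        if client_epoch != self.expected_epoch:
            return INVALID_FETCH_SESSION_EPOCH
        self.expected_epoch += 1
        return OK


class Client:
    """Client-side view of the same session (hypothetical, simplified)."""

    def __init__(self):
        self.next_epoch = 1

    def on_response_delivered(self):
        # The client only advances once it has actually seen the response.
        self.next_epoch += 1


broker, client = BrokerSession(), Client()

# Response delivered: both sides advance together.
print(broker.handle_incremental_fetch(client.next_epoch))
client.on_response_delivered()

# Broker handles the fetch, but the client never sees the response and
# retries with the same epoch: the broker rejects it.
print(broker.handle_incremental_fetch(client.next_epoch))
print(broker.handle_incremental_fetch(client.next_epoch))

# Recovery in this model: start a fresh session with a full fetch.
broker, client = BrokerSession(), Client()
print(broker.handle_incremental_fetch(client.next_epoch))
```

The third call is the interesting one: the broker has already moved on, so the stale retry is rejected rather than answered from a cache that no longer matches what the client saw.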

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

@pgellert pgellert force-pushed the fix/fetch-session-partition-include branch from e70ac4d to b7c6970 Compare January 20, 2026 12:18
@pgellert

force-push: rebase to dev to fix the failure of the PR rendering CI check (pulls in this commit)

@vbotbuildovich

Retry command for Build#79312

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingMetricsTests.test_link_metrics

@michael-redpanda

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_linking_e2e_test.py::ShadowLinkingMetricsTests.test_link_metrics
