-
Notifications
You must be signed in to change notification settings - Fork 709
CORE-14617 kafka: fix fetch session retry losing partition inclusion #29304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
CORE-14617 kafka: fix fetch session retry losing partition inclusion #29304
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a bug (CORE-14617) where Kafka fetch sessions could incorrectly exclude partitions with changed metadata (high watermark, log start offset, last stable offset) when fetches retry internally due to min_bytes not being satisfied.
Changes:
- Separated the logic for checking if a partition has changes from the logic for updating the session cache
- Re-enabled source_high_watermark check in cluster linking tests that was previously disabled due to this bug
- Added a regression test to verify partitions with changed metadata are included in fetch responses even during retries
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/v/kafka/server/handlers/fetch.cc | Split update_fetch_partition() into partition_has_changes() (check only) and update_session_cache() (mutation only); moved cache updates to send_response() to avoid premature updates during retries |
| src/v/kafka/server/tests/fetch_test.cc | Added regression test verifying partitions with changed log_start_offset are included in incremental fetch responses even when min_bytes triggers internal retry |
| tests/rptest/tests/cluster_linking_e2e_test.py | Re-enabled source_high_watermark validation check that was previously disabled due to this bug |
michael-redpanda
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good - just request that you rerun that one re-enabled test a bunch
|
Does this work if a fetch response gets lost on the wire + client retries? |
|
/ci-repeat 10 |
|
/ci-repeat 10 |
830bda4 to
e70ac4d
Compare
|
force-push: noop (only comment changes + variable renaming) to addresses review feedback |
Done, both tests that got this check reenabled are green: |
Good question, there are a couple of points here:
So the two expected scenarios are really:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
When fetches retry internally due to min_bytes not being satisfied, partitions with changed metadata were incorrectly excluded from the response. The issue was that update_fetch_partition() both checked whether a partition should be included AND mutated the session cache. On retry, the cache already reflected the response, so no change was detected. Split into partition_has_changes() for the check in set() and update_session_partition() for the mutation in send_response(). Also re-enables the source_high_watermark check in cluster linking tests.
e70ac4d to
b7c6970
Compare
|
force-push: rebase to dev to fix the failure of the PR rendering CI check (pulls in this commit) |
Retry command for Build#79312please wait until all jobs are finished before running the slash command |
|
/ci-repeat 1 |
When fetches retry internally due to min_bytes not being satisfied, partitions with changed metadata were incorrectly excluded from the response. The issue was that update_fetch_partition() both checked whether a partition should be included AND mutated the session cache. On retry, the cache already reflected the response, so no change was detected.
Split into partition_has_changes() for the check in set() and update_session_partition() for the mutation in send_response().
Also re-enables the source_high_watermark check in cluster linking tests.
Fixes https://redpandadata.atlassian.net/browse/CORE-14617
Fixes https://redpandadata.atlassian.net/browse/CORE-14653
Backports Required
Release Notes
Bug Fixes