
[Bug] Consumers getting stuck since read more entries doesn't properly handle retries with backoff #23264

Closed
lhotari opened this issue Sep 6, 2024 · 2 comments
Labels: category/reliability, type/bug


lhotari commented Sep 6, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

all release versions

Minimal reproduce step

No steps to reproduce yet. This is based on reading the code and is more of a concern than a proven bug.

What did you expect to see?

Consumers shouldn't get stuck in complex scenarios where a trigger to read more entries gets ignored because of some other rule.

What did you see instead?

In the dispatchers, the handling of readMoreEntries is not consistent.
It is very likely that a dispatcher will stop reading more entries, and therefore stop dispatching them to consumers, in situations where it should continue. The reason is that there is no proper tracking of the "signals" that trigger reading more entries, or of whether a given signal has actually been handled. Currently there are multiple flags and mechanisms for preventing more than one read from being in flight at a time. This needs refactoring to ensure that the behavior is consistent and reliable. A sketch of the idea follows below.
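
To make the refactoring direction concrete, here is a minimal sketch of explicit signal tracking. All names here (ReadMoreEntriesCoordinator, signalReadMoreEntries, and so on) are hypothetical and do not come from Pulsar's code; this illustrates the pattern, not the actual implementation:

// Hypothetical sketch: a signal that arrives while a read is in flight is
// remembered and replayed when the read completes, so it is never dropped.
class ReadMoreEntriesCoordinator {
    private boolean readInFlight = false;
    private boolean rereadRequested = false;

    // Called by any trigger (acks, mark-delete moves, new consumers, ...).
    synchronized void signalReadMoreEntries() {
        if (readInFlight) {
            // Record the signal instead of dropping it; without this,
            // a trigger arriving mid-read would be lost.
            rereadRequested = true;
        } else {
            readInFlight = true;
            startRead();
        }
    }

    // Called when the in-flight read completes, on success or failure.
    synchronized void onReadCompleted() {
        readInFlight = false;
        if (rereadRequested) {
            rereadRequested = false;
            signalReadMoreEntries();
        }
    }

    private void startRead() {
        // Issue the actual managed-ledger read here (omitted in this sketch).
    }
}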

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@lhotari added the type/bug and category/reliability labels on Sep 6, 2024

lhotari commented Sep 6, 2024

There are many triggers for resuming reading more entries. Here is one of them:

@Override
public void markDeletePositionMoveForward() {
    // Execute the notification in different thread to avoid a mutex chain here
    // from the delete operation that was completed
    topic.getBrokerService().getTopicOrderedExecutor().execute(() -> {
        synchronized (PersistentStickyKeyDispatcherMultipleConsumers.this) {
            if (recentlyJoinedConsumers != null && !recentlyJoinedConsumers.isEmpty()
                    && removeConsumersFromRecentJoinedConsumers()) {
                // After we process acks, we need to check whether the mark-delete position was advanced and we
                // can finally read more messages. It's safe to call readMoreEntries() multiple times.
                readMoreEntries();
            }
        }
    });
}
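
To illustrate how such a trigger can get lost when combined with retries and backoff, here is a hedged, simplified sketch; the names (StuckDispatcherSketch, havePendingRead, scheduleRetryWithBackoff) are hypothetical and this is not the actual dispatcher code:

// Hypothetical sketch of the suspected stall: the trigger path bails out
// while a read is "pending" without remembering that a signal arrived.
class StuckDispatcherSketch {
    private volatile boolean havePendingRead = false;

    void readMoreEntries() {
        if (havePendingRead) {
            return; // the signal is silently dropped here, not recorded
        }
        havePendingRead = true;
        issueRead();
    }

    private void issueRead() {
        boolean failed = true; // simulate a read failure
        if (failed) {
            // If this retry is ever skipped or cancelled, or the flag is not
            // cleared on some failure path, no later trigger can restart
            // reading: every call to readMoreEntries() bails out above.
            scheduleRetryWithBackoff(this::retryRead);
        }
    }

    private void retryRead() {
        havePendingRead = false;
        readMoreEntries();
    }

    private void scheduleRetryWithBackoff(Runnable task) {
        // Placeholder for a delayed executor (omitted in this sketch).
    }
}

In this sketch, any gap between the failed read and the backoff retry during which the flag stays set suppresses all incoming triggers, which matches the "consumers getting stuck" symptom in the title.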


lhotari commented Oct 14, 2024

Addressed in #23231 and PIP-379 changes for Key_Shared.

@lhotari lhotari closed this as completed Oct 14, 2024