replica: swallow barrier exceptions in table_sync_and_check #18613
repair/table_check.cc
Outdated
// Trigger read barrier to synchronize schema.
co_await mm.get_group0_barrier().trigger(as);
}
} catch (...) {
We don't want to swallow every exception here, only specific ones, such as the barrier timing out.
For this reason, we also shouldn't enclose all of this code in try..catch, just the barrier itself.
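The shape suggested in this review comment could look roughly like the sketch below. It is plain C++ rather than Seastar coroutine code, and every name in it is a hypothetical stand-in: `raft_operation_timeout_error` mimics `service::raft_operation_timeout_error`, while `trigger_barrier` and `table_exists_locally` are placeholder callables for the group0 barrier and the local schema lookup. It only illustrates the narrow try..catch scope and the specific catch clause, not the actual code in repair/table_check.cc.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for service::raft_operation_timeout_error.
struct raft_operation_timeout_error : std::runtime_error {
    raft_operation_timeout_error()
        : std::runtime_error("group0 read barrier timed out") {}
};

// Result of the check: whether the table exists locally, and whether
// the schema was synchronized through the barrier beforehand.
struct check_result {
    bool table_exists;
    bool barrier_succeeded;
};

// `trigger_barrier` and `table_exists_locally` are hypothetical callables
// standing in for the read barrier and the local schema check.
check_result table_sync_and_check_sketch(
        const std::function<void()>& trigger_barrier,
        const std::function<bool()>& table_exists_locally) {
    bool synced = true;
    try {
        // Only the barrier call is guarded by the try block.
        trigger_barrier();
    } catch (const raft_operation_timeout_error&) {
        // Only the timeout is swallowed; any other exception propagates.
        synced = false;
    }
    // The local check sits outside the try block, so its errors are never swallowed.
    return {table_exists_locally(), synced};
}
```

Returning `barrier_succeeded` alongside the result is a design choice of this sketch, so that a caller can tell a synchronized answer from a local-only fallback.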
table_sync_and_check triggers read barrier to synchronize schema and checks whether a given table exists. If the barrier times out, service::raft_operation_timeout_error is propagated. Swallow exception and fall back on local schema check in case of barrier timeout. Fixes: scylladb#18490.
The barrier is there for correctness: we want repair to see all previous group0 changes, so we cannot ignore its failure. If we don't see previous changes, repair may think the table is dropped while it's actually not yet created. There are other reasons mentioned here: #17658
So should we let the barrier's exception fail repair, or should we do something else here (retry)?
Delete 10s timeout from read barrier in table_sync_and_check, so that the function always considers all previous group0 changes. Fixes: scylladb#18613.
Closing because of the above
We can start by failing repair, but the barrier should have a generous timeout so that it doesn't fail just because group0 is slow to catch up.
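The alternative described in this comment, letting the barrier failure fail repair while giving the barrier a generous timeout, might be sketched as follows. This is again plain C++ with hypothetical names: `barrier_failed`, `generous_barrier_timeout` and both callables are illustrative assumptions, and the 300-second value is arbitrary, since the thread does not state a concrete number.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>

// Hypothetical exception type representing any barrier failure.
struct barrier_failed : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Assumed value for illustration only; the thread asks for "generous"
// without naming a number.
constexpr std::chrono::seconds generous_barrier_timeout{300};

// `trigger_barrier` and `table_exists_locally` are hypothetical callables
// standing in for the read barrier and the local schema check.
bool table_sync_and_check_strict(
        const std::function<void(std::chrono::seconds)>& trigger_barrier,
        const std::function<bool()>& table_exists_locally) {
    // No try..catch: if the barrier throws, the exception reaches the
    // caller and repair fails instead of using a possibly stale schema.
    trigger_barrier(generous_barrier_timeout);
    // Reached only after the barrier has completed successfully.
    return table_exists_locally();
}
```

With this shape the local schema check never runs unless the barrier succeeded, which matches the correctness argument that repair must see all previous group0 changes.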
table_sync_and_check triggers read barrier to synchronize schema and check whether a given table exists. If the barrier fails, e.g. due to timeout, barrier's exception is propagated.
Swallow exception and fall back on local schema check in case of barrier failure.
Fixes: #18490.
No backport needed.