nodetool repair command failed with exit code 3 during drop keyspace #18490
Comments
Yes, it looks like this. That means that repair failed anyway, but probably with some meaningful expected error. I think the raft error should be caught and the original one rethrown, right?
Possibly yes.
The original error is probably due to the keyspace being dropped, in which case we don't want to raise it either.
But if the barrier times out, we can still be out of date and have no way to find that out.
Right, so I think we should swallow the barrier being timed out, and just work with what we have (the local schema) to check for a dropped keyspace/table.
table_sync_and_check triggers a read barrier to synchronize schema and checks whether a given table exists. If the barrier fails, e.g. due to timeout, the barrier's exception is propagated. Swallow the exception and fall back on a local schema check in case of barrier failure. Fixes: scylladb#18490.
table_sync_and_check triggers a read barrier to synchronize schema and checks whether a given table exists. If the barrier times out, service::raft_operation_timeout_error is propagated. Swallow the exception and fall back on a local schema check in case of barrier timeout. Fixes: scylladb#18490.
A read barrier timeout also happened when replacing a node during the add-drop column nemesis:
Can it be the same?
Installation details
Cluster size: 12 nodes (i3en.2xlarge)
Yes, it looks like it is the same. @tgrabiec said we cannot just swallow this error, so I'm not sure how we should proceed here. Do we need to investigate each such timeout?
What's the timeout currently? Maybe it's too short.
It's 10 seconds.
We've seen schema changes take more than that in the past, which could explain why the barrier times out. What's the reasoning behind the 10s timeout? I think repair doesn't have to time out on the group0 barrier at all, provided that it reliably eventually completes. If the user wants to give up, they should cancel the repair.
It was meant to be a fallback in case half of the cluster was down. But if we do not want to proceed with local data anyway, there is no point. I will delete that. |
Delete the 10s timeout from the read barrier in table_sync_and_check, so that the function always considers all previous group0 changes. Fixes: scylladb#18490.
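The effect of dropping the deadline can be shown with a small toy model. Everything here is hypothetical (a simulated barrier whose "work" duration is a parameter), but it captures the trade-off discussed above: a bounded barrier can fail on a slow schema change, while an unbounded one always completes eventually and is abandoned only if the repair itself is cancelled.

```cpp
#include <chrono>
#include <optional>
#include <stdexcept>

using std::chrono::seconds;

// Illustrative stand-in for the barrier timeout error.
struct raft_operation_timeout_error : std::runtime_error {
    raft_operation_timeout_error()
        : std::runtime_error("read barrier timed out") {}
};

// `work` models how long the barrier needs to complete; `timeout` is the
// optional deadline: 10s before the fix, std::nullopt after it.
void read_barrier(seconds work, std::optional<seconds> timeout) {
    if (timeout && work > *timeout) {
        throw raft_operation_timeout_error();
    }
    // Barrier completed: the local schema reflects all group0 changes.
}

bool barrier_completes(seconds work, std::optional<seconds> timeout) {
    try {
        read_barrier(work, timeout);
        return true;
    } catch (const raft_operation_timeout_error&) {
        return false;
    }
}
```

With the 10s deadline, a schema change taking 30s makes the barrier (and hence the repair) fail; with no deadline it simply waits, which matches the conclusion that repair should not time out on the group0 barrier at all.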
Packages
Scylla version:
5.5.0~dev-20240419.a5dae74aee4b
with build-id 13e359437a7035b8b40bcd744a70b6abddbc491b
Kernel Version:
5.15.0-1060-aws
Issue description
The issue looks like a regression.
During the no_corrupt_repair nemesis, nodetool repair is started in the background, and while it is running, keyspaces are dropped.
The keyspace drop failed with an error similar to issue #18479, but nodetool repair also failed with an error:
and these messages were found in the log of node 1:
Installation details
Cluster size: 4 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-00f7c6bc1f946636b
(aws: undefined_region)
Test:
longevity-twcs-48h-test
Test id:
e5517f29-ad39-4fcf-8203-8a8404416893
Test name:
scylla-master/tier1/longevity-twcs-48h-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor e5517f29-ad39-4fcf-8203-8a8404416893
$ hydra investigate show-logs e5517f29-ad39-4fcf-8203-8a8404416893
Logs:
Jenkins job URL
Argus