Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replica: swallow barrier exceptions in table_sync_and_check #18613

Closed
wants to merge 1 commit into from

Conversation

Deexie
Copy link
Contributor

@Deexie Deexie commented May 10, 2024

table_sync_and_check triggers read barrier to synchronize schema and check whether a given table exists. If the barrier fails, e.g. due to timeout, barrier's exception is propagated.

Swallow exception and fall back on local schema check in case of barrier failure.

Fixes: #18490.

  • ** Backport reason (please explain below if this patch should be backported or not) **
    No backport needed.

@Deexie Deexie requested a review from tgrabiec as a code owner May 10, 2024 11:24
@Deexie Deexie added the backport/none Backport is not required label May 10, 2024
@Deexie Deexie self-assigned this May 10, 2024
@Deexie Deexie requested review from denesb and removed request for tgrabiec May 10, 2024 11:24
// Trigger read barrier to synchronize schema.
co_await mm.get_group0_barrier().trigger(as);
}
} catch (...) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to swallow any exception here, just some specific exceptions, like the barrier timing out.
For this reason, we also shouldn't enclose all this code in try..catch, just the barrier itself.

table_sync_and_check triggers read barrier to synchronize schema
and checks whether a given table exists. If the barrier times out,
service::raft_operation_timeout_error is propagated.

Swallow exception and fall back on local schema check in case of barrier
timeout.

Fixes: scylladb#18490.
@Deexie
Copy link
Contributor Author

Deexie commented May 10, 2024

  • swallow only service::raft_operation_timeout_error

@tgrabiec
Copy link
Contributor

The barrier is there for correctness, we want repair to see all previous group0 changes, so we cannot ignore its failure.

If we don't see previous changes, repair may think table is dropped while it's actually not yet created.

There are other reasons mentioned here: #17658

@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 3 hr 20 min
  • Builder: spider8.cloudius-systems.com

@denesb
Copy link
Contributor

denesb commented May 13, 2024

The barrier is there for correctness, we want repair to see all previous group0 changes, so we cannot ignore its failure.

So should we let the barrier's exception fail repair, or should we do something else here (retry)?

Deexie added a commit to Deexie/scylla that referenced this pull request May 20, 2024
Delete 10s timeout from read barrier in table_sync_and_check,
so that the function always considers all previous group0 changes.

Fixes: scylladb#18613.
@mykaul mykaul added this to the 6.0 milestone May 20, 2024
@Deexie
Copy link
Contributor Author

Deexie commented May 20, 2024

Closing because of the above

@Deexie Deexie closed this May 20, 2024
@tgrabiec
Copy link
Contributor

The barrier is there for correctness, we want repair to see all previous group0 changes, so we cannot ignore its failure.

So should we let the barrier's exception fail repair, or should we do something else here (retry)?

We can start by failing repair, but the barrier should have a generous timeout so that it doesn't fail just because group0 is slow to catch up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/repair backport/none Backport is not required
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nodetool repair command failed with exit code3 during drop keyspace
5 participants