Skip to content

CNDB-14460: Fix Nodes test flakiness resulting from unsafe interleaving of async operations in test scenarios #1812

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 17, 2025

Conversation

jkni
Copy link

@jkni jkni commented Jun 13, 2025

What is the issue

The singleton Nodes instance sequences operations that should not overlap by running them on a single-threaded executor. In some cases, these operations are executed in a synchronous manner, where the caller waits on the future. In other cases, they're executed asynchronously by queuing. In some tests, the singleton Nodes instance is shut down and replaced in an unsafe manner, to test cases where a node is restarted. This shut down does not terminate or wait on the executor, as the asynchronous tasks can safely be recovered on node restart. In the tests, however, these asynchronous operations can interleave with the newly created Nodes instance such that the operations no longer have the expected isolation, resulting in test failures.

Async operations can also interleave with the temporary directories backing a Nodes instance being deleted by Junit.

What does this PR fix and why was it fixed

When unsafely replacing the singleton Nodes instance in tests, trigger a shutdown on the executors and await inflight tasks.

When shutting down at the end of a test, await inflight tasks.

@jkni jkni self-assigned this Jun 13, 2025
Copy link

github-actions bot commented Jun 13, 2025

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@jkni jkni force-pushed the CNDB-14460 branch 2 times, most recently from 7bba36a to f6fc486 Compare June 16, 2025 14:24
…utors with unsafe set up of new nodes instance/test directory removal on shutdown.
Copy link

@cassci-bot
Copy link

✔️ Build ds-cassandra-pr-gate/PR-1812 approved by Butler


Approved by Butler
See build details here

@jkni jkni added the core-team label Jun 16, 2025
Copy link

@djatnieks djatnieks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes make sense and look good.

Q: Where are we seeing the flakiness? I may be missing it - I checked Butler and didn't see any NodesTest failures there?

@jkni
Copy link
Author

jkni commented Jun 17, 2025

It's a good question! We've had decent luck with it lately, but there are a few failures on PRs: http://butler-stargazer.aws.dsinternal.org/#/ci/upstream/workflow/ds-cassandra-pr-gate/failure/org.apache.cassandra.nodes/NodesTest?test_name=testPeersSerialization. This was also (for whatever timing reason) failing fairly reliably for me as I was doing local development.

@jkni jkni merged commit e0a509f into main Jun 17, 2025
487 of 491 checks passed
@jkni jkni deleted the CNDB-14460 branch June 17, 2025 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants