Scylla coredumped (on_internal_error) after log message "table - Unable to load SSTable" #22954
Comments
@denesb - please assign
@timtimb0t please provide more context. What was the test doing? Load and stream? Repair?
@raphaelsc, the latest nemesis before the failure was disrupt_destroy_data_then_rebuild. It stops the Scylla service on the target node, removes the non-system SSTable files, then starts Scylla and initiates the rebuild process (which completed successfully); see the rough sketch of that sequence below. About half an hour later, the disrupt_mgmt_repair_cli nemesis started. It created a repair task on the target node and failed (apparently because the target node had coredumped).
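A minimal, illustrative sketch of what that destroy-then-rebuild sequence amounts to on the target node. This is not the nemesis code (which lives in scylla-cluster-tests); the default data directory and the keyspace filter are assumptions:

$ sudo systemctl stop scylla-server
$ # wipe the SSTables of all non-system keyspaces (default data dir assumed)
$ cd /var/lib/scylla/data
$ for ks in $(ls | grep -v '^system'); do sudo rm -rf "$ks"/*/*; done
$ sudo systemctl start scylla-server
$ # stream the node's data back from the other replicas
$ nodetool rebuild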
the problem might be the test using the old unsafe repair API, which is inherently racy with other tablet-related activities like split. @asias we're going to deprecate the old repair API, and tablet tests should all switch to the new one, right?
should we wire our native nodetool repair tool into the new API?
#22905 ? But this is not good enough. We can't have an 'unsafe' repair API - either we fix it (and get the above into 2025.1.0!) or we make it safe.
AFAIK, the plan is to deprecate the old unsafe API. If repair doesn't go through the coordinator like it does with the new API, we're susceptible to these races.
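For context, a rough illustration of the "old" per-node repair path being discussed (assuming the default Scylla REST API port 10000; the node address and keyspace are placeholders). Because the request is issued against an individual node rather than going through a central coordinator, it can race with tablet operations such as split or migration that are coordinated elsewhere:

$ # start an async repair of one keyspace directly on a single node (old repair API)
$ curl -X POST "http://<node-ip>:10000/storage_service/repair_async/<keyspace>"
$ # the call returns a command id that the caller then polls for status, per node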
@timtimb0t - please confirm that the repair was done via the Manager. Also - which Manager version was used here?
Here is the related issue in the Manager repo: scylladb/scylla-manager#4188. No PR has been opened for it yet.
Repair was initiated via Manager but never finished.
Node log (1 min after the coredump):
Manager version according to log:
Packages
Scylla version:
2025.1.0~rc2-20250216.6ee17795783f
with build-id 8fc682bcfdf0a8cd9bc106a5ecaa68dce1c63ef6
Kernel Version:
6.8.0-1021-aws
Issue description
Scylla core dumped during the disrupt_mgmt_repair_cli nemesis. It is not clear what exact sequence led to it (but the nemesis created repair tasks for Scylla).
decoded:
Importantly, nodes 1, 2, 3, 5, 6 and 11 were affected at the same time.
Impact
Node core dumped
How frequently does it reproduce?
Installation details
Cluster size: 8 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-089e047033a16995a, ami-0c34f939e95d0c640, ami-06abf4cbfd7fb8f20 (aws: undefined_region)
Test:
longevity-multi-dc-rack-aware-with-znode-dc-test
Test id:
17b48f6e-2559-4fc7-b6ee-7c717c3acda7
Test name:
scylla-2025.1/longevity/longevity-multi-dc-rack-aware-with-znode-dc-test
Test method:
longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 17b48f6e-2559-4fc7-b6ee-7c717c3acda7
$ hydra investigate show-logs 17b48f6e-2559-4fc7-b6ee-7c717c3acda7
Logs:
Jenkins job URL
Argus