DataCenterTopologyRfControl doesn't work as expected in case of multi-dc #10136

Open
fruch opened this issue Feb 19, 2025 · 2 comments · May be fixed by #10153
@fruch
Contributor

fruch commented Feb 19, 2025

Packages

Scylla version: 2025.1.0~rc2-20250216.6ee17795783f with build-id 8fc682bcfdf0a8cd9bc106a5ecaa68dce1c63ef6

Kernel Version: 6.8.0-1021-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

The `decrease_keyspaces_rf` logic isn't aware of multiple regions, and drops a whole region out of the replication:

< t:2025-02-18 08:59:19,352 f:replication_strategy_utils.py l:203  c:sdcm.utils.replication_strategy_utils p:DEBUG > Altering keyspace1 RF with: ALTER KEYSPACE keyspace1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'eu-westscylla_node_west':2 }
< t:2025-02-18 08:59:19,352 f:common.py       l:1317 c:utils                p:DEBUG > Executing CQL 'ALTER KEYSPACE keyspace1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'eu-westscylla_node_west':2 }' ...

This command can get stuck and time out; regardless, it is the wrong thing to do, and it would drop all the data from one region.
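For illustration only, here is a minimal Python sketch of the failure mode (the function and argument names are assumptions, not the actual SCT code): ALTER KEYSPACE replaces the whole replication map, so a statement that names only the DC being changed silently removes every other DC, which is exactly what the log above shows.

```python
# Hypothetical sketch of the buggy pattern -- not the actual decrease_keyspaces_rf code.
# `session` is assumed to be a cassandra/scylla driver Session.
def broken_decrease_rf(session, keyspace: str, dc: str, new_rf: int) -> None:
    # ALTER KEYSPACE replaces the entire replication map, so naming only one DC
    # drops all other DCs from replication (their RF effectively becomes 0).
    session.execute(
        f"ALTER KEYSPACE {keyspace} WITH REPLICATION = "
        f"{{ 'class': 'NetworkTopologyStrategy', '{dc}': {new_rf} }}"
    )
```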

Impact

All cases using multiple regions might fail on any nemesis that removes a node first.

How frequently does it reproduce?

100%

Installation details

Cluster size: 6 nodes (i4i.xlarge)

Scylla Nodes used in this run:

  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-8 (35.176.16.52 | 10.3.2.177) (shards: -1)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-7 (34.253.210.248 | 10.4.3.232) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-6 (18.171.61.165 | 10.3.0.203) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-5 (18.171.210.169 | 10.3.2.90) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-4 (3.8.115.132 | 10.3.1.92) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-3 (54.195.144.123 | 10.4.0.90) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-2 (54.247.42.87 | 10.4.1.117) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-1 (54.217.0.78 | 10.4.1.156) (shards: 4)

OS / Image: ami-089e047033a16995a ami-0c34f939e95d0c640 (aws: undefined_region)

Test: EaR-longevity-kms-20gb-6h-multidc-test
Test id: 9f46998d-096f-4bf9-81b9-ca36f3002489
Test name: scylla-2025.1/features/EncryptionAtRest/EaR-longevity-kms-20gb-6h-multidc-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 9f46998d-096f-4bf9-81b9-ca36f3002489
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 9f46998d-096f-4bf9-81b9-ca36f3002489

Logs:

Jenkins job URL
Argus

@fruch fruch removed their assignment Feb 19, 2025
@yarongilor
Contributor

In the past we had some issues testing such a scenario on multi-DC, like scylladb/scylladb#20282.
In PR #7625 we added a dtest for a single-DC scenario. Perhaps we should also enhance dtest to cover multi-DC.

@fruch
Contributor Author

fruch commented Feb 20, 2025

> In the past we had some issues testing such a scenario on multi-DC, like scylladb/scylladb#20282. In PR #7625 we added a dtest for a single-DC scenario. Perhaps we should also enhance dtest to cover multi-DC.

Maybe the coredump was caused by the fact that you dropped a whole region out of replication?

Regardless of whether it can be tested or not, it needs to be solved, and it doesn't have anything to do with dtest.

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Feb 20, 2025
code in `_alter_keyspace_rf` was referring only to one DC
in the ALTER command, which led to RF=0 on the other DCs

this complicates the situation for scylla, and surfaces multiple
bugs in operations that take place during the ALTER;
also, this kind of change is not the intended behavior of this code

so now it updates the target DC's RF as needed, while leaving the other
DCs' RF the same as they were

Fix: scylladb#10136
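As a rough illustration of the direction the commit message describes (a hedged sketch assuming the Python cassandra/scylla driver; the names are illustrative and this is not the code from the linked PR): read the keyspace's current replication map from system_schema.keyspaces, change only the target DC, and re-issue the ALTER with every DC still present.

```python
# Illustrative sketch only -- not the actual SCT `_alter_keyspace_rf` implementation.
def alter_keyspace_rf(session, keyspace: str, dc: str, new_rf: int) -> None:
    # Read the current replication options so the other DCs keep their RF.
    row = session.execute(
        "SELECT replication FROM system_schema.keyspaces WHERE keyspace_name = %s",
        (keyspace,),
    ).one()
    replication = dict(row.replication)  # e.g. {'class': '...NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'}
    replication[dc] = str(new_rf)        # change only the requested DC
    options = ", ".join(f"'{key}': '{value}'" for key, value in replication.items())
    session.execute(f"ALTER KEYSPACE {keyspace} WITH REPLICATION = {{ {options} }}")
```

With this shape, decreasing keyspace1's RF in one region leaves the other region's RF untouched instead of dropping it to 0.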
@fruch fruch linked a pull request Feb 20, 2025 that will close this issue