DataCenterTopologyRfControl doesn't work as expected in case of multi-dc #10136

Open
fruch opened this issue Feb 19, 2025 · 2 comments · May be fixed by #10153
@fruch
Contributor

fruch commented Feb 19, 2025

Packages

Scylla version: 2025.1.0~rc2-20250216.6ee17795783f with build-id 8fc682bcfdf0a8cd9bc106a5ecaa68dce1c63ef6

Kernel Version: 6.8.0-1021-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

The `decrease_keyspaces_rf` logic isn't aware of multiple regions, and drops a whole region out of the replication:

< t:2025-02-18 08:59:19,352 f:replication_strategy_utils.py l:203  c:sdcm.utils.replication_strategy_utils p:DEBUG > Altering keyspace1 RF with: ALTER KEYSPACE keyspace1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'eu-westscylla_node_west':2 }
< t:2025-02-18 08:59:19,352 f:common.py       l:1317 c:utils                p:DEBUG > Executing CQL 'ALTER KEYSPACE keyspace1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'eu-westscylla_node_west':2 }' ...

This command can get stuck and time out; regardless, it is the wrong thing to do, and it would drop all the data from one region.
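For illustration only, here is a minimal Python sketch of the failure mode (the function and argument names are assumptions, not the actual SCT code): ALTER KEYSPACE replaces the whole replication map, so a statement that names only the DC being changed silently removes every other DC, which is exactly what the log above shows.

```python
# Hypothetical sketch of the buggy pattern -- not the actual decrease_keyspaces_rf code.
# `session` is assumed to be a cassandra/scylla driver Session.
def broken_decrease_rf(session, keyspace: str, dc: str, new_rf: int) -> None:
    # ALTER KEYSPACE replaces the entire replication map, so naming only one DC
    # drops all other DCs from replication (their RF effectively becomes 0).
    session.execute(
        f"ALTER KEYSPACE {keyspace} WITH REPLICATION = "
        f"{{ 'class': 'NetworkTopologyStrategy', '{dc}': {new_rf} }}"
    )
```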

Impact

All cases using multiple regions might fail on any nemesis that removes a node first.

How frequently does it reproduce?

100%

Installation details

Cluster size: 6 nodes (i4i.xlarge)

Scylla Nodes used in this run:

  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-8 (35.176.16.52 | 10.3.2.177) (shards: -1)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-7 (34.253.210.248 | 10.4.3.232) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-6 (18.171.61.165 | 10.3.0.203) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-5 (18.171.210.169 | 10.3.2.90) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-4 (3.8.115.132 | 10.3.1.92) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-3 (54.195.144.123 | 10.4.0.90) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-2 (54.247.42.87 | 10.4.1.117) (shards: 4)
  • long-encrypt-at-rest-20gb-6h-multid-db-node-9f46998d-1 (54.217.0.78 | 10.4.1.156) (shards: 4)

OS / Image: ami-089e047033a16995a ami-0c34f939e95d0c640 (aws: undefined_region)

Test: EaR-longevity-kms-20gb-6h-multidc-test
Test id: 9f46998d-096f-4bf9-81b9-ca36f3002489
Test name: scylla-2025.1/features/EncryptionAtRest/EaR-longevity-kms-20gb-6h-multidc-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 9f46998d-096f-4bf9-81b9-ca36f3002489
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 9f46998d-096f-4bf9-81b9-ca36f3002489

Logs:

Jenkins job URL
Argus

@fruch fruch removed their assignment Feb 19, 2025
@yarongilor
Contributor

In the past we had some issues testing such a scenario on multi-DC, like scylladb/scylladb#20282.
In PR #7625 we added a dtest for a single-DC scenario. Perhaps we should also enhance dtest to cover multi-DC.

@fruch
Contributor Author

fruch commented Feb 20, 2025

> In the past we had some issues testing such a scenario on multi-DC, like scylladb/scylladb#20282. In PR #7625 we added a dtest for a single-DC scenario. Perhaps we should also enhance dtest to cover multi-DC.

Maybe the coredump was caused by the fact that you dropped a whole region out of replication?

Regardless of whether it can be tested or not, it needs to be solved, and it doesn't have anything to do with dtest.

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Feb 20, 2025
code in `_alter_keyspace_rf` was referring only to one DC
in the ALTER command, which led to RF=0 on the other DCs

this complicates the situation for scylla, and surfaces multiple
bugs in operations that take place during the ALTER;
also, this kind of change is not the intended behavior of this code

so now it updates the target DC's RF as needed, while leaving the other
DCs' RF the same as they were

Fix: scylladb#10136
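As a rough illustration of the direction the commit message describes (a hedged sketch assuming the Python cassandra/scylla driver; the names are illustrative and this is not the code from the linked PR): read the keyspace's current replication map from system_schema.keyspaces, change only the target DC, and re-issue the ALTER with every DC still present.

```python
# Illustrative sketch only -- not the actual SCT `_alter_keyspace_rf` implementation.
def alter_keyspace_rf(session, keyspace: str, dc: str, new_rf: int) -> None:
    # Read the current replication options so the other DCs keep their RF.
    row = session.execute(
        "SELECT replication FROM system_schema.keyspaces WHERE keyspace_name = %s",
        (keyspace,),
    ).one()
    replication = dict(row.replication)  # e.g. {'class': '...NetworkTopologyStrategy', 'dc1': '3', 'dc2': '3'}
    replication[dc] = str(new_rf)        # change only the requested DC
    options = ", ".join(f"'{key}': '{value}'" for key, value in replication.items())
    session.execute(f"ALTER KEYSPACE {keyspace} WITH REPLICATION = {{ {options} }}")
```

With this shape, decreasing keyspace1's RF in one region leaves the other region's RF untouched instead of dropping it to 0.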
@fruch fruch linked a pull request Feb 20, 2025 that will close this issue