Describe the bug:
While testing the scale-up feature I observed that after other 2 pods starts running fine, 0th pod goes to CrashLoopBackOff due to panic in its etcd container. And backup-restore didn’t detect any db corruption.
Expected behavior:
If data corruption occurs, backup-restore should be able to detect the DB corruption and take the required action.
How To Reproduce (as minimally and precisely as possible):
1. Scale up the etcd cluster from .spec.replicas=1 to .spec.replicas=3 (see the example command below).
2. Try this several times; you may see the 2 other members join the cluster while the 0th member's etcd container panics and goes into CrashLoopBackOff.
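For reference, one way to trigger the scale-up, assuming a druid-managed Etcd resource named etcd-main (as the member names in the logs suggest) and substituting the actual namespace:

kubectl patch etcd etcd-main -n <namespace> --type merge -p '{"spec":{"replicas":3}}'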
time="2023-05-02T09:28:42Z" level=info msg="Validation mode: full" actor=backup-restore-server
time="2023-05-02T09:28:42Z" level=info msg="Responding to status request with: Progress" actor=backup-restore-server
time="2023-05-02T09:28:42Z" level=info msg="Checking whether etcd cluster is marked for scale-up" actor=member-add
time="2023-05-02T09:28:42Z" level=info msg="Checking if member etcd-main-0 is part of a running cluster" actor=member-add
time="2023-05-02T09:28:42Z" level=info msg="Member etcd-main-0 part of running cluster" actor=member-add
time="2023-05-02T09:28:42Z" level=info msg="Checking for data directory structure validity..."
time="2023-05-02T09:28:42Z" level=info msg="Checking for Etcd revision..."
time="2023-05-02T09:28:42Z" level=info msg="Skipping check for revision consistency of etcd member as it will get in sync with etcd leader."
time="2023-05-02T09:28:42Z" level=info msg="Checking for data directory files corruption..."
time="2023-05-02T09:28:42Z" level=info msg="Verifying snap directory..."
time="2023-05-02T09:28:42Z" level=info msg="Verifying WAL directory..."
time="2023-05-02T09:28:42Z" level=info msg="Verifying DB file..."
time="2023-05-02T09:28:42Z" level=info msg="Data directory valid."
time="2023-05-02T09:28:42Z" level=info msg="Successfully initialized data directory for etcd." actor=backup-restore-server
time="2023-05-02T09:28:43Z" level=info msg="Responding to status request with: Successful" actor=backup-restore-server
time="2023-05-02T09:28:43Z" level=info msg="Updating status from Successful to New" actor=backup-restore-server
time="2023-05-02T09:28:43Z" level=info msg="checking the presence of a learner in a cluster..." actor=member-add
time="2023-05-02T09:28:43Z" level=info msg="Served config for ETCD instance." actor=backup-restore-server
Etcd Logs:
2023-05-02 09:33:48.983240 I | etcdserver: restarting member a789d1ab8da2e800 in cluster ffa0021a34a4dfd7 at commit index 0
raft2023/05/02 09:33:48 INFO: a789d1ab8da2e800 switched to configuration voters=()
raft2023/05/02 09:33:48 INFO: a789d1ab8da2e800 became follower at term 0
raft2023/05/02 09:33:48 INFO: newRaft a789d1ab8da2e800 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2023-05-02 09:33:48.986153 W | auth: simple token is not cryptographically signed
2023-05-02 09:33:48.988626 I | etcdserver: starting server... [version: 3.4.13, cluster version: to_be_decided]
2023-05-02 09:33:48.991139 I | embed: listening for peers on [::]:2380
2023-05-02 09:33:49.052926 I | rafthttp: started HTTP pipelining with peer a9dfaad6ed1426aa
2023-05-02 09:33:49.052991 E | rafthttp: failed to find member a9dfaad6ed1426aa in cluster ffa0021a34a4dfd7
2023-05-02 09:33:49.053298 E | rafthttp: failed to find member a9dfaad6ed1426aa in cluster ffa0021a34a4dfd7
2023-05-02 09:33:49.056043 I | rafthttp: started HTTP pipelining with peer 2d3b31aaf470abd0
2023-05-02 09:33:49.056102 E | rafthttp: failed to find member 2d3b31aaf470abd0 in cluster ffa0021a34a4dfd7
2023-05-02 09:33:49.056381 E | rafthttp: failed to find member 2d3b31aaf470abd0 in cluster ffa0021a34a4dfd7
raft2023/05/02 09:33:49 INFO: a789d1ab8da2e800 [term: 0] received a MsgHeartbeat message with higher term from a9dfaad6ed1426aa [term: 3]
raft2023/05/02 09:33:49 INFO: a789d1ab8da2e800 became follower at term 3
raft2023/05/02 09:33:49 tocommit(22) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(22) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
goroutine 125 [running]:
log.(*Logger).Panicf(0xc0001318b0, 0x10db3a0, 0x5d, 0xc000546260, 0x2, 0x2)
/usr/local/go/src/log/log.go:219 +0xc1
go.etcd.io/etcd/raft.(*DefaultLogger).Panicf(0x1ac1310, 0x10db3a0, 0x5d, 0xc000546260, 0x2, 0x2)
/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/logger.go:127 +0x60
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc00024d110, 0x16)
/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/log.go:203 +0x131
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc0001b9540, 0x8, 0xa789d1ab8da2e800, 0xa9dfaad6ed1426aa, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/raft.go:1396 +0x54
go.etcd.io/etcd/raft.stepFollower(0xc0001b9540, 0x8, 0xa789d1ab8da2e800, 0xa9dfaad6ed1426aa, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/raft.go:1341 +0x480
go.etcd.io/etcd/raft.(*raft).Step(0xc0001b9540, 0x8, 0xa789d1ab8da2e800, 0xa9dfaad6ed1426aa, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/raft.go:984 +0x1113
go.etcd.io/etcd/raft.(*node).run(0xc00007faa0)
/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/node.go:352 +0xab6
created by go.etcd.io/etcd/raft.RestartNode
/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/node.go:240 +0x33c
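For context on the panic itself: it originates in raft's commitTo check (raft/log.go:203 in the trace above). Below is a simplified, self-contained sketch of that check, using the values visible in the logs (the member restarts with commit index 0 and lastindex 0, then receives a heartbeat from the leader at term 3 carrying commit index 22). The type and field names are paraphrased, not the exact etcd internals.

package main

import "log"

// Simplified mirror of the check in etcd's raft/log.go commitTo.
type raftLog struct {
	committed uint64 // highest index this member believes is committed
	lastIndex uint64 // highest index actually present in the local log
}

func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex < tocommit {
			// This is the panic seen in the etcd container: the leader asked
			// the member to commit an index it does not have locally.
			log.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?", tocommit, l.lastIndex)
		}
		l.committed = tocommit
	}
}

func main() {
	// Values from the logs above: empty local raft log after restart, while
	// the leader's heartbeat says the cluster has already committed index 22.
	l := &raftLog{committed: 0, lastIndex: 0}
	l.commitTo(22) // panics with the same message as in the etcd logs
}

In other words, raft treats the situation as a corrupted/truncated/lost log because the member's local log is empty (lastIndex 0) while the cluster has already committed index 22, even though the backup-restore data-directory validation above reported the directory as valid.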
Questions we need to answer to figure this out:
This should have been a simple pod restart, so what kind of corruption occurred? What was etcd expecting, and why did it panic if there is no corruption?
Why does this happen only after scaling up, once the other 2 members have joined?