[BUG] Pod restart of etcd-0 after scale-up failed with panic #620

Open
ishan16696 opened this issue May 3, 2023 · 0 comments
Labels
kind/bug Bug lifecycle/rotten Nobody worked on this for 12 months (final aging stage)

Comments

@ishan16696
Member

Describe the bug:
While testing the scale-up feature, I observed that after the other 2 pods start running fine, the 0th pod goes into CrashLoopBackOff due to a panic in its etcd container, and backup-restore did not detect any DB corruption.

Expected behavior:
If data corruption occurs, then backup-restore should be able to detect the DB corruption and take the required action.

How To Reproduce (as minimally and precisely as possible):

  1. Scale up the etcd cluster from .spec.replicas=1 to .spec.replicas=3 (one way to apply the change is sketched right after this list)
  2. Try it several times; you may see the other 2 members join the cluster while etcd-main-0 keeps crashing:
etcd-main-0                   1/2     CrashLoopBackOff   19 (2m24s ago)   75m
etcd-main-1                   2/2     Running            0                88m
etcd-main-2                   2/2     Running            2 (88m ago)      88m

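For reference, here is a minimal sketch of one way to perform step 1 programmatically with client-go's dynamic client. It assumes etcd-druid's Etcd custom resource under druid.gardener.cloud/v1alpha1 and an Etcd object named etcd-main in the default namespace; the kubeconfig path, namespace, and name are illustrative, and a plain kubectl patch/edit of .spec.replicas achieves the same thing.

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location (illustrative).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// etcd-druid's Etcd custom resource (group/version assumed: druid.gardener.cloud/v1alpha1).
	gvr := schema.GroupVersionResource{Group: "druid.gardener.cloud", Version: "v1alpha1", Resource: "etcds"}
	patch := []byte(`{"spec":{"replicas":3}}`)

	// Merge-patch .spec.replicas from 1 to 3 to trigger the scale-up.
	if _, err := dyn.Resource(gvr).Namespace("default").Patch(
		context.TODO(), "etcd-main", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		log.Fatal(err)
	}
	fmt.Println("patched etcd-main .spec.replicas to 3")
}
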
Backup-restore Logs:

time="2023-05-02T09:28:42Z" level=info msg="Validation mode: full" actor=backup-restore-server
time="2023-05-02T09:28:42Z" level=info msg="Responding to status request with: Progress" actor=backup-restore-server
time="2023-05-02T09:28:42Z" level=info msg="Checking whether etcd cluster is marked for scale-up" actor=member-add
time="2023-05-02T09:28:42Z" level=info msg="Checking if member etcd-main-0 is part of a running cluster" actor=member-add
time="2023-05-02T09:28:42Z" level=info msg="Member etcd-main-0 part of running cluster" actor=member-add
time="2023-05-02T09:28:42Z" level=info msg="Checking for data directory structure validity..."
time="2023-05-02T09:28:42Z" level=info msg="Checking for Etcd revision..."
time="2023-05-02T09:28:42Z" level=info msg="Skipping check for revision consistency of etcd member as it will get in sync with etcd leader."
time="2023-05-02T09:28:42Z" level=info msg="Checking for data directory files corruption..."
time="2023-05-02T09:28:42Z" level=info msg="Verifying snap directory..."
time="2023-05-02T09:28:42Z" level=info msg="Verifying WAL directory..."
time="2023-05-02T09:28:42Z" level=info msg="Verifying DB file..."
time="2023-05-02T09:28:42Z" level=info msg="Data directory valid."
time="2023-05-02T09:28:42Z" level=info msg="Successfully initialized data directory for etcd." actor=backup-restore-server
time="2023-05-02T09:28:43Z" level=info msg="Responding to status request with: Successful" actor=backup-restore-server
time="2023-05-02T09:28:43Z" level=info msg="Updating status from Successful to New" actor=backup-restore-server
time="2023-05-02T09:28:43Z" level=info msg="checking the presence of a learner in a cluster..." actor=member-add
time="2023-05-02T09:28:43Z" level=info msg="Served config for ETCD instance." actor=backup-restore-server
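For context on why the validation above ends with "Data directory valid.": the "Verifying DB file..." step only checks the bolt database, not the WAL/raft log that the etcd panic below complains about. Below is a minimal sketch of such a bolt-level integrity check, assuming bbolt's read-only open and Tx.Check; it is illustrative, not the exact validator code in etcd-backup-restore, and the db path is just the usual layout of an etcd data directory.

package main

import (
	"fmt"
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

// checkDBFile opens the member's bolt DB read-only and reports any page-level
// corruption found by Tx.Check. A db file can pass this check even if the
// member's raft log (WAL) was truncated or reset.
func checkDBFile(path string) error {
	// Timeout avoids blocking forever if another process still holds the file lock.
	db, err := bolt.Open(path, 0o600, &bolt.Options{ReadOnly: true, Timeout: time.Second})
	if err != nil {
		return fmt.Errorf("open %s: %w", path, err)
	}
	defer db.Close()

	return db.View(func(tx *bolt.Tx) error {
		var firstErr error
		// Drain the channel so the checker goroutine can finish.
		for cerr := range tx.Check() {
			if firstErr == nil {
				firstErr = cerr
			}
		}
		return firstErr
	})
}

func main() {
	// Path is illustrative; adjust to the member's data directory.
	if err := checkDBFile("/var/etcd/data/new.etcd/member/snap/db"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("db file passed the bolt integrity check")
}

A db file that passes this kind of check can still belong to a member whose raft log no longer matches what the rest of the cluster has committed, which is exactly what the panic below describes.
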

Etcd Logs:

2023-05-02 09:33:48.983240 I | etcdserver: restarting member a789d1ab8da2e800 in cluster ffa0021a34a4dfd7 at commit index 0
raft2023/05/02 09:33:48 INFO: a789d1ab8da2e800 switched to configuration voters=()
raft2023/05/02 09:33:48 INFO: a789d1ab8da2e800 became follower at term 0
raft2023/05/02 09:33:48 INFO: newRaft a789d1ab8da2e800 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2023-05-02 09:33:48.986153 W | auth: simple token is not cryptographically signed
2023-05-02 09:33:48.988626 I | etcdserver: starting server... [version: 3.4.13, cluster version: to_be_decided]
2023-05-02 09:33:48.991139 I | embed: listening for peers on [::]:2380
2023-05-02 09:33:49.052926 I | rafthttp: started HTTP pipelining with peer a9dfaad6ed1426aa
2023-05-02 09:33:49.052991 E | rafthttp: failed to find member a9dfaad6ed1426aa in cluster ffa0021a34a4dfd7
2023-05-02 09:33:49.053298 E | rafthttp: failed to find member a9dfaad6ed1426aa in cluster ffa0021a34a4dfd7
2023-05-02 09:33:49.056043 I | rafthttp: started HTTP pipelining with peer 2d3b31aaf470abd0
2023-05-02 09:33:49.056102 E | rafthttp: failed to find member 2d3b31aaf470abd0 in cluster ffa0021a34a4dfd7
2023-05-02 09:33:49.056381 E | rafthttp: failed to find member 2d3b31aaf470abd0 in cluster ffa0021a34a4dfd7
raft2023/05/02 09:33:49 INFO: a789d1ab8da2e800 [term: 0] received a MsgHeartbeat message with higher term from a9dfaad6ed1426aa [term: 3]
raft2023/05/02 09:33:49 INFO: a789d1ab8da2e800 became follower at term 3
raft2023/05/02 09:33:49 tocommit(22) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(22) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

goroutine 125 [running]:
log.(*Logger).Panicf(0xc0001318b0, 0x10db3a0, 0x5d, 0xc000546260, 0x2, 0x2)
	/usr/local/go/src/log/log.go:219 +0xc1
go.etcd.io/etcd/raft.(*DefaultLogger).Panicf(0x1ac1310, 0x10db3a0, 0x5d, 0xc000546260, 0x2, 0x2)
	/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/logger.go:127 +0x60
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc00024d110, 0x16)
	/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/log.go:203 +0x131
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(0xc0001b9540, 0x8, 0xa789d1ab8da2e800, 0xa9dfaad6ed1426aa, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/raft.go:1396 +0x54
go.etcd.io/etcd/raft.stepFollower(0xc0001b9540, 0x8, 0xa789d1ab8da2e800, 0xa9dfaad6ed1426aa, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/raft.go:1341 +0x480
go.etcd.io/etcd/raft.(*raft).Step(0xc0001b9540, 0x8, 0xa789d1ab8da2e800, 0xa9dfaad6ed1426aa, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/raft.go:984 +0x1113
go.etcd.io/etcd/raft.(*node).run(0xc00007faa0)
	/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/node.go:352 +0xab6
created by go.etcd.io/etcd/raft.RestartNode
	/tmp/etcd-release-3.4.13/etcd/release/etcd/raft/node.go:240 +0x33c

Questions we need to answer:
It should have been a simple pod restart, so what kind of corruption occurred? What was etcd expecting, and why did it panic if there is no corruption? (The sketch below mirrors the check that fires.)
Why does this happen only after scale-up, once the other 2 members have joined?
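
On the second question, the panic comes from the commitTo check in etcd's raft package (the raft/log.go:203 frame in the trace). Below is a simplified, self-contained mirror of that check, paraphrased from etcd 3.4 with the fields flattened for illustration:

package main

import "log"

// raftLog is a stripped-down stand-in for etcd's raftLog: just the two values
// the failing invariant compares.
type raftLog struct {
	committed uint64
	lastIndex uint64 // 0 for etcd-main-0 after the restart, per the logs above
}

// commitTo mirrors the check at raft/log.go:203: a follower may only advance
// its commit index up to the last entry it actually has in its local log.
func (l *raftLog) commitTo(tocommit uint64) {
	// never decrease commit
	if l.committed < tocommit {
		if l.lastIndex < tocommit {
			log.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?", tocommit, l.lastIndex)
		}
		l.committed = tocommit
	}
}

func main() {
	// etcd-main-0 restarted with an empty raft log (commit index 0, lastindex 0)...
	l := &raftLog{committed: 0, lastIndex: 0}
	// ...and then a heartbeat from the leader said index 22 is already committed.
	l.commitTo(22) // panics with the same message as in the etcd logs above
}

So etcd expects its local raft log to contain at least the entries the leader reports as committed; restarting at commit index 0 while the other members are already at index 22 violates that expectation, and raft treats it as a corrupted, truncated, or lost log rather than silently re-fetching the missing entries.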

@ishan16696 ishan16696 added the kind/bug Bug label May 3, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jan 10, 2024
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Sep 18, 2024