etcd member removal fails in 3 node cluster #13498
Description
Environmental Info:
K3s Version:
k3s 1.33.7
Node(s) CPU architecture, OS, and Version:
n/a
Cluster Configuration:
3 etcd+cp nodes
Describe the bug:
When removing nodes from the cluster, the cluster becomes unhealthy after the second node is removed. The etcd logs show leader-election failures.
This does not happen consistently; I have been able to reproduce the issue in roughly 3 out of 10 test runs. Log collection is somewhat difficult since this is done via an automated test framework, but the end of the k3s logs shows etcd stuck in a loop where leader election keeps failing.
Steps To Reproduce:
- Install K3s
- Set up a 3-node cluster (all nodes etcd + control-plane)
- Remove 2 nodes
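The steps above can be sketched roughly as follows. The node names are hypothetical, and the removal sequence shown (`kubectl delete node` followed by uninstalling k3s on that host) is one common approach, not necessarily the exact path the test framework takes:

```shell
# Remove the second server node from the cluster, then tear down k3s on it.
# Node names (server-2, server-3) are placeholders for illustration.
kubectl delete node server-2
ssh server-2 'k3s-uninstall.sh'

# Repeat for the third server node; after this, the cluster sometimes
# becomes unresponsive and etcd loops on leader election.
kubectl delete node server-3
ssh server-3 'k3s-uninstall.sh'
```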
Expected behavior:
k3s removes the corresponding member from etcd when a node is removed from the cluster.
Actual behavior:
After the removal of 2 nodes, the cluster becomes unresponsive. The k3s logs show etcd stuck in a loop trying to elect a leader:
k3s[28091]: time="2026-01-23T23:51:40Z" level=info msg="Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""
k3s[28091]: {"level":"info","ts":"2026-01-23T23:51:40.971015Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8adc34f2f139fc91 is starting a new election at term 5"}
k3s[28091]: {"level":"info","ts":"2026-01-23T23:51:40.971355Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8adc34f2f139fc91 became pre-candidate at term 5"}
k3s[28091]: {"level":"info","ts":"2026-01-23T23:51:40.971370Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8adc34f2f139fc91 received MsgPreVoteResp from 8adc34f2f139fc91 at term 5"}
k3s[28091]: {"level":"info","ts":"2026-01-23T23:51:40.971382Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8adc34f2f139fc91 [logterm: 5, index: 15800] sent MsgPreVote request to d082830930c9dd1c at term 5"}
k3s[28091]: {"level":"warn","ts":"2026-01-23T23:51:42.219939Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d082830930c9dd1c","rtt":"0s","error":"dial tcp 192.168.0.52:2380: connect: no route to host"}
k3s[28091]: {"level":"warn","ts":"2026-01-23T23:51:42.220105Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d082830930c9dd1c","rtt":"0s","error":"dial tcp 192.168.0.52:2380: connect: no route to host"}
k3s[28091]: time="2026-01-23T23:51:45Z" level=error msg="Failed to check local etcd status for learner management: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""
k3s[28091]: time="2026-01-23T23:51:45Z" level=info msg="Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""
k3s[28091]: {"level":"info","ts":"2026-01-23T23:51:46.471187Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8adc34f2f139fc91 is starting a new election at term 5"}
k3s[28091]: {"level":"info","ts":"2026-01-23T23:51:46.471443Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8adc34f2f139fc91 became pre-candidate at term 5"}
k3s[28091]: {"level":"info","ts":"2026-01-23T23:51:46.471471Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8adc34f2f139fc91 received MsgPreVoteResp from 8adc34f2f139fc91 at term 5"}
k3s[28091]: {"level":"info","ts":"2026-01-23T23:51:46.471489Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8adc34f2f139fc91 [logterm: 5, index: 15800] sent MsgPreVote request to d082830930c9dd1c at term 5"}
k3s[28091]: {"level":"warn","ts":"2026-01-23T23:51:47.221072Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d082830930c9dd1c","rtt":"0s","error":"dial tcp 192.168.0.52:2380: connect: no route to host"}
k3s[28091]: {"level":"warn","ts":"2026-01-23T23:51:47.221226Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d082830930c9dd1c","rtt":"0s","error":"dial tcp 192.168.0.52:2380: connect: no route to host"}
k3s[28091]: time="2026-01-23T23:51:50Z" level=info msg="Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""
Additional context / logs:
Since vote requests are only being sent to a single "ghost" member, I assume that one of the removed nodes was successfully dropped from the etcd member list while the other remained. A similar issue was reported previously in #12908, so I am wondering whether there could be a problem with my local k3s setup, and whether there are better steps I could take to diagnose the issue.
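One way to confirm the "ghost member" theory is to inspect the etcd member list directly on a surviving server node. k3s does not ship `etcdctl`, so it has to be installed separately; the certificate paths below are the default k3s locations:

```shell
# List etcd members to check whether the removed nodes' entries remain.
# Cert/key paths are the k3s defaults under /var/lib/rancher/k3s.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  member list -w table
```

If the output still lists the peer ID from the logs (`d082830930c9dd1c`) after the node was deleted, that would confirm the member removal did not complete.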