ETCD keeps throwing context deadline exceeded continuously and it started happening suddenly #17394
Replies: 10 comments 10 replies
-
Hi @ahrtr
-
Hi @ahrtr
-
It's still a network problem. Please manually try
-
Hi @ahrtr,
-
@ahrtr
-
Hi @ahrtr,
-
@rahulbapumore, I'm assuming this is not Kubernetes' etcd, right? It's a StatefulSet with a single replica that you created. What's the Kubernetes version you're using? If there's a network error, knowing what CNI and version you're using would also be helpful.
-
New logs after reproduction
-
This Q&A is a continuation of #17364.
etcd v3.5.7 runs in a container of a Kubernetes pod controlled by a StatefulSet with only one replica. The deployment works fine for a few days (2-3), but then etcd suddenly starts throwing context deadline exceeded errors and no etcdctl command works. To recover the deployment, we tried deleting the WAL and snap files (but not the db file), and it did not recover. But as soon as we delete the db file, etcd starts again without an issue. Our primary suspicion is that the db is somehow being corrupted. Such corruption is very risky, because we could lose all data. Do you know how to recover from this? We didn't find much in the logs; they just keep printing context deadline exceeded.
Also, when we exported the ETCDCTL_API=2 environment variable and then ran an etcdctl command, we got the error below:
client: etcd cluster is unavailable or misconfigured; error #0: unsupported protocol scheme "eddc.namespace1"
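A note on this particular message: "unsupported protocol scheme" usually indicates that the endpoint was parsed as a URL without an http:// or https:// prefix, so the host name ended up being read as the scheme. That would make this error a client-side endpoint formatting issue and probably unrelated to the timeouts. A minimal sketch of normalizing an endpoint before handing it to a client that expects full URLs follows; the helper name `ensure_scheme`, the port 2379, and the choice of plain http are assumptions for illustration, not taken from this deployment.

```python
def ensure_scheme(endpoint: str, default_scheme: str = "http") -> str:
    """Return the endpoint with an explicit URL scheme.

    Hypothetical helper: if the endpoint already contains "://" it is
    returned unchanged, otherwise default_scheme is prepended.
    """
    if "://" in endpoint:
        return endpoint
    return f"{default_scheme}://{endpoint}"


# Assumed example value; the real service name and port may differ.
print(ensure_scheme("eddc.namespace1:2379"))
```

If the cluster uses TLS, the scheme would need to be https instead.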
We tried running bbolt check on the same deployment; it printed OK, which means there is no corruption in the database.
We need some help recovering this deployment.
The current status is that the etcdctl member list command works fine, and no etcd restarts were seen.
But the etcdctl put, get, endpoint status, endpoint health, and alarm list commands fail with a timeout error.
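Since a reply in this thread attributes the timeouts to the network, one basic thing to rule out is whether a TCP connection to the client endpoint opens at all within a deadline. The following is only a sketch using the Python standard library; the function name `can_connect` is hypothetical, and the host and port in the usage comment are assumptions. It does not speak the etcd protocol, so a successful result only shows that the port is reachable, not that etcd is serving requests.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        # create_connection resolves the name and applies the timeout to the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call (assumed service name and default etcd client port):
# can_connect("eddc.namespace1", 2379)
```

Running the check from inside the etcd pod and from a client pod can help separate a pod-local problem from a service or CNI problem.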