ETCD keeps throwing context deadline exceeded continuously and it started happening suddenly #17394

rahulbapumore · 2024-02-08T05:06:52Z

rahulbapumore
Feb 8, 2024

This Q&A is a continuation of #17364.
etcd v3.5.7 is running in a container of a Kubernetes pod controlled by a StatefulSet with only one replica. The deployment works fine for a few days (2-3), but then etcd suddenly starts throwing context deadline exceeded errors and no etcdctl command works. To recover the deployment, we tried deleting the WAL and snap files (but not the db file), and it did not recover. As soon as we delete the db file, etcd starts again without an issue. Our primary suspect is that the db is somehow getting corrupted on its own. This is very risky, because deleting the db means losing all data. Do you know how to recover from this? We didn't find much in the logs; they just keep printing context deadline exceeded.

Also, when we exported the ETCDCTL_API=2 environment variable and then ran an etcdctl command, we got the error below:

client: etcd cluster is unavailable or misconfigured; error #0: unsupported protocol scheme "eddc.namespace1"

We tried running bbolt check on the same deployment; it printed OK, which means there is no corruption in the database.
We need some help recovering this deployment.
The current status is that the etcdctl member list command works fine, and no restarts of etcd were seen.
But the etcdctl put/get/endpoint status/endpoint health/alarm list commands fail with a timeout error.
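
A note on the ETCDCTL_API=2 error above: the message `unsupported protocol scheme "eddc.namespace1"` most likely comes from the v2 client reading the text before the colon in a bare `host:port` endpoint as a URL scheme, so it probably says nothing about the state of the database. A minimal Python sketch of the idea, assuming the endpoint was passed without `http://` or `https://` (the helper name `normalize_endpoint` and the port in the example are illustrative, not taken from the report):

```python
def normalize_endpoint(endpoint: str, default_scheme: str = "http") -> str:
    """Return the endpoint with an explicit URL scheme.

    Hypothetical helper: a bare "host:port" string gets default_scheme
    prepended; an endpoint that already has a scheme is left unchanged.
    """
    endpoint = endpoint.strip()
    if "://" in endpoint:
        return endpoint
    return f"{default_scheme}://{endpoint}"


# Example with the host name from the error message (port is assumed).
print(normalize_endpoint("eddc.namespace1:2379"))
```

Whether the v2 API is reachable at all on this deployment is a separate question that this sketch does not answer.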

rahulbapumore · 2024-02-08T14:39:03Z

rahulbapumore
Feb 8, 2024
Author

Hi @ahrtr
Could you please help

0 replies

rahulbapumore · 2024-02-09T06:55:19Z

rahulbapumore
Feb 9, 2024
Author

Hi @ahrtr
We checked the "balancer error: connection refused" error in the same deployment and did not find any network blockage.
From inside the pod, we were able to connect to the ip:port mentioned in the log. We are completely blocked: 4 namespaces are impacted, and only the etcdctl member list, member add, and member remove commands work. All other commands, such as put and get, are blocked and just print request timed out after execution:

bash-4.4$ etcdctl put key1 val1
{"level":"warn","ts":"2024-02-08T23:52:01.535-0600","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00001ea80/edce.namespace1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
Error: etcdserver: request timed out

0 replies

rahulbapumore · 2024-02-09T07:00:09Z

rahulbapumore
Feb 9, 2024
Author

logs.log

0 replies

ahrtr · 2024-02-09T10:13:45Z

ahrtr
Feb 9, 2024
Maintainer

It's still a network problem. Please try connecting to 172.30.66.129:2379 manually.

{"attempt":0,"caller":"[email protected]/retry_interceptor.go:62","error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.30.66.129:2379: connect: connection refused\"","logger":"etcd-client","message":"retrying of unary invoker failed","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"edce-0"},"service_id":"edce","severity":"warning","target":"etcd-endpoints://0xc0004f6700/edce.namespace1:2379","timestamp":"2024-02-08T08:29:21.734-06:00","version":"1.2.0"}
{"message":"Error: context deadline exceeded","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"edce-0"},"service_id":"edce","severity":"error","timestamp":"2024-02-08T08:29:21.734-06:00","version":"1.2.0"}
{"caller":"zapgrpc/zapgrpc.go:191","message":"[core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\"","metadata":{"container_name":"dced","namespace":"namespace1","pod_name":"edce-0"},"service_id":"edce","severity":"warning","timestamp":"2024-02-08T08:32:17.534-06:00","version":"1.2.0"}
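
The three log lines above show different failures: the first is a refused TCP dial to the cluster IP, the second is the client giving up on its deadline, and the third is a server-side gRPC write on a connection that was already closing. The `etcdctl put` output earlier in the thread is different again: `etcdserver: request timed out` is an error returned by the server, so that request did reach etcd. A minimal Python sketch for tallying these cases across a log file, assuming one JSON object per line as in the samples; the category labels and function names are this sketch's own, not etcd terminology:

```python
import json
from collections import Counter

# Heuristic categories keyed by a substring of the error text.
# First match wins, so the more specific patterns come first.
PATTERNS = [
    ("connection refused", "dial-failed"),
    ("etcdserver: request timed out", "server-timeout"),
    ("transport is closing", "connection-closed"),
    ("context deadline exceeded", "client-deadline"),
]


def classify(line: str) -> str:
    """Return a coarse category for one JSON log line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return "unparsed"
    if not isinstance(record, dict):
        return "unparsed"
    # The samples carry the text under "error", "message", or "msg".
    text = " ".join(
        str(record.get(key, "")) for key in ("error", "message", "msg")
    )
    for needle, label in PATTERNS:
        if needle in text:
            return label
    return "other"


def summarize(lines) -> Counter:
    """Count categories over an iterable of log lines, skipping blanks."""
    return Counter(classify(line) for line in lines if line.strip())
```

Running `summarize(open(path))` on an attached log would show whether "connection refused" still appears in the later reproduction or whether only server-side timeouts and closed connections remain.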

0 replies

rahulbapumore · 2024-02-09T10:35:04Z

rahulbapumore
Feb 9, 2024
Author

Hi @ahrtr ,
We tried manually, but only from inside the container. It was able to connect to 172.30.66.129:2379, so it's not a network issue.

2 replies

ahrtr Feb 9, 2024
Maintainer

Not sure what your use case is. I guess the communication (client to 172.30.66.129:2379) might cross pod/container or even VM boundaries?

cc @ivanvc are you able to follow up on this discussion? Thanks.

ivanvc Feb 9, 2024
Collaborator

I still need to become an expert on the project. But I can help troubleshoot and narrow down the issue.

rahulbapumore · 2024-02-09T11:20:11Z

rahulbapumore
Feb 9, 2024
Author

@ahrtr
172.30.66.129 is the cluster IP of the service, and etcdctl is trying to connect to it. We have just one replica, so it runs on a single worker node. From inside the container, etcdctl is accessing the cluster IP (the worker node where the pod is scheduled is itself part of the cluster). So, in my view, there shouldn't be any communication gap.

0 replies

rahulbapumore · 2024-02-09T14:02:05Z

rahulbapumore
Feb 9, 2024
Author

Hi @ahrtr ,
We could reproduce this issue. We have etcd running in one container, and the pod for that container is controlled by a StatefulSet, so we have just one replica.
We kept this one replica running for 3 days and the issue was reproduced.
The etcdctl member list command works, but the etcdctl put command does not.
The common error that is constantly printed in the logs is "transport is closing".
So I think this could be a significant bug in version 3.5.7 when etcd is deployed as a single replica.
I am attaching the logs here:
logstoattach.txt

2 replies

ahrtr Feb 10, 2024
Maintainer

> etcdctl member list command is working , etcdctl put command is not working

This doesn't make sense. A couple of suggestions:

- Try a newer version, e.g. 3.5.12.
- Manually execute etcdctl put, and dump all the etcd logs related to that etcdctl put execution.

> so from inside the container etcdctl is trying to access to the clusterip(that worker node where is pod is scheduled itself is the part of cluster)

From inside the container, you can just use 127.0.0.1. Since you are connecting to the cluster IP, it might be a CNI problem.
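
One way to compare the two paths mentioned above is to open a plain TCP connection to each address from inside the container. A minimal Python sketch, assuming etcd listens for clients on port 2379 (as in the logs) and also on the loopback address; `can_connect` is a hypothetical helper name:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (refused, unreachable, or timed out) when it cannot complete.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("127.0.0.1", 2379)` and `can_connect("172.30.66.129", 2379)` from inside the pod would show whether only the cluster IP path is affected. A successful connect only proves the port accepts connections; it does not show that etcd can serve a put or get.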

rahulbapumore Feb 12, 2024
Author

Hi @ahrtr ,
We will try to reproduce the issue on 3.5.12 later, but we already have 3.5.7 installed, and by keeping one replica and leaving it idle we are able to reproduce the issue. The logs just keep printing "transport is closing".
So what exactly is the issue, and how do we recover from it? As I mentioned earlier, it's not a network-related issue.
If you check the latest logs attached to this comment, they don't contain any connection refused error. It's something different, reproducible only with one replica and not with a distributed etcd deployment. Please help with this.
logstoattach.txt

ivanvc · 2024-02-09T20:08:23Z

ivanvc
Feb 9, 2024
Collaborator

@rahulbapumore, I'm assuming this is not Kubernetes' etcd, right? It's a StatefulSet with a single replica that you created. What's the Kubernetes version you're using? If there's a network error, knowing what CNI and version you're using would also be helpful.
Reading your reported issue, I saw you ran bbolt to get diagnostics. What version of bbolt did you use? What was the output (it would also be helpful to check that output)?

6 replies

ivanvc Feb 13, 2024
Collaborator

Can you build bbolt 1.3.8 rather than the main branch and rerun the command? bbolt's check always returns something, either the error or prints OK. So you should always see an output.

rahulbapumore Feb 13, 2024
Author

Hi @ivanvc
We ran the bbolt check command; it printed OK, which means the db is not corrupt.
But something is going wrong with the cluster health of etcd, and it happens on its own when etcd runs as a single instance.
"transport is closing" is the main error message printed continuously in the logs.
Could you please help us understand the issue, or point us to a procedure to recover from it?

Thanks

ivanvc Feb 14, 2024
Collaborator

Did you update to 3.5.12, as Benjamin suggested? What's the Kubernetes version you're using? And the CNI and its version?

rahulbapumore Feb 15, 2024
Author

We haven't moved our codebase to 3.5.12 yet. Do you see any bug or issue in 3.5.7 that might be causing this?
The Kubernetes version is 1.28.
We are using konnectivity-agent.

ivanvc Feb 16, 2024
Collaborator

I would do as Benjamin suggested. You can update to 3.5.12 using the same database. If you care about data recovery, some tools in bbolt (see keys, get) can help you retrieve the data straight from bbolt, and you can later restore it in a new instance. I just checked what version of bbolt etcd 3.5.7 is using, and you should use bbolt 1.3.6 instead. Or, use etcdutl, but as you can't reach the node it may not work.

rahulbapumore · 2024-02-12T04:52:45Z

rahulbapumore
Feb 12, 2024
Author

new logs after reproduction
Uploading logstoattach.txt…

0 replies

rahulbapumore · 2024-02-13T17:56:43Z

rahulbapumore
Feb 13, 2024
Author

@ahrtr @ivanvc
Is there any way to recover the cluster?

0 replies
