"failed to reserve container name" #7690
Comments
Oh yeah, there are new containers created over and over, but at all times, there are only two:
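(Roughly how the containers can be listed on the node with the crictl bundled in k3s; the name filter is just an illustrative example, not the exact command:)

```sh
# List all CRI containers, including ones stuck in Created/Exited state.
# The --name filter is a regex and only an example here.
sudo k3s crictl ps -a --name homeassistant
```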
I was unable to replicate this on a pi4 with v1.25.11+k3s1.
I am going to close this since we're not able to reproduce - we can reopen if new details become available.
@dereknola Thank you so much for going through the trouble of trying to replicate this on actual hardware. I still have this problem with no improvement. To concretise my particular problem: it is my homeassistant deployment that is problematic. Anytime I try to change the deployment, it gets stuck like this. Before, it might have resolved within a couple of hours or a day with luck, but it seems worse now. I tried updating the image yesterday and the rollout hasn't worked for 35 hours:
The failing pod:
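(A sketch of the kind of checks for a stuck rollout; the deployment name and label selector are assumptions, not the real manifest values:)

```sh
# Did the new ReplicaSet ever make progress? (deployment name is an example)
kubectl rollout status deployment/homeassistant --timeout=60s
kubectl rollout history deployment/homeassistant
# Recent state of the failing pod (label selector is an assumption).
kubectl describe pod -l app=homeassistant | tail -n 40
```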
I'm writing again to ask for any kind of pointer on what I could try to investigate. I have found these old issues on containerd that are closed. In particular, the suggestions for mitigation don't work for me: containerd/containerd#4604 (comment) My containerd version seems new enough to also have the fix they are talking about, but I'm wondering about the "k3s1" suffix in my version here: is that a patched build for k3s?
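(For reference, a sketch of checking which containerd/runtime build k3s is actually running, using the crictl and ctr clients that k3s bundles:)

```sh
# Runtime name and version as reported over the CRI.
sudo k3s crictl version
# The same information straight from the embedded containerd.
sudo k3s ctr version
```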
If I look at the host's containerd, I see that it's still trying over and over with the new image here, and there are still two images for the old deployment, which seems weird.
k3s knows only about one:
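(A sketch of comparing the two views, i.e. what containerd holds in the k8s.io namespace used by its CRI plugin versus what the CRI itself reports; the grep pattern is only an example:)

```sh
# Images as containerd stores them for the CRI plugin.
sudo k3s ctr -n k8s.io images ls | grep home-assistant
# Images as the kubelet sees them through the CRI.
sudo k3s crictl images | grep home-assistant
```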
@hedefalk v1.25.6+k3s1 is almost a year old, and the whole v1.25 minor version is end of life. Newer releases have long since been upgraded to containerd v1.7.x. Have you at any point tried upgrading to a version released in the last 12 months?
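(For reference, a sketch of an in-place upgrade via the install script; pinning a channel, or INSTALL_K3S_VERSION, is optional:)

```sh
# Re-running the install script upgrades the existing k3s binary in place.
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable sh -
```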
I'm hitting this still on v1.30.4+k3s1. I just now used raspberry-pi-imager to put an entirely new Debian system on an SSD attached to another node, "pi2". I attached it to my PoE switch and joined the cluster with k3sup:
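(Roughly this shape of command; the IPs and user below are placeholders, not the exact invocation:)

```sh
# Join the freshly imaged node to the existing server (values are placeholders).
k3sup join --ip 192.168.1.12 --server-ip 192.168.1.10 --user pi
```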
Stuff immediately gets scheduled but is stuck:
Looking at the traefik pod events:
I recognized the pattern and found this old ticket :) On the node I see something similar to before:
I guess there's really nothing new here and I hate to re-open without any new info, but maybe I could get some direction on what info I could try to dig into? I'd love to get this working… Update:
Oh. Never mind me, the reason was a really slow disk because I was accidentally booting from the network. I hadn't successfully fixed the boot order on this node, and it was running off NFS from another machine.
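(A sketch of quick checks for this kind of problem, i.e. whether the root filesystem is actually local and how fast it writes; the path is the default k3s data directory:)

```sh
# Confirm the root filesystem really lives on the SSD and not on an NFS mount.
findmnt -T /
# Rough write-throughput check on the k3s data directory.
sudo dd if=/dev/zero of=/var/lib/rancher/k3s/ddtest bs=1M count=256 oflag=direct status=progress
sudo rm /var/lib/rancher/k3s/ddtest
```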
I get this issue on EKS 1.29 with the latest AL2 or AL2023 AMIs, and it has started to dominate our clusters since the 1.27 to 1.29 migration. Basically, we use a single-node setup for many Spring Boot services; when more than a certain number of them start, this happens and I get containers stuck in a container runtime error. When I log in to the node things look normal, no special messages. If I wait around 30 minutes, a restart of the containers in this state works, but if I restart earlier it doesn't.
Environmental Info:
K3s Version:
k3s version v1.25.6+k3s1 (9176e03)
go version go1.19.5
Node(s) CPU architecture, OS, and Version:
Linux pi1 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux
Cluster Configuration:
Two RPI 4 8GB, 1 server, 1 agent
Describe the bug:
For one of my deployments, I repeatedly get the error "failed to reserve container name xxx, is reserved for yyy", because the name seems to be reserved by a previous attempt that might have timed out.
Sometimes it seems it actually deploys after a couple of hours, but iterating over configs makes this a nightmare.
Here are the events of the pod:
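(Pulled with something like the following; the pod name is a placeholder:)

```sh
# Events for the failing pod, in chronological order (pod name is a placeholder).
kubectl get events --field-selector involvedObject.name=homeassistant-xxx --sort-by=.lastTimestamp
```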
If I look at the host machine's containerd I can see that there are actually two containers matching the ids of the kubelet error messages:
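(One manual cleanup that can sometimes unstick a name reservation like this, sketched here; the container ID is a placeholder, and this should only be done for a container that is clearly dead:)

```sh
# Inspect, then stop and remove the half-created container holding the name.
sudo k3s crictl inspect yyy | head -n 40
sudo k3s crictl stop yyy
sudo k3s crictl rm yyy
```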
After reading other reports on containerd, my feeling is that there is some kind of timeout mismatch between the kubelet and containerd, so that the kubelet retries too early and everything gets congested.
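(If it really is a timeout mismatch, one knob to experiment with is the kubelet's runtime request timeout, which k3s can pass through with --kubelet-arg; the value below is only an example, not a recommendation:)

```sh
# Sketch: raise the kubelet's per-request timeout against the CRI runtime.
k3s server --kubelet-arg=runtime-request-timeout=10m
```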
Steps To Reproduce:
This is the deployment; it's really nothing special:
The only thing that comes to mind is that I'm using Longhorn for persistence; looking at #2312, I got the feeling that people had problems with slow NFS.
Additional context / logs:
Similar but closed: #2312
Similar problems on GKE with containerd: containerd/containerd#4604