v1.31.2+k3s1 Failed to get logs on some specific nodes of the cluster #11847
-
Environmental Info:

Node(s) CPU architecture, OS, and Version: …

Cluster Configuration: …

Describe the bug:

What is happening is that, depending on which server is configured in my …, … Where … What I have tested: …
Here is the config:

```yaml
kube-controller-manager-arg:
  - bind-address=0.0.0.0
kube-proxy-arg:
  - metrics-bind-address=0.0.0.0
kube-scheduler-arg:
  - bind-address=0.0.0.0
etcd-expose-metrics: true
cluster-init: true
kubelet-arg:
  - container-log-max-files=5
  - container-log-max-size=20Mi
secrets-encryption: true
disable:
  - servicelb
  - traefik
```

Node 2 is very similar, but without the … option. If I try to get logs from other nodes, things work fine; it is just this weird problem that happens on some specific server nodes and agents. The weird thing is why I can still access the node using …

Steps To Reproduce: …
Expected behavior: All server nodes should be able to communicate equally with all other nodes.

Actual behavior: Some server nodes can communicate while others can't.

Additional context / logs:
-
Agents need to be able to connect to ALL of the servers. This is because the agent creates websocket tunnels to the server, and the servers use these tunnels to connect back to the kubelet in order to handle … requests.

You're also a couple of months out of date; update to a newer release and see if you can still reproduce this.
-
The logs show it only connecting to one of the three addresses. Why is this node unable to connect to the other two? Are you able to successfully test all three addresses with …?

Servers will not get "randomly" removed from that list. If they are removed from the list, then the apiserver is not functional on that node. Check the server logs to see what else is going on in that time frame.

Agents should always have an active connection to the proxy endpoint on all servers. If they do not, that server will not be able to connect to them.
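As a rough way to check supervisor reachability from the affected node, the loop below probes each server's supervisor port. This is a sketch under assumptions: the addresses are placeholders for your three server IPs, the default supervisor port 6443 is in use, and the supervisor's `/ping` endpoint (which normally answers `pong`) is reachable.

```shell
# Placeholder server addresses; substitute your three server IPs.
SERVER_IPS="10.20.49.1 10.20.49.2 10.20.49.3"
RESULTS=""
for ip in $SERVER_IPS; do
  # -k: the supervisor serves a self-signed cert; --max-time keeps failures fast
  if curl -ks --max-time 2 "https://${ip}:6443/ping" | grep -q pong; then
    RESULTS="${RESULTS}OK   ${ip}:6443
"
  else
    RESULTS="${RESULTS}FAIL ${ip}:6443
"
  fi
done
printf '%s' "$RESULTS"
```

Any `FAIL` line identifies a server this node cannot reach, which would explain a missing tunnel to that server.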
-
Yes, pinging the three mngr nodes from both … works. At the same time as the error logs from …:

[collapsed log details]

This looks to me like …

[collapsed log details]
-
K3s panicked because it wasn't able to update one of the controller leases in etcd within the expected timeout. This generally indicates that your disks cannot support the load you're putting on them. If etcd isn't stable and performing within expected parameters, nothing else will work right either.

Is this the first time you're noticing that k3s on your server nodes is crashing and restarting? Are you not monitoring this anywhere?
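One way to sanity-check the disk theory is the fio-based fdatasync latency test from the etcd hardware guidance (p99 fdatasync latency should stay roughly under 10ms). A guarded sketch, assuming fio may or may not be installed, and using `/tmp` as a stand-in for the real etcd data directory (on a k3s server that lives under `/var/lib/rancher/k3s/server/db`):

```shell
# Benchmark fsync latency on the disk backing TARGET_DIR.
# /tmp is a placeholder; point this at the etcd data directory in practice.
TARGET_DIR="${TARGET_DIR:-/tmp}"
if command -v fio >/dev/null 2>&1; then
  OUT=$(fio --name=etcd-fsync --directory="$TARGET_DIR" \
            --rw=write --ioengine=sync --fdatasync=1 \
            --bs=2300 --size=4m 2>&1) \
    && echo "fio completed" || echo "fio failed"
else
  OUT="fio not installed; skipping benchmark"
  echo "$OUT"
fi
```

Look at the reported fsync/fdatasync percentiles in the fio output; if p99 is well above 10ms, etcd lease renewals will intermittently time out under load.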
-
Disk performance could very well be the reason, since we already know that this is an issue for our nodes. We are not directly monitoring k3s on our servers at the moment. The up metrics for the kube-apiserver, however, show 3 instances in the last two months with about 1 min of downtime, which is probably the k3s service restarting. So far we did not have any alerts and thus did not notice this.

I am still not sure, though, whether this is also the reason for the initial problem, or whether it is a completely different and unrelated issue, because there is no time correlation between those two.

I also tested again whether a restart of the problematic k3s-agent service would cause the problem to …

We will discuss and increase our monitoring of the k8s core components for now, and probably add some kind of monitoring for the k3s services. We'll also see if we can increase our disk performance. If the issue arises again, I'll see if I can gather more logs and data to provide here.

Big thank you for your time and very valuable help!
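For the service-restart monitoring, a minimal sketch using systemd's restart counter; this assumes a systemd-managed unit named `k3s` (on agents the unit is `k3s-agent`) and falls back gracefully where systemd or the unit is unavailable:

```shell
# Query how many times systemd has restarted the unit since it was started.
# UNIT is an assumption: "k3s" on servers, "k3s-agent" on agents.
UNIT="${UNIT:-k3s}"
RESTARTS=$(systemctl show "$UNIT" -p NRestarts 2>/dev/null)
case "$RESTARTS" in
  NRestarts=*) : ;;                      # got a real value, e.g. NRestarts=3
  *) RESTARTS="NRestarts=unknown" ;;     # no systemd / unit not found
esac
echo "$RESTARTS"
```

Scraping this counter periodically (or alerting on the journal's "Scheduled restart job" messages) would have surfaced the three restarts much earlier than the apiserver `up` metric did.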
-
I'm in a similar situation. AFAICS, it seems like at some point a process demands high resources, systemd-oomd kills something, and from then onwards the agent starts giving these errors.
-
We are observing similar behavior on v1.33.1+k3s. Sporadically, all pods on a single node respond with "proxy error from 127.0.0.1:6443 while dialing 10.20.49.100:10250, code 502: 502 Bad Gateway" when streaming logs. No OOM kills or other errors are present in the logs aside from this message in journald. Server metrics show no anomalies.
-
Check the logs on the node that you are unable to retrieve logs from. This message indicates that the websocket tunnel from the node running the pod, to the node running the apiserver, has been disconnected. |
-
Thank you for your prompt reply. I wonder if there is a specific log string or event I can query to detect websocket tunnel disconnection? We use the same setup across many smaller clusters where developers don’t consistently open log streams. I want to configure alerting for immediate notification to help isolate under what conditions this occurs. |
-
You should see messages like this: …

The … Once it does reconnect, you will see additional messages regarding the server health. If the nodes are running with …
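The exact log excerpts from this reply did not survive, so as a hypothetical detection sketch: the tunnel messages come from the remotedialer component, and the sample journald lines below are illustrative only (the wording in your k3s version may differ), so verify the grep pattern against your own `journalctl -u k3s-agent` output before alerting on it.

```shell
# Illustrative sample of agent journald lines around a tunnel drop;
# these lines are assumptions, not verbatim k3s output.
SAMPLE='Feb 25 07:49:01 node1 k3s[842]: msg="Connecting to proxy" url="wss://10.20.49.2:6443/v1-k3s/connect"
Feb 25 07:49:02 node1 k3s[842]: msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure)"
Feb 25 07:49:07 node1 k3s[842]: msg="Remotedialer connected to proxy" url="wss://10.20.49.2:6443/v1-k3s/connect"'
# Count tunnel errors; in production, replace the sample with e.g.:
#   journalctl -u k3s-agent --since "1 hour ago"
DISCONNECTS=$(printf '%s\n' "$SAMPLE" | grep -c 'Remotedialer proxy error')
echo "tunnel errors: $DISCONNECTS"
```

An alert could fire whenever this count is nonzero over a window, giving immediate notice of a tunnel disconnect even when no one is streaming logs.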
-
This discussion was converted from issue #11846 on February 25, 2025 07:49.