v1.31.2+k3s1 Failed to get logs on some specific nodes of the cluster #11847
-
Environmental Info:

Node(s) CPU architecture, OS, and Version: …

Cluster Configuration: …

Describe the bug:

What is happening is that, depending on which server is configured in my …, … Where … What I have tested: …
Here is the config:

```yaml
kube-controller-manager-arg:
  - bind-address=0.0.0.0
kube-proxy-arg:
  - metrics-bind-address=0.0.0.0
kube-scheduler-arg:
  - bind-address=0.0.0.0
etcd-expose-metrics: true
cluster-init: true
kubelet-arg:
  - container-log-max-files=5
  - container-log-max-size=20Mi
secrets-encryption: true
disable:
  - servicelb
  - traefik
```

Node 2 is very similar, but without the … option. If I try to get logs from other nodes, things work fine; it is just this weird problem that happens on some specific server nodes and agents. The weird thing is why I can still access the node using …

Steps To Reproduce: …
Expected behavior: All server nodes should be able to communicate equally with all other nodes.

Actual behavior: Some server nodes can communicate while others can't.

Additional context / logs:
-
Agents need to be able to connect to ALL of the servers. This is because the agent creates websocket tunnels to the server, and the servers use these tunnels to connect back to the kubelet in order to handle … requests.

You're also a couple of months out of date; update to a newer release and see if you can still reproduce this.
-
The logs show it only connecting to one of the three addresses. Why is this node unable to connect to the other two? Are you able to successfully test all three addresses with …?

Servers will not get "randomly" removed from that list. If they are removed from the list, then the apiserver is not functional on that node. Check the server logs to see what else is going on in that time frame.

Agents should always have an active connection to the proxy endpoint on all servers. If they do not, that server will not be able to connect to them.
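As a rough way to check supervisor reachability from the affected node, the loop below probes each server's supervisor port. This is a sketch under assumptions: the addresses are placeholders for your three server IPs, the default supervisor port 6443 is in use, and the supervisor's `/ping` endpoint (which normally answers `pong`) is reachable.

```shell
# Placeholder server addresses; substitute your three server IPs.
SERVER_IPS="10.20.49.1 10.20.49.2 10.20.49.3"
RESULTS=""
for ip in $SERVER_IPS; do
  # -k: the supervisor serves a self-signed cert; --max-time keeps failures fast
  if curl -ks --max-time 2 "https://${ip}:6443/ping" | grep -q pong; then
    RESULTS="${RESULTS}OK   ${ip}:6443
"
  else
    RESULTS="${RESULTS}FAIL ${ip}:6443
"
  fi
done
printf '%s' "$RESULTS"
```

Any `FAIL` line identifies a server this node cannot reach, which would explain a missing tunnel to that server.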
-
Yes, pinging the three mngr nodes from both … works. At the same time as the error logs from …:

[collapsed log details]

This looks to me like …

[collapsed log details]
-
K3s panicked because it wasn't able to update one of the controller leases in etcd within the expected timeout. This generally indicates that your disks cannot support the load you're putting on them. If etcd isn't stable and performing within expected parameters, nothing else will work right either.

Is this the first time you're noticing that k3s on your server nodes is crashing and restarting? Are you not monitoring this anywhere?
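One way to sanity-check the disk theory is the fio-based fdatasync latency test from the etcd hardware guidance (p99 fdatasync latency should stay roughly under 10ms). A guarded sketch, assuming fio may or may not be installed, and using `/tmp` as a stand-in for the real etcd data directory (on a k3s server that lives under `/var/lib/rancher/k3s/server/db`):

```shell
# Benchmark fsync latency on the disk backing TARGET_DIR.
# /tmp is a placeholder; point this at the etcd data directory in practice.
TARGET_DIR="${TARGET_DIR:-/tmp}"
if command -v fio >/dev/null 2>&1; then
  OUT=$(fio --name=etcd-fsync --directory="$TARGET_DIR" \
            --rw=write --ioengine=sync --fdatasync=1 \
            --bs=2300 --size=4m 2>&1) \
    && echo "fio completed" || echo "fio failed"
else
  OUT="fio not installed; skipping benchmark"
  echo "$OUT"
fi
```

Look at the reported fsync/fdatasync percentiles in the fio output; if p99 is well above 10ms, etcd lease renewals will intermittently time out under load.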
-
Disk performance could very well be the reason, since we already know that this is an issue for our nodes. We are not directly monitoring k3s on our servers at the moment. The up metrics for the kube-apiserver, however, show 3 instances in the last two months with about 1 min of downtime, which is probably the k3s service restarting. So far we did not have any alerts and thus did not notice this.

I am still not sure, though, whether this is also the reason for the initial problem, or whether it is a completely different and unrelated issue, because there is no time correlation between those two.

I also tested again whether a restart of the problematic k3s-agent service would cause the problem to …

We will discuss and increase our monitoring of the k8s core components for now, and probably add some kind of monitoring for the k3s services. We'll also see if we can increase our disk performance. If the issue arises again, I'll see if I can gather more logs and data to provide here.

Big thank you for your time and very valuable help!
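For the service-restart monitoring, a minimal sketch using systemd's restart counter; this assumes a systemd-managed unit named `k3s` (on agents the unit is `k3s-agent`) and falls back gracefully where systemd or the unit is unavailable:

```shell
# Query how many times systemd has restarted the unit since it was started.
# UNIT is an assumption: "k3s" on servers, "k3s-agent" on agents.
UNIT="${UNIT:-k3s}"
RESTARTS=$(systemctl show "$UNIT" -p NRestarts 2>/dev/null)
case "$RESTARTS" in
  NRestarts=*) : ;;                      # got a real value, e.g. NRestarts=3
  *) RESTARTS="NRestarts=unknown" ;;     # no systemd / unit not found
esac
echo "$RESTARTS"
```

Scraping this counter periodically (or alerting on the journal's "Scheduled restart job" messages) would have surfaced the three restarts much earlier than the apiserver `up` metric did.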
-
I'm in a similar situation. AFAICS, it seems like at some point a process demands high resources, systemd-oomd kills something, and from then onwards the agent starts giving these errors.
-
We are observing similar behavior on v1.33.1+k3s. Sporadically, all pods on a single node respond with "proxy error from 127.0.0.1:6443 while dialing 10.20.49.100:10250, code 502: 502 Bad Gateway" when streaming logs. No OOM kills or other errors are present in the logs aside from this message in journald. Server metrics show no anomalies.
-
Check the logs on the node that you are unable to retrieve logs from. This message indicates that the websocket tunnel from the node running the pod, to the node running the apiserver, has been disconnected. |
-
Thank you for your prompt reply. I wonder if there is a specific log string or event I can query to detect websocket tunnel disconnection? We use the same setup across many smaller clusters where developers don’t consistently open log streams. I want to configure alerting for immediate notification to help isolate under what conditions this occurs. |
-
You should see messages like this: …

The … Once it does reconnect, you will see additional messages regarding the server health. If the nodes are running with …
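The exact log excerpts from this reply did not survive, so as a hypothetical detection sketch: the tunnel messages come from the remotedialer component, and the sample journald lines below are illustrative only (the wording in your k3s version may differ), so verify the grep pattern against your own `journalctl -u k3s-agent` output before alerting on it.

```shell
# Illustrative sample of agent journald lines around a tunnel drop;
# these lines are assumptions, not verbatim k3s output.
SAMPLE='Feb 25 07:49:01 node1 k3s[842]: msg="Connecting to proxy" url="wss://10.20.49.2:6443/v1-k3s/connect"
Feb 25 07:49:02 node1 k3s[842]: msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure)"
Feb 25 07:49:07 node1 k3s[842]: msg="Remotedialer connected to proxy" url="wss://10.20.49.2:6443/v1-k3s/connect"'
# Count tunnel errors; in production, replace the sample with e.g.:
#   journalctl -u k3s-agent --since "1 hour ago"
DISCONNECTS=$(printf '%s\n' "$SAMPLE" | grep -c 'Remotedialer proxy error')
echo "tunnel errors: $DISCONNECTS"
```

An alert could fire whenever this count is nonzero over a window, giving immediate notice of a tunnel disconnect even when no one is streaming logs.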
-
This discussion was converted from issue #11846 on February 25, 2025 07:49.