Swapoff causes system freeze in RKE2 worker nodes #3892
Comments
I'm not aware of anything that would cause a node to hard-lock when it runs out of memory. The usual scenario is that it starts OOM-killing processes until more memory is available. I suspect that perhaps you've over-provisioned the VMs by allocating more memory than you actually have available, and when they attempt to use 100% of it, Hyper-V begins thrashing and is unable to actually make that memory available to them?
I get that it's easy to blame Hyper-V here, but I've been running these apps for a year in the same cluster without any issue whatsoever, until disabling swap, and re-enabling swap resolves the issue. Also, I have read that k8s runs fine in Azure, although I personally have not deployed any clusters there. They're definitely not over-provisioned: k8s resource requests, limits, and actual usage are all far below the actual physical resource limits. I have a monitoring solution in place that monitors the CPU / RAM / disk activity of the VMs, and those are also well below limits. I also have a lot of other Windows and Linux resources running without issue in my Hyper-V rack, which is a 2-node cluster consisting of 80 (eighty) cores, 1 TB of RAM, and a SAN on a 10Gb network, so there's no shortage of physical resources here.
I'm not sure where to point you then, other than setting up some syslog forwarding or something else that would get the logs off the box as it's hanging. The kernel or VM becoming unresponsive is definitely below the level of anything RKE2 would handle. Swap is just a crutch for nodes under memory pressure; if the node hangs under memory pressure without it, then you have some pretty low-level problems... and I'd point at the virtualization layer before I pointed at the kernel.
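(For reference, one minimal way to do that kind of log forwarding on an Ubuntu node is sketched below. The collector address 192.168.1.50, the node address 192.168.1.10, the interface eth0, and the MAC address are all placeholders; the netconsole step is optional and only helps if the kernel is still alive enough to emit messages while the node hangs.)

```
# Forward all syslog traffic to a remote collector so logs survive a hang.
# "@@" = TCP; a single "@" would use UDP. 192.168.1.50 is a placeholder.
echo '*.* @@192.168.1.50:514' | sudo tee /etc/rsyslog.d/90-forward.conf
sudo systemctl restart rsyslog

# Optionally also stream kernel messages over the network with netconsole,
# which can capture the last kernel output even when userspace is frozen.
# Format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac -- every value here is
# an example and must match your node, your collector, and its (or the
# gateway's) MAC address.
sudo modprobe netconsole \
  netconsole=6665@192.168.1.10/eth0,6666@192.168.1.50/aa:bb:cc:dd:ee:ff
```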
OK, you raise a rather valid point. I will investigate. Thanks for your input!
Hey @ktivdv. Did you find the reason for this swap problem?
Hello, I've observed the following and I'm wondering whether it's normal (also I hope this is the right place for this report):
Environment:
Ubuntu VMs (both 22.04 and 20.04 tested) running on Hyper-V
Rancher 2.6.9 and 2.7.1
RKE2 Rancher-deployed clusters, both 1.23 and 1.24
Various storage solutions - Longhorn, smb.csi, NFS
I've been running the same apps for nearly a year without issue. We had a hardware disaster event which required me to redeploy most of the nodes in my cluster. One of the Kubernetes node requirements I had previously overlooked was to disable swap, so I decided to do that on all the new nodes. Some days later, I started getting nodes freezing once every 1-2 days: a complete system halt with a frozen console, requiring a hard reset. This would happen on only some of the nodes. I've even seen them freeze again after having been back up for only 5-10 minutes. I then spun up a completely new cluster, and the same thing started happening once I put any significant load on it.
I have been troubleshooting this for weeks, and then today I happened to see that during boot, one of the nodes actually tries to start swap and fails. Puzzled, since swap should not have been enabled at all, I reconfirmed that swap had been completely disabled on that node, and indeed it was. Still, thankfully I was given this clue. I then checked the status of swap on all the nodes, and I saw the pattern: my swap disabling had actually not been successful on some of the nodes, and those were the nodes that never froze. In other words, swap was still working on some nodes, and those nodes never froze. Nodes that had swap disabled would always freeze, eventually. I confirmed that I had followed the very straightforward instructions for disabling swap, which seem to be the same all over the internet (for example):
https://discuss.kubernetes.io/t/swap-off-why-is-it-necessary/6879
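(Those instructions boil down to roughly the steps below; this is a generic sketch, and the exact /etc/fstab entry or swap file name will vary by image.)

```
# Turn off all active swap immediately (effective until the next reboot).
sudo swapoff -a

# Comment out any swap entries in /etc/fstab so swap stays off after a reboot.
# The pattern is a generic one; review the file afterwards to confirm.
sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab

# Images that use a swap file or a systemd swap unit may need extra steps,
# for example (paths are illustrative only):
#   sudo rm /swap.img
#   sudo systemctl mask swap.target
```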
I searched for any clues about this situation, and I can't see that I did anything incorrect, nor can I find anyone else experiencing the same thing. So, in sum, it's a documented requirement to disable swap, yet disabling swap kills my clusters. I have since re-enabled it on all the nodes and the clusters are perfectly stable now. I'm definitely leaving swap on, and I see there's now some support for swap in k8s, albeit perhaps still in alpha... but does it make any sense that swapoff would cause this freezing? Thank you!
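(On that last point: swap support in Kubernetes is gated behind the NodeSwap feature gate, which was still alpha in the 1.23/1.24 releases mentioned above, and the kubelet also has to be told not to refuse to run with swap enabled. A rough sketch of wiring that into an RKE2 node is below; the config path is the RKE2 default, the flags are standard kubelet arguments, and whether this is advisable at all depends on the version in use.)

```
# Append kubelet arguments to the RKE2 config (default path shown).
# fail-swap-on=false lets the kubelet start on a node with swap enabled;
# the NodeSwap feature gate (alpha in 1.23/1.24) enables swap-aware behaviour.
sudo tee -a /etc/rancher/rke2/config.yaml <<'EOF'
kubelet-arg:
  - "fail-swap-on=false"
  - "feature-gates=NodeSwap=true"
EOF

# Restart the agent (rke2-server on server nodes) so the change takes effect.
sudo systemctl restart rke2-agent
```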