CDI Upload Server pod gets terminated by OOM Killer on Talos v1.9.0 #3575
Hey, any chance this setup is using cgroups v1? With cgroups v1, the host's total available memory is taken into account, so overwhelming the pod limits was easy. Anyway, cgroups v1 support has been dropped on the Kubernetes side by now, IIRC. |
I can confirm I'm not using cgroups v1.
I don't have this flag set, so I assume this is using v2. As for the StorageClass, I'm using the Rancher hostPath CSI; I might try other ones like the OpenEBS local PV CSI, though I doubt that changes anything. I'll see if I can try applying that patch to Talos' kernel, so this might eventually get fixed in v1.10.0 whenever that comes out. |
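For reference, a quick way to double-check which cgroup version a node is actually running is to look at the filesystem type mounted at /sys/fs/cgroup. This is a generic sketch (run on the node or from a privileged debug pod), not anything Talos-specific:

# cgroup2fs means cgroups v2; tmpfs means the legacy v1 hierarchy.
stat -fc %T /sys/fs/cgroup/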
Yeah, if you have the sources it should be easy enough to check whether the revert made it in; it's a super small change. |
Another question: what are the vm.dirty_* values for Talos? |
Ran this command to check those values:
|
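(For reference: the values in question are the kernel's dirty-page writeback thresholds. A sketch of how they are typically read, not necessarily the exact command or output from the Talos node above:)

# *_bytes settings take precedence over the *_ratio ones when non-zero.
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes vm.dirty_expire_centisecs vm.dirty_writeback_centisecs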
So usually I wouldn't expect that kernel issue with local storage on the node, but it's possible that the hostpath driver is configured to a path that's backed by something else entirely. I know this is possible with https://github.com/kubevirt/hostpath-provisioner-operator. Usually, the OOMs due to the kernel bug occur with NFS or slow disks. |
I see. I'm using a fairly old SATA SSD as the disk backing that local storage, but I can try a local-storage StorageClass that uses the NVMe boot drive, at least so I can get this to work. Talos' build tooling doesn't seem to play nice with Podman, and I'm currently not at home, so fiddling around with that right now is a bit problematic 😅 |
Just tried using the NVMe and I still get issues. I'll see if I can get that kernel patch installed. |
Hmm, interesting, didn't expect that. Did you set up Rancher to use that disk? |
Yeah definitely cgroupsv2. Can you reproduce this with
|
I don't have that, so here's what I'll test with:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-pvc
spec:
  storageClassName: local-path-ephemeral
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dd-test-thingy
spec:
  selector:
    matchLabels:
      name: dd-test-thingy
  replicas: 1
  strategy:
    type: Recreate
    rollingUpdate: null
  template:
    metadata:
      labels:
        name: dd-test-thingy
    spec:
      containers:
        - name: thingy
          resources:
            limits:
              memory: 600M
          image: quay.io/centos/centos:stream9
          command: ["/bin/bash"]
          args: ["-c", "while true; do sleep 50; done"] # I'll just kubectl exec into it and run dd so I can properly look at it do stuff
          volumeMounts:
            - mountPath: /datadir
              name: dd-test-thingy-nvme-vol
      volumes:
        - name: dd-test-thingy-nvme-vol
          persistentVolumeClaim:
            claimName: example-pvc

Will post results ASAP |
Ok, dd also got killed by OOM:

[root@dd-test-thingy-799cd97974-zldtg /]# dd if=/dev/urandom of=/datadir/2G.bin bs=32K count=65536 status=progress iflag=fullblock
421593088 bytes (422 MB, 402 MiB) copied, 1 s, 422 MB/s
command terminated with exit code 137

And here's the dmesg dump: |
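One way to watch this happen (a sketch, assuming cgroups v2 and that the container sees its own cgroup at /sys/fs/cgroup, which depends on the runtime's cgroup namespace setup) is to monitor the dirty/writeback page counters of the container's memory cgroup while dd runs; if file_dirty plus memory.current climb toward the 600M limit faster than writeback can drain them, the memcg OOM killer fires:

# Run inside the test container in a second shell while dd is writing.
while true; do
  cat /sys/fs/cgroup/memory.current
  grep -E 'file_dirty|file_writeback' /sys/fs/cgroup/memory.stat
  sleep 1
done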
Yup, that looks exactly like that kernel bug. I'm surprised it happens with storage that should be able to flush the data to disk quite quickly. |
I looked into it, and it seems they released a new version of Talos running kernel 6.12.6. |
I guess I spoke too soon. Uploading an image still gives the same problem, both on SATA and the NVMe SSDs. |
I did a few passes and they consistently passed. Now I consistently get OOM'd... I got a little too excited 😅 |
From a quick check, https://github.com/gregkh/linux does not have the revert commit that CentOS does. |
Huh, just bumped into a fresh 6.13 RC commit that probably tackles the same issue without a revert |
I'll look out for new Talos releases and report back in case anything changes. I will work around this using DataVolume's import feature for now. |
Yeah, you can either do that or find a way to rate-limit the upload (I'm assuming it's blazing fast, so slowing it down could work around the OOM). |
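For completeness, a rough sketch of the DataVolume import workaround mentioned above, i.e. letting CDI pull the ISO from an HTTP server instead of pushing it through the upload proxy. The URL and name here are placeholders; sizes mirror the ones used in the report:

kubectl create -f - <<'EOF'
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: oi-hipster-gui-20240426-iso
spec:
  source:
    http:
      url: "http://example.internal/OI-hipster-gui-20240426.iso"  # placeholder URL
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 3Gi
EOF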
What happened:
Using the command
k virt image-upload dv oi-hipster-gui-20240426-iso --image-path=./OI-hipster-gui-20240426.iso --size 3Gi --volume-mode filesystem --access-mode ReadWriteOnce --force-bind --uploadproxy-url=https://172.16.8.132:443/ --insecure
any upload gets terminated, returning
unexpected return value 502, error in upload-proxy: http: proxy error: write tcp 10.244.0.168:33826->10.99.16.191:443: write: connection reset by peer
Looking into Talos' dashboard, I saw that the cdi-upload-server container was killed by oom_reaper.
I get this every single time I try to upload this image.
What you expected to happen:
I expected the transfer to complete successfully.
How to reproduce it (as minimally and precisely as possible):
Additional context:
The machine has more than enough RAM available to do the job (or at least I think so):
I have tried exposing the upload proxy both as a LoadBalancer and through a TLSRoute (since I initially thought the gateway controller was acting up, rather than an OOM).
I have a dump of the error message:
dmesg-oom.txt
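Worth noting: an OOM kill reported against the cdi-upload-server container is most likely evaluated against that container's memory cgroup limit, not the node's total RAM, so free host memory doesn't help. A sketch of how one might inspect the limits on the upload pod that CDI spawns (pod names and namespace are assumptions; adjust to your cluster):

# CDI creates a cdi-upload-<pvc-name> pod next to the target PVC.
kubectl get pods -A -o name | grep cdi-upload
kubectl -n <namespace> get pod <cdi-upload-pod-name> -o jsonpath='{.spec.containers[*].resources}{"\n"}'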
Environment:
CDI version (use kubectl get deployments cdi-deployment -o yaml): v1.61.0
Kubernetes version (use kubectl version): v1.29.11
Kernel (use uname -a): 6.12.5-talos