CDI Upload Server pod gets terminated by OOM Killer on Talos v1.9.0 #3575

Open
IAMSolaara opened this issue Dec 28, 2024 · 23 comments

@IAMSolaara
What happened:
Using the command
k virt image-upload dv oi-hipster-gui-20240426-iso --image-path=./OI-hipster-gui-20240426.iso --size 3Gi --volume-mode filesystem --access-mode ReadWriteOnce --force-bind --uploadproxy-url=https://172.16.8.132:443/ --insecure
any upload gets terminated, returning
unexpected return value 502, error in upload-proxy: http: proxy error: write tcp 10.244.0.168:33826->10.99.16.191:443: write: connection reset by peer.

Looking at the Talos dashboard, I saw that the cdi-upload-server container was killed by oom_reaper.

I get this every single time I try to upload this image.

What you expected to happen:
I expected the transfer to complete successfully.

How to reproduce it (as minimally and precisely as possible):

  1. Issue the command I mentioned.
  2. The transfer starts and gets interrupted with the above-mentioned error.

Additional context:
The machine has more than enough RAM available to do the job (or at least I think so):
[screenshot: node resource usage showing plenty of free RAM]

I have tried exposing the upload proxy both as a LoadBalancer and through a TLSRoute (since I initially thought the gateway controller was acting up, not the OOM killer).

I have a dump of the error message:
dmesg-oom.txt

Environment:

  • CDI version (use kubectl get deployments cdi-deployment -o yaml): v1.61.0
  • Kubernetes version (use kubectl version): v1.29.11
  • DV specification: N/A
  • Cloud provider or hardware configuration: Intel Core i5-8600K, 24GB of DDR4 RAM, bare-metal Talos installation
  • OS (e.g. from /etc/os-release): Talos Linux v1.9.0
  • Kernel (e.g. uname -a): 6.12.5-talos
  • Install tools: I installed CDI following the guide at https://kubevirt.io/user-guide/storage/containerized_data_importer/#install-cdi
  • Others: N/A
@akalenyu
Collaborator

akalenyu commented Dec 29, 2024

Hey, any chance this setup is using cgroupsv1? With cgroupsv1, the host's total available memory gets taken into account, so overwhelming the pod limits was easy. Anyway, cgroupsv1 support has been dropped on the k8s side by now, IIRC.
If it's not, check this out: #3557 (comment)
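
A quick way to confirm on a node where you have a shell (just a sketch; Talos itself has no shell, so you'd go through talosctl there):

# prints "cgroup2fs" on a unified (v2) hierarchy, "tmpfs" on a cgroupsv1 host
stat -fc %T /sys/fs/cgroup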

@IAMSolaara
Author

I can confirm I'm not using cgroupsv1.

Talos defaults to always using the unified cgroup hierarchy (cgroupsv2), but cgroupsv1 can be forced with talos.unified_cgroup_hierarchy=0.

I don't have this flag set so I assume this is using v2.

As for the StorageClass, I'm using the Rancher hostPath CSI. I might try other ones like the OpenEBS Local PV CSI, though I doubt that changes anything.

I'll see if I can apply that patch to Talos' kernel, so this might eventually get fixed in v1.10.0, whenever that comes out.

@akalenyu
Collaborator

Yeah, if you have the sources it should be easy enough to check whether the revert made it in; it's a super small change:
https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4523/commits

@akalenyu
Collaborator

Another question: what are the dirty_* writeback values on Talos?
sudo sysctl -a | grep dirty

@IAMSolaara
Author

Another question: what are the dirty_* writeback values on Talos? sudo sysctl -a | grep dirty

Ran this command to check those values:

talosctl ls /proc/sys/vm -n 172.16.1.250 
	| parse '{node} {name}' | select name 
	| where {|it| $it.name | str contains 'dirty'} | str trim 
	| each {|it| insert value (talosctl cat /proc/sys/vm/($it.name) -n 172.16.1.250) } 

I got these values:
[screenshot: vm.dirty_* sysctl values]
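
(For anyone without nushell, a plain-shell equivalent would be roughly the following; just a sketch, same node IP assumed:)

for f in dirty_background_bytes dirty_background_ratio dirty_bytes dirty_expire_centisecs dirty_ratio dirty_writeback_centisecs; do
  printf '%s: ' "$f"
  talosctl cat /proc/sys/vm/$f -n 172.16.1.250
done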

@akalenyu
Collaborator

As for the StorageClass, I'm using the Rancher hostPath CSI

So usually I wouldn't expect that kernel issue with local storage on the node, but it's possible the hostpath driver is configured with a path that's backed by something else entirely. I know this is possible with https://github.com/kubevirt/hostpath-provisioner-operator.

Usually, the OOMs due to the kernel bug would occur with NFS/slow disks
https://lore.kernel.org/lkml/[email protected]/T/

@IAMSolaara
Author

IAMSolaara commented Dec 29, 2024

Usually, the OOMs due to the kernel bug would occur with NFS/slow disks

I see. I'm using a fairly old SATA SSD as the disk backing that local storage, but I can try a local-storage StorageClass that uses the NVMe boot drive, at least so I can get this to work.

Talos' build tooling doesn't seem to play nice with Podman, and I'm currently not at home, so fiddling with that right now is a bit problematic 😅

@IAMSolaara
Author

Just tried using the NVMe and I still get issues. I'll see if I can get that kernel patch installed.

@akalenyu
Collaborator

Just tried using the NVMe and I still get issues. I'll see if I can get that kernel patch installed.

Hmm interesting, didn't expect that. Did you set up rancher to use that disk?
BTW could you just talosctl ls /sys/fs/cgroup to make sure we're working with cgroupsv2

@IAMSolaara
Author

Just tried using the NVMe and I still get issues. I'll see if I can get that kernel patch installed.

Hmm interesting, didn't expect that. Did you set up rancher to use that disk?

Yep, the files live on the NVMe and I could see disk activity reflect that:
[screenshot: disk I/O activity on the NVMe]

BTW could you just talosctl ls /sys/fs/cgroup to make sure we're working with cgroupsv2

Sure, here's the output:

talosctl ls /sys/fs/cgroup -n 172.16.1.250
NODE           NAME
172.16.1.250   .
172.16.1.250   cgroup.controllers
172.16.1.250   cgroup.max.depth
172.16.1.250   cgroup.max.descendants
172.16.1.250   cgroup.pressure
172.16.1.250   cgroup.procs
172.16.1.250   cgroup.stat
172.16.1.250   cgroup.subtree_control
172.16.1.250   cgroup.threads
172.16.1.250   cpu.pressure
172.16.1.250   cpu.stat
172.16.1.250   cpu.stat.local
172.16.1.250   cpuset.cpus.effective
172.16.1.250   cpuset.cpus.isolated
172.16.1.250   cpuset.mems.effective
172.16.1.250   init
172.16.1.250   io.pressure
172.16.1.250   io.stat
172.16.1.250   kubepods
172.16.1.250   memory.numa_stat
172.16.1.250   memory.pressure
172.16.1.250   memory.reclaim
172.16.1.250   memory.stat
172.16.1.250   podruntime
172.16.1.250   system

@akalenyu
Collaborator

Yeah definitely cgroupsv2. Can you reproduce this with dd and containers?
Something like

podman run -m 600m --mount type=bind,source=/mnt/oom-nfs,target=/disk --rm -it quay.io/centos/centos:stream9 bash
dd if=/dev/urandom of=/disk/2G.bin bs=32K count=65536 status=progress iflag=fullblock

@IAMSolaara
Author

I don't have podman on Talos, but this manifest should be equivalent enough:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-pvc
spec:
  storageClassName: local-path-ephemeral
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dd-test-thingy
spec:
  selector:
    matchLabels:
      name: dd-test-thingy
  replicas: 1
  strategy:
    type: Recreate
    rollingUpdate: null
  template:
    metadata:
      labels:
        name: dd-test-thingy
    spec:
      containers:
        - name: thingy
          resources:
            limits:
              memory: 600M
          image: quay.io/centos/centos:stream9
          command: ["/bin/bash"]
          args: ["-c", "while true ;do sleep 50; done"] # I'll just kubectl exec into it and run dd so I can properly look at it do stuff
          volumeMounts:
            - mountPath: /datadir
              name: dd-test-thingy-nvme-vol
      volumes:
        - name: dd-test-thingy-nvme-vol
          persistentVolumeClaim:
            claimName: example-pvc

Will post results ASAP
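
Roughly what I plan to run inside the pod (just a sketch; the actual pod name generated by the deployment will differ):

kubectl exec -it deploy/dd-test-thingy -- bash
# then, inside the container:
dd if=/dev/urandom of=/datadir/2G.bin bs=32K count=65536 status=progress iflag=fullblock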

@IAMSolaara
Author

Ok, dd also got killed by OOM:

[root@dd-test-thingy-799cd97974-zldtg /]# dd if=/dev/urandom of=/datadir/2G.bin bs=32K count=65536 status=progress iflag=fullblock
421593088 bytes (422 MB, 402 MiB) copied, 1 s, 422 MB/scommand terminated with exit code 137

And here's the dmesg dump:
oom-2.txt

@akalenyu
Collaborator

Yup, looks exactly like that kernel bug. I'm surprised it happens with storage that should be able to flush the data to disk quite quickly.

@IAMSolaara
Author

I looked into it and it seems they released a new version of Talos running kernel 6.12.6.
It looks like they use a vanilla kernel? https://github.com/siderolabs/pkgs/blob/45c4ba4957b013015a5b1457162b1659a2149712/Pkgfile#L75-L78
I tried digging in, but I can't quite tell whether they have that patch in...
I'm going to update Talos and see if this is fixed.

@IAMSolaara
Author

Update done. Test passes and no OOM reapers in sight 🎉
[screenshot: dd test completing without OOM]

Gonna try using CDI and report back but I think we're in the clear.

@IAMSolaara
Author

I guess I spoke too soon. Uploading an image still gives the same problem, both on SATA and the NVMe SSDs.

dmesg-191.txt

@akalenyu
Collaborator

Update done. Test passes and no OOM reapers in sight 🎉

Gonna try using CDI and report back but I think we're in the clear.

Are you sure it wasn't just a lucky pass, or does it consistently not OOM?

@IAMSolaara
Author

I did a few passes and they consistently passed. Now I consistently get OOM'd...

I got a little too excited 😅

@akalenyu
Collaborator

I did a few passes and they consistently passed. Now I consistently get OOM'd...

I got a little too excited 😅

From a quick check, https://github.com/gregkh/linux does not have the revert commit that CentOS does.

@akalenyu
Collaborator

Huh, I just bumped into a fresh 6.13 RC commit that probably tackles the same issue without a revert:
gregkh/linux@1bc542c

@IAMSolaara
Author

I'll look out for new Talos releases and report back in case anything changes. I will work around this using DataVolume's import feature for now.
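
Something along these lines, for anyone finding this later (just a sketch; the URL and storage class are placeholders for my setup):

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: oi-hipster-gui-20240426-iso
spec:
  source:
    http:
      # hypothetical internal HTTP server hosting the ISO
      url: http://fileserver.internal/OI-hipster-gui-20240426.iso
  storage:
    storageClassName: local-path   # placeholder; whichever local SC is in use
    accessModes:
      - ReadWriteOnce
    volumeMode: Filesystem
    resources:
      requests:
        storage: 3Gi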

@akalenyu
Collaborator

I'll look out for new Talos releases and report back in case anything changes. I will work around this using DataVolume's import feature for now.

Yeah, you can either do that, or find a way to rate-limit the upload (I'm assuming it's blazing fast, so slowing it down could work around the OOM).
