-
I think your main container (runc) was killed by OOM. Check your running pod.
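If OOM is the suspect, the kubelet records it in the container status. A minimal way to check, assuming `kubectl` access; the pod and namespace names below are placeholders:

```shell
POD=my-workflow-pod-12345   # placeholder: the pod that died
NS=default                  # placeholder: its namespace

# Print each container's name and the reason its last run terminated;
# an OOM-killed container reports "OOMKilled" here.
kubectl get pod "$POD" -n "$NS" \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# The pod's events often record the kill as well.
kubectl describe pod "$POD" -n "$NS" | grep -i oom
```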
-
I have a problem which I initially thought was related to this.
I have a workflow which usually runs through fine. It creates hundreds of pods, processing data, each one usually completing without issue.
However, every once in a while, the main container does not move out of its "Running" phase (see the screenshot).
This pod should take a couple of minutes, but it has been sitting in the Running phase for 13 hours.
Initially, I thought there was a bug in the code keeping the container alive, so I added a bunch of logging - the code does seem to execute successfully. In my case, the last log entry says that an upload completed successfully.
Furthermore, these "hanging pod" problems occur sporadically - it is not necessarily the same piece of data on which they fail, which leads me to believe it is not a data issue.
The wait container is of course still running (which is why I thought this might be related to the issue I linked above), but I assume the wait container is only still running because the main container is still running?
I'm not sure how to debug this - any advice would be greatly appreciated :)
(argo-workflows v3.0.8)
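A few things worth checking on the hanging pod before it gets cleaned up - a sketch, assuming `kubectl` access; the pod and namespace names are placeholders:

```shell
POD=my-hanging-pod   # placeholder: the pod stuck in Running
NS=argo              # placeholder: its namespace

# What state does Kubernetes report for each container? If main shows
# "running" but its process is actually gone, that points at the runtime.
kubectl get pod "$POD" -n "$NS" \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'

# Recent events: failed probes, kills, node pressure, etc.
kubectl describe pod "$POD" -n "$NS" | sed -n '/^Events:/,$p'

# The wait sidecar watches the main container; its logs may say what
# it is (still) waiting for.
kubectl logs "$POD" -n "$NS" -c wait --tail=50
```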
Further:
When I try to exec into the main container of the hanging pod with:
kubectl exec -it <pod-name> --container main -- /bin/bash
I get:
OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "process_linux.go:101: executing setns process caused "exit status 1"": unknown
command terminated with exit code 126
However, I am able to exec into the wait container. Also, I am able to exec into the main container of a healthy pod (before it completes, of course).
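Exit code 126 with a `setns` failure generally means the exec machinery could not enter the container's namespaces - for example because the container's init process is dead or a zombie even though the runtime still lists the container as running. One way to check from the node hosting the pod; a sketch, where the container ID is a placeholder taken from the pod's `containerStatuses`, and `crictl` availability depends on your runtime:

```shell
CONTAINER_ID=abc123def456   # placeholder: main container's ID from `kubectl describe pod`

# Ask the runtime what it thinks the container's state and PID are.
crictl inspect "$CONTAINER_ID" | grep -E '"state"|"pid"'

# If the container's init process is a zombie (State: Z), setns into it
# fails, which matches the exec error above.
PID=$(crictl inspect "$CONTAINER_ID" | awk -F'[:,]' '/"pid"/ {gsub(/ /,"",$2); print $2; exit}')
grep '^State:' /proc/"$PID"/status
```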