-
I think your main container (runc) was killed by OOM. Check your running pod.
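If OOM is the suspect, the kubelet records it in the container status. A minimal way to check, assuming `kubectl` access; the pod and namespace names below are placeholders:

```shell
POD=my-workflow-pod-12345   # placeholder: the pod that died
NS=default                  # placeholder: its namespace

# Print each container's name and the reason its last run terminated;
# an OOM-killed container reports "OOMKilled" here.
kubectl get pod "$POD" -n "$NS" \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# The pod's events often record the kill as well.
kubectl describe pod "$POD" -n "$NS" | grep -i oom
```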
-
I have a problem which I initially thought was related to this.
I have a workflow which usually runs through fine. It creates hundreds of pods, processing data, each one usually completing without issue.
However, every once in a while, the main container does not move out of its "Running" phase (see the screenshot).
This pod should take a couple of minutes, but it has been sitting in the Running phase for 13 hours.
Initially, I thought there was a bug in the code keeping the container alive, so I added a bunch of logging - the code does seem to execute successfully. In my case, the last log entry says that an upload completed successfully.
Furthermore, these "hanging pod" problems occur sporadically - it is not necessarily the same piece of data on which they fail, which leads me to believe it is not a data issue.
The wait container is of course still running (which is why I thought this might be related to the issue I linked above), but I assume the wait container is only still running because the main container is still running?
I'm not sure how to debug this - any advice would be greatly appreciated :)
(argo-workflows v3.0.8)
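A few things worth checking on the hanging pod before it gets cleaned up - a sketch, assuming `kubectl` access; the pod and namespace names are placeholders:

```shell
POD=my-hanging-pod   # placeholder: the pod stuck in Running
NS=argo              # placeholder: its namespace

# What state does Kubernetes report for each container? If main shows
# "running" but its process is actually gone, that points at the runtime.
kubectl get pod "$POD" -n "$NS" \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'

# Recent events: failed probes, kills, node pressure, etc.
kubectl describe pod "$POD" -n "$NS" | sed -n '/^Events:/,$p'

# The wait sidecar watches the main container; its logs may say what
# it is (still) waiting for.
kubectl logs "$POD" -n "$NS" -c wait --tail=50
```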
Further:
When I try to exec into the main container of the hanging pod with:
kubectl exec -it <pod-name> --container main -- /bin/bash
I get:
OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "process_linux.go:101: executing setns process caused "exit status 1"": unknown
command terminated with exit code 126
However, I am able to exec into the wait container. Also, I am able to exec into the main container of a healthy pod (before it completes, of course).
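Exit code 126 with a `setns` failure generally means the exec machinery could not enter the container's namespaces - for example because the container's init process is dead or a zombie even though the runtime still lists the container as running. One way to check from the node hosting the pod; a sketch, where the container ID is a placeholder taken from the pod's `containerStatuses`, and `crictl` availability depends on your runtime:

```shell
CONTAINER_ID=abc123def456   # placeholder: main container's ID from `kubectl describe pod`

# Ask the runtime what it thinks the container's state and PID are.
crictl inspect "$CONTAINER_ID" | grep -E '"state"|"pid"'

# If the container's init process is a zombie (State: Z), setns into it
# fails, which matches the exec error above.
PID=$(crictl inspect "$CONTAINER_ID" | awk -F'[:,]' '/"pid"/ {gsub(/ /,"",$2); print $2; exit}')
grep '^State:' /proc/"$PID"/status
```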