"enroot list -f" can fail if a container PID disappears #126

Open
@flx42

I have a stress test where many pyxis containers are run in sequence on a node with the same container filesystem (--container-name).
Pyxis calls enroot list -f to check if a container exists and if it's already running.
But it seems that enroot list -f can fail with return code 1 if a running container exits during the invocation of enroot list -f.

It seems to be caused by the ps -p command:

```
ps -p "${entry[*]}" --no-headers -o pid:1,stat:1,stime:1,etime:1,mntns:1,userns:1,command:1 \
```

With the container PID exiting between the call to lsns and the call to ps -p.
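
The failure mode can be reproduced in isolation, without enroot. This is a minimal sketch, assuming procps `ps` (which exits with a non-zero status when `-p` selects no process). The reaped `sleep` is only a stand-in for a container PID that `lsns` reported just before it exited, and the `-o` list is shortened for brevity:

```shell
#!/usr/bin/env bash
# Stand-in for a container process that exits right after lsns listed it.
sleep 0 &
pid=$!
wait "$pid"   # the process has now exited and been reaped

# Same kind of query as in enroot list -f, but against the vanished PID.
# "|| rc=$?" keeps the shell going so the status can be inspected.
rc=0
ps -p "$pid" --no-headers -o pid:1,stat:1 || rc=$?
echo "ps exit status: $rc"
```

If the calling script runs with `set -e` (an assumption here), that non-zero status aborts the script unless it is handled explicitly.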

On rare occasions, I also saw a return code of 2, likely caused by a similar race between lsns and this call:

```
name=$(awk '($5 == "/"){print $4; exit}' "/proc/${pid}/mountinfo" 2> /dev/null)
```
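
The same kind of check works for the awk call. In this sketch (again using a reaped `sleep` as a hypothetical stand-in for the container PID), `/proc/${pid}/mountinfo` no longer exists by the time awk tries to open it. The `2> /dev/null` hides the error message but not the exit status, which becomes the status of the assignment:

```shell
#!/usr/bin/env bash
# Stand-in for a container PID that disappeared after lsns reported it.
sleep 0 &
pid=$!
wait "$pid"

# awk cannot open the mountinfo file of a dead process; stderr is
# discarded, but the non-zero exit status still propagates through $(...).
rc=0
name=$(awk '($5 == "/"){print $4; exit}' "/proc/${pid}/mountinfo" 2> /dev/null) || rc=$?
echo "awk exit status: $rc, name: '${name}'"
```

Common awk implementations report an unreadable input file with a non-zero status, which would be consistent with the return code of 2 seen above, though the exact value may depend on the awk in use.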

Example where it happens (might need a few attempts and/or tweaking the sleep duration):

```
$ enroot import -o /tmp/pytorch.sqsh docker://nvcr.io#nvidia/pytorch:22.05-py3
$ enroot create --force --name pyxis_pytorch /tmp/pytorch.sqsh
$ { enroot start --root --rw pyxis_pytorch sleep 1s &>/dev/null & }; sleep 1.4 ; enroot list -f; echo $?; wait;
[1] 1536191
NAME  PID  COMM  STATE  STARTED  TIME  MNTNS  USERNS  COMMAND
[1]+  Done                    enroot start --root --rw pyxis_pytorch sleep 1s &> /dev/null
1
```
