I have a stress test where many pyxis containers are run in sequence on a node with the same container filesystem (--container-name).
Pyxis calls enroot list -f to check if a container exists and if it's already running. But it seems that enroot list -f can fail with return code 1 if a running container exits during the invocation of enroot list -f.
It seems to be caused by the ps -p command (line 535 in 2f0c49e), with the container PID exiting between the call to lsns and the call to ps -p.
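To illustrate the suspected failure mode outside of enroot, here is a minimal sketch. It assumes the procps implementation of ps, which exits with a non-zero status when none of the PIDs passed to -p exist. The short-lived sleep process is only a stand-in for a container PID that lsns reported a moment earlier; nothing here is taken from the enroot source.

```shell
# Start a short-lived process and remember its PID.
# This stands in for a container PID that lsns has just reported.
sleep 0.1 &
pid=$!

# Let the process exit and be reaped, as a container could
# between the lsns call and the ps -p call.
wait "$pid"

# ps -p now finds no matching process and returns a non-zero status.
rc=0
ps -p "$pid" -o pid= >/dev/null 2>&1 || rc=$?
echo "ps exit status for vanished PID: $rc"
```

If the calling script treats any failing command as fatal (an assumption about how the status propagates), this non-zero status would become the exit status of the whole listing.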
On rare occasions, I also saw a return code of 2, and it's likely caused by a similar race between the lsns call and this call (line 518 in 2f0c49e).
Example where it happens (might need a few attempts and/or tweaking the sleep duration):
$ enroot import -o /tmp/pytorch.sqsh docker://nvcr.io#nvidia/pytorch:22.05-py3
$ enroot create --force --name pyxis_pytorch /tmp/pytorch.sqsh
$ { enroot start --root --rw pyxis_pytorch sleep 1s &>/dev/null & }; sleep 1.4 ; enroot list -f; echo $?; wait;
[1] 1536191
NAME PID COMM STATE STARTED TIME MNTNS USERNS COMMAND
[1]+ Done enroot start --root --rw pyxis_pytorch sleep 1s &> /dev/null
1
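Since the race might need a few attempts to trigger, a small loop can rerun a reproduction command until it first fails. This is only a sketch: repeat_until_failure is a hypothetical helper name, not part of enroot or pyxis, and it works with any command or shell function.

```shell
# Hypothetical helper: rerun a command until it returns a non-zero status
# or the attempt budget is exhausted.
# Usage: repeat_until_failure MAX_ATTEMPTS COMMAND [ARGS...]
repeat_until_failure() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        rc=0
        "$@" || rc=$?
        if [ "$rc" -ne 0 ]; then
            # Report the first failing attempt and its exit status.
            echo "attempt ${attempt}: exit status ${rc}"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in ${max_attempts} attempts"
    return 1
}
```

To use it with the reproduction above, wrap the start, sleep, and enroot list -f sequence in a shell function that returns the status of enroot list -f, then pass that function name as the command.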