Scatter using Singularity fails creating SIF #5

Open
atombaby opened this issue Apr 10, 2020 · 9 comments

@atombaby (Contributor)

When using singularity exec within tasks executed inside a scatter, there is a race condition when the Docker/Singularity image isn't already in the cache. On NFS-mounted home directories this apparently results in a "stale NFS file handle" error.

To replicate it, the images must first be removed from the Singularity cache (~/.singularity). It is also difficult to replicate with simple images (e.g. ubuntu); the Broad's GATK image seems to reproduce this error fairly reliably.
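
A minimal sketch of the reproduction (the digest is the GATK image from the logs below; the shard count and cache path are illustrative assumptions):

# Sketch: clear the cache, then start several concurrent builds of the
# same image, as the scatter shards do. The shard count is arbitrary.
rm -rf ~/.singularity/cache

IMAGE=docker://broadinstitute/gatk@sha256:0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa
for shard in 0 1 2 3; do
    singularity exec "$IMAGE" /bin/true &
done
wait
# With the image uncached, every shard races to write the same SIF under
# ~/.singularity/cache/oci-tmp/, and on NFS one or more typically fail.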

==> shard-0/execution/stderr <==
2020/04/10 07:06:02 debug unpacking entry           path=root/.conda root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=53
2020/04/10 07:06:02 debug unpacking entry           path=root/.conda/pkgs root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=53
2020/04/10 07:06:02 debug unpacking entry           path=root/.conda/pkgs/urls root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=48
2020/04/10 07:06:02 debug unpacking entry           path=root/.conda/pkgs/urls.txt root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=48
2020/04/10 07:06:02 debug unpacking entry           path=root/.gradle root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=53
2020/04/10 07:06:02 debug unpacking entry           path=root/gatk.jar root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=50
2020/04/10 07:06:02 debug unpacking entry           path=root/run_unit_tests.sh root=/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e type=48
DEBUG   [U=34152,P=14608]  Full()                        Inserting Metadata
DEBUG   [U=34152,P=14608]  Full()                        Calling assembler
INFO    [U=34152,P=14608]  Assemble()                    Creating SIF file...
DEBUG   [U=34152,P=14608]  cleanUp()                     Cleaning up "/loc/scratch/46802618/rootfs-3371b095-7b34-11ea-ae11-002590e2b58e" and "/loc/scratch/46802618/bundle-temp-539962740"
FATAL   [U=34152,P=14608]  replaceURIWithImage()         Unable to handle docker://broadinstitute/gatk@sha256:0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa uri: unable to build: while creating SIF: while creating container: writing data object for SIF file: copying data object file to SIF file: write /home/mrg/.singularity/cache/oci-tmp/0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa/gatk@sha256_0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa.sif: stale NFS file handle
==> shard-1/execution/stderr <==
2020/04/10 07:06:13 debug unpacking entry           path=root/.conda root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=53
2020/04/10 07:06:13 debug unpacking entry           path=root/.conda/pkgs root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=53
2020/04/10 07:06:13 debug unpacking entry           path=root/.conda/pkgs/urls root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=48
2020/04/10 07:06:13 debug unpacking entry           path=root/.conda/pkgs/urls.txt root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=48
2020/04/10 07:06:13 debug unpacking entry           path=root/.gradle root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=53
2020/04/10 07:06:13 debug unpacking entry           path=root/gatk.jar root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=50
2020/04/10 07:06:13 debug unpacking entry           path=root/run_unit_tests.sh root=/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824 type=48
DEBUG   [U=34152,P=3126]   Full()                        Inserting Metadata
DEBUG   [U=34152,P=3126]   Full()                        Calling assembler
INFO    [U=34152,P=3126]   Assemble()                    Creating SIF file...
VERBOSE [U=34152,P=3126]   Full()                        Build complete: /home/mrg/.singularity/cache/oci-tmp/0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa/gatk@sha256_0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa.sif
DEBUG   [U=34152,P=3126]   cleanUp()                     Cleaning up "/loc/scratch/46802619/rootfs-4740289f-7b34-11ea-ad57-002590e2b824" and "/loc/scratch/46802619/bundle-temp-701530895"
VERBOSE [U=34152,P=3126]   handleOCI()                   Image cached as SIF at /home/mrg/.singularity/cache/oci-tmp/0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa/gatk@sha256_0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa.sif
DEBUG   [U=34152,P=3126]   execStarter()                 Checking for encrypted system partition

.... output trimmed; the log indicates this shard ran the container ....

@vortexing (Collaborator)

Sometimes this is still happening. The file lock on the .sif that is created when the first job starts to convert the Docker container doesn't seem to work right, and I'm still unsure how to fix that.

atombaby self-assigned this Jan 20, 2022

@vortexing (Collaborator)

I'm still not sure how to fix this. I have begun sending workflows that contain a scatter with a scatter of 1 first, so that a single job pulls the Docker container and converts it; by the time the larger scatter runs, the image is already cached in scratch.
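
A hypothetical manual equivalent, warming the cache once before the scattered jobs start (the image digest is the one from the logs above):

# Sketch: build the SIF once, serially, so that concurrent shards later
# find it cached and skip the racy build step entirely.
singularity exec docker://broadinstitute/gatk@sha256:0dd5cb7f9321dc5a43e7667ed4682147b1e827d6a3e5f7bf4545313df6d491aa /bin/true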

@atombaby (Contributor, Author)

Just to clarify (it's been a while since I dug into this): are we still doing the image build as part of the submit-docker script, here in the diy server config?

I've got the diy server up and running and have a minimal case for reproducing this. I'm not sure whether the switch to the new home file server will change the behavior at all (i.e. whether it will still throw a "stale file handle"), but all of the shards are still trying to build the image concurrently, which will produce some... interesting... results.

> The file lock on the .sif that is created when the first job starts to convert the Docker container doesn't seem to work right, and I'm still unsure how to fix that.

Where is that lock file set? Is it set via singularity pull?
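
One way to serialize the build explicitly, rather than relying on whatever lock singularity itself takes, is to wrap the image build in flock(1) inside the submit-docker script. A sketch under assumptions: CACHE_DIR, LOCK_FILE, and DOCKER_IMAGE are illustrative names, not Cromwell-provided variables.

#!/bin/bash
# Sketch of a submit-docker fragment that serializes SIF creation.
CACHE_DIR=$HOME/.singularity/cache
LOCK_FILE=$CACHE_DIR/singularity-pull.lock
mkdir -p "$CACHE_DIR"

# Only one job at a time holds the lock while the image is built; the
# others block on flock and then find the SIF already in the cache.
(
    flock --exclusive --timeout 900 9 || exit 1
    singularity exec docker://$DOCKER_IMAGE /bin/true
) 9>"$LOCK_FILE"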

@atombaby (Contributor, Author)

atombaby commented Feb 11, 2022

By way of reference, it looks like others are working on this issue: broadinstitute/cromwell#5063, and there is a possible helper in prepull.

@atombaby (Contributor, Author)

I've put in a PR ... does this look like it might work for us here?

@vortexing (Collaborator)

I'm not sure how to incorporate the biowdl prepull into Cromwell's backend config itself. I suspect it's meant to be used as a tool that runs inside each workflow and specifically pulls the Docker containers that workflow will use. I'd prefer to find a solution that is invisible to the users if at all possible, because otherwise the WDLs they write won't be directly portable to other backends.

@atombaby (Contributor, Author)

atombaby commented Feb 15, 2022

Part of the problem may be that singularity pull always seems to look in $HOME/.singularity no matter what you've set SINGULARITY_CACHEDIR to:

gizmok170[~]: echo $SINGULARITY_CACHEDIR/
/loc/scratch/49107063/
gizmok170[~]: singularity cache list -v
NAME                     DATE CREATED           SIZE             TYPE

There are 0 container file(s) using 0.00 kB and 0 oci blob file(s) using 0.00 kB of space
Total space used: 0.00 kB
gizmok170[~]: singularity pull docker://godlovedc/lolcow
FATAL:   Image file already exists: "lolcow_latest.sif" - will not overwrite
gizmok170[~]: unset SINGULARITY_CACHEDIR
gizmok170[~]: singularity cache list -v
NAME                     DATE CREATED           SIZE             TYPE
lolcow_latest.sif        2022-02-14 16:09:49    92.23 MB         oci
3b61febd4aefe982e0cb9c   2022-02-14 16:09:17    0.85 kB          blob
73d5b1025fbfa138f2cacf   2022-02-14 16:09:21    3.41 kB          blob
7fac07fb303e0589b9c23e   2022-02-14 16:09:18    0.17 kB          blob
8e860504ff1ee5dc795367   2022-02-14 16:09:21    56.36 MB         blob
9d99b9777eb02b8943c0e7   2022-02-14 16:09:17    0.62 kB          blob
9fb6c798fa41e509b58bcc   2022-02-14 16:09:17    47.54 MB         blob
d010c8cf75d7eb5d2504d5   2022-02-14 16:09:18    0.85 kB          blob
f2a852991b0a36a9f3d6b2   2022-02-14 16:09:22    1.12 kB          blob

There are 1 container file(s) using 92.23 MB and 8 oci blob file(s) using 103.90 MB of space
Total space used: 196.12 MB
gizmok170[~]: singularity pull docker://godlovedc/lolcow
FATAL:   Image file already exists: "lolcow_latest.sif" - will not overwrite
gizmok170[~]: singularity --version
singularity version 3.5.3

There is an issue from deep-dark history suggesting this should have been fixed by the time this version came out, but it's apparent that pull is still ignoring the value of SINGULARITY_CACHEDIR.
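
For later reference, a sketch of what we would want once pull honors the cache variable: keep both the cache and the final SIF in job-local scratch, so shards never collide on a shared NFS path. The scratch layout and the SLURM_JOB_ID variable are assumptions modeled on the paths in the logs above.

# Sketch, assuming a singularity version where pull honors SINGULARITY_CACHEDIR.
export SINGULARITY_CACHEDIR=/loc/scratch/$SLURM_JOB_ID/singularity-cache
mkdir -p "$SINGULARITY_CACHEDIR"

# Pull to an explicit, job-local output path rather than the current directory.
SIF=/loc/scratch/$SLURM_JOB_ID/lolcow_latest.sif
singularity pull --force "$SIF" docker://godlovedc/lolcow
singularity exec "$SIF" /bin/true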

@vortexing (Collaborator)

oh. well. So that's nice. Wow. So relying on some fancy smartness on the part of singularity pull is unwise, eh? Excellent to know.

@atombaby (Contributor, Author)

I'm going to request an updated singularity and see if this is fixed in later versions.
