fix: Allow concurrent and large refgenie downloads #1172
base: master
Conversation
feat: Lofreq indelqual wrapper (snakemake#1166)
Thanks for looking into this!
To get your new rule tested, have a look at the existing refgenie test (in test.py):
Lines 4322 to 4330 in 0d2c92a
@skip_if_not_modified
def test_refgenie():
    try:
        shutil.copytree("bio/refgenie/test/genome_folder", "/tmp/genome_folder")
    except FileExistsError:
        # no worries, the directory is already there
        pass
    os.environ["REFGENIE"] = "/tmp/genome_folder/genome_config.yaml"
    run("bio/refgenie", ["snakemake", "--cores", "1", "--use-conda", "-F"])
You would need to include another test that requires your new rule's output. Before doing so, however, it might make sense to change this to an example that is only slightly over the 5 GB threshold: otherwise the test would do an excessive download (for example the 25 GB you mentioned) and fill up the GitHub Actions VM's disk space every time it is run (even though it is only run if anything changes in this wrapper). For illustration, such a follow-up test is sketched below.
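A sketch only, reusing the helpers already defined in test.py (skip_if_not_modified, run); the test name and the idea of targeting a just-over-5-GB asset are assumptions, not part of this PR:

@skip_if_not_modified
def test_refgenie_large_asset():  # hypothetical test name
    # Reuse the shared genome folder set up for the existing refgenie test.
    try:
        shutil.copytree("bio/refgenie/test/genome_folder", "/tmp/genome_folder")
    except FileExistsError:
        pass  # the directory is already there
    os.environ["REFGENIE"] = "/tmp/genome_folder/genome_config.yaml"
    # The test Snakefile would target a rule whose asset is only slightly
    # above the 5 GB "large" threshold, so that force_large is exercised
    # without an excessive download.
    run("bio/refgenie", ["snakemake", "--cores", "1", "--use-conda", "-F"])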
except RefgenconfError:
    # If read lock timeout, attempt to skip the read lock
    rgc = refgenconf.RefGenConf(
        conf_path, writable=True, skip_read_lock=True, genome_exact=False
    )
I couldn't really find out what the exact implications of skip_read_lock=True are, but it seems dangerous to me. Have you also tried increasing wait_max as an alternative?
I didn't attempt to, but I suspect that this might not be a great choice either. If someone is downloading an asset over a slow connection, even raising wait_max from its default of 60 to 600 might not make a difference, and would still result in a hard-to-diagnose timeout error.
I'm also not sure whether this was some sort of conflict with the Snakemake locking system. If we rely on that to protect other files, then the wrapper either produces the output file, or the rule fails with a RefgenconfError that recommends setting the skip_read_lock=True parameter to try to fix the issue.
From what I gathered by poking around a little, I think the lock only happens while something is written to the conf file. So I would think that this lock is not in place for the whole duration of the download, and that increasing wait_max should already help. But the documentation on this is not very clear, and I didn't immediately find the mechanism in the code, so I might be misunderstanding this lock.
Do you have the possibility to try wait_max in your use case and test whether this actually helps?
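For reference, a larger timeout would presumably just be passed when constructing the config object, along these lines (a sketch, assuming wait_max is accepted by the RefGenConf constructor; the value 600 and the env-var lookup are arbitrary examples, not part of this PR):

import os
import refgenconf

# refgenie config path, as set via the REFGENIE env var in the existing test
conf_path = os.environ["REFGENIE"]

# Wait up to 10 minutes for the config-file lock instead of the default 60 seconds.
rgc = refgenconf.RefGenConf(conf_path, writable=True, wait_max=600)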
  # pull asset if necessary
- gat, archive_data, server_url = rgc.pull(genome, asset, tag, force=False)
+ gat, archive_data, server_url = rgc.pull(
+     genome, asset, tag, force=False, force_large=True
+ )
Is force_large=True a good general default, or would it make more sense to make this settable via the params: keyword in the rule definition? I am assuming their default of prompting has a reason, namely to avoid accidental downloads of huge reference data, and having to explicitly specify this via params: would at least be a minimal sanity check that the user knows what they are doing.
I think that is a good alternative to implement. As is, there is no way to override this while using the wrapper.
Do you feel comfortable implementing this?
I'd introduce an (optional) params: force_large=True in one of the examples, and parse it here in the wrapper.py with force_large=snakemake.params.get("force_large", None), so that it defaults to whatever the original function's default is, and only changes if this is a deliberate choice by the user.
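Put together, that suggestion might look roughly like this (a sketch only: the rule name, output path, and wrapper reference are placeholders, and the params-based override is not yet part of this PR):

# In the example Snakefile: the user explicitly opts in to large downloads.
rule refgenie_pull_large:  # hypothetical rule name
    output:
        "refs/hg38_cdna/salmon_index",  # placeholder output path
    params:
        force_large=True,  # optional, deliberate opt-in for assets over 5 GB
    wrapper:
        "master/bio/refgenie"

# In wrapper.py: fall back to refgenconf's own default when the param is absent.
force_large = snakemake.params.get("force_large", None)
gat, archive_data, server_url = rgc.pull(
    genome, asset, tag, force=False, force_large=force_large
)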
sync upstream
chore: sync fork
chore: release 1.26.0
This PR was marked as stale because it has been open for 6 months with no activity.
Description
After running into a few instances of the refgenie wrapper downloading individual assets concurrently, I was met with a lock-timeout error when attempting to pull two large assets (I'm looking at you, hg38_cdna/salmon_index at ~25 GB). The default timeout is 60 seconds, but I wanted to attempt to handle this error.
So, I slightly modified the refgenie wrapper to handle the RefgenconfError raised when the lock cannot be obtained, by skipping the lock requirement. This may or may not be desirable behavior across all pipelines, but it resolved the issues in mine and passed the testing requirements. I added a rule that mimics obtaining a large asset, but I am not yet familiar enough with the wrapper system to know whether simply adding a rule means that it is tested.
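For context, the modified locking logic shown in the diff above amounts to the following (a reconstruction, not verbatim wrapper code: the conf_path resolution and the import path for RefgenconfError are my assumptions):

import os
import refgenconf
from refgenconf.exceptions import RefgenconfError  # assumed import path

conf_path = os.environ["REFGENIE"]  # refgenie config path, resolution assumed

try:
    rgc = refgenconf.RefGenConf(conf_path, writable=True, genome_exact=False)
except RefgenconfError:
    # The read-lock wait timed out (default 60 s), e.g. because another job
    # holds the lock on the config file: retry without taking the read lock.
    rgc = refgenconf.RefGenConf(
        conf_path, writable=True, skip_read_lock=True, genome_exact=False
    )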
Additionally, the inclusion of force_large=True was necessary to download assets larger than the large threshold of 5 GB.
QC
For all wrappers added by this PR,
- input: and output: file paths in the resulting rule can be changed arbitrarily,
- either the wrapper can only use a single core, or the example rule contains a threads: x statement with x being a reasonable default,
- rule names in the test case are in snake_case and tell what the rule is about, or match the tool's purpose or name (e.g. map_reads for a step that maps reads),
- all environment.yaml specifications follow the respective best practices,
- wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:),
- all fields of the example rules in the Snakefiles and their entries are explained via comments (input:/output:/params: etc.),
- stderr and/or stdout are logged correctly (log:), depending on the wrapped tool,
- temporary files are stored where the Python function tempfile.gettempdir() points to (see here; this also means that using any Python tempfile default behavior works),
- the meta.yaml contains a link to the documentation of the respective tool or command,
- Snakefiles pass the linting (snakemake --lint),
- Snakefiles are formatted with snakefmt.