-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: get ensembl-reference wrapper to download more than one chromosome #3432
base: master
Are you sure you want to change the base?
fix: get ensembl-reference wrapper to download more than one chromosome #3432
Conversation
📝 Walkthrough📝 Walkthrough📝 WalkthroughWalkthroughThe changes in this pull request involve modifications to the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Script
participant Curl
User->>Script: Execute script
Script->>Script: Determine suffixes based on datatype
Script->>Script: Check if chromosome is specified
alt Chromosome not specified
Script->>Script: Attempt download
Script->>Curl: Execute download command
Script->>Script: Break loop after first success
else Chromosome specified
Script->>Script: Iterate over suffixes
Script->>Curl: Execute download command for each suffix
end
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
bio/reference/ensembl-sequence/wrapper.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
bio/reference/ensembl-sequence/wrapper.py (2)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
Pattern **/wrapper.py
: Do not complain about use of undefined variable called snakemake
.
🔇 Additional comments (1)
bio/reference/ensembl-sequence/wrapper.py (1)
Line range hint 19-23
: Enhance input validation and error handling.
Consider the following improvements for better robustness:
- Validate chromosome values against a list of valid chromosomes
- Provide more specific error messages for invalid chromosomes
- Consider adding a check for output file size after each download
Let's verify the valid chromosome values for the specified species:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
bio/reference/ensembl-sequence/wrapper.py (2)
Line range hint
58-66
: Add success logging for better debugging.While the success flag initialization is correct, adding logging would improve debugging capabilities.
shell("(curl -L {url} | gzip -d >> {snakemake.output[0]}) {log}") success = True + shell.logger.info(f"Successfully downloaded sequence from {url}")
🧰 Tools
🪛 Ruff
47-48: Use a single
if
statement instead of nestedif
statementsCombine
if
statements usingand
(SIM102)
53-53: Undefined name
snakemake
(F821)
Line range hint
57-70
: Consider enhancing multiple chromosome download implementation.To better support multiple chromosome downloads, consider these improvements:
- Track successful and failed downloads separately
- Provide summary of download results
- Handle partial successes appropriately
Example approach:
successful_downloads = [] failed_downloads = [] for suffix in suffixes: url = f"{url_prefix}.{suffix}" try: shell("curl -sSf {url} > /dev/null 2> /dev/null") shell("(curl -L {url} | gzip -d >> {snakemake.output[0]}) {log}") successful_downloads.append(suffix) shell.logger.info(f"Successfully downloaded {suffix}") except sp.CalledProcessError: failed_downloads.append(suffix) shell.logger.warning(f"Failed to download {suffix}") continue if successful_downloads: shell.logger.info(f"Successfully downloaded: {', '.join(successful_downloads)}") if failed_downloads: shell.logger.warning(f"Failed to download: {', '.join(failed_downloads)}") if not successful_downloads: exit(1)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
bio/reference/ensembl-sequence/wrapper.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
bio/reference/ensembl-sequence/wrapper.py (2)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
Pattern **/wrapper.py
: Do not complain about use of undefined variable called snakemake
.
🪛 Ruff
bio/reference/ensembl-sequence/wrapper.py
53-53: Undefined name snakemake
(F821)
🔇 Additional comments (2)
bio/reference/ensembl-sequence/wrapper.py (2)
50-50
: LGTM: Improved error message clarity.
The error message is now more precise and better explains the requirements for chromosome selection.
68-69
: Previous critical issue still applies.
The break statement still prevents downloading multiple chromosomes, which contradicts the PR objective.
for suffix in suffixes: | ||
success = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be outside the loop to check if at least one suffix was successful? This way it will only check the last suffix, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is outside of the loop, and we are requesting multiple chromosomes, this will turn true on any working chromosome, and then stay that way. So we will not get any debugging output and error thrown, in case any of the chromosomes is not available. So for the chromosomes case, we should reset this for every suffix in suffixes
. For the other case, checking whether "dna.primary_assembly.fa.gz"
or "dna.toplevel.fa.gz"
is available, it will break
out of the suffix in suffixes
loop right after setting success = True
and will otherwise be left with success = False
after the last suffix that runs into the except:
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this way it only checks if the last chromosome was available, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, you are right. Very good catch. Let me think about what the best solution is...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe moving the error checking directly to the try/except?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
bio/reference/ensembl-sequence/wrapper.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
bio/reference/ensembl-sequence/wrapper.py (2)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
Pattern **/wrapper.py
: Do not complain about use of undefined variable called snakemake
.
🪛 Ruff
bio/reference/ensembl-sequence/wrapper.py
53-53: Undefined name snakemake
(F821)
🔇 Additional comments (2)
bio/reference/ensembl-sequence/wrapper.py (2)
50-50
: LGTM: Improved error message clarity.
The error message is now more precise and user-friendly.
75-76
: LGTM: Correct handling of non-chromosome case.
The break statement is correctly guarded to only exit early when not downloading specific chromosomes.
if chromosome: | ||
print( | ||
f"Unable to download the requested chromosome sequence from Ensembl at: {url_prefix}.{suffix}.", | ||
file=sys.stderr, | ||
) | ||
break | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Remove break statement and use proper logging.
The current implementation has two issues:
- The
break
statement prevents downloading multiple chromosomes - Direct print statements should use Snakemake's logging mechanism
Apply these changes:
if chromosome:
- print(
- f"Unable to download the requested chromosome sequence from Ensembl at: {url_prefix}.{suffix}.",
- file=sys.stderr,
- )
- break
+ shell.logger.error(
+ f"Unable to download chromosome sequence from: {url_prefix}.{suffix}"
+ )
+ continue
else:
continue
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if chromosome: | |
print( | |
f"Unable to download the requested chromosome sequence from Ensembl at: {url_prefix}.{suffix}.", | |
file=sys.stderr, | |
) | |
break | |
else: | |
if chromosome: | |
shell.logger.error( | |
f"Unable to download chromosome sequence from: {url_prefix}.{suffix}" | |
) | |
continue | |
else: |
for suffix in suffixes: | ||
success = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Track download success for each chromosome separately.
The current implementation resets the success flag for each suffix, which could mask failures of individual chromosome downloads. Consider tracking success per chromosome:
-success = False
+successes = set() # Track successful downloads
Then update the success tracking after download:
-success = True
+successes.add(suffix) # Record successful download
And modify the final check:
-if not success:
+if not successes:
This change will help identify which specific chromosomes failed to download.
Committable suggestion skipped: line range outside the PR's diff.
if not chromosome: | ||
if len(suffixes) > 1: | ||
url = f"{url_prefix}.[{'|'.join(suffixes)}]" | ||
else: | ||
url = f"{url_prefix}.{suffixes[0]}" | ||
print( | ||
f"Unable to download the requested reference sequence data from Ensembl at: {url}.", | ||
file=sys.stderr, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Improve error reporting with Snakemake logging.
The error reporting can be improved by:
- Using Snakemake's logging mechanism
- Providing more detailed feedback about failed chromosomes
-if not success:
+if not successes:
if not chromosome:
if len(suffixes) > 1:
url = f"{url_prefix}.[{'|'.join(suffixes)}]"
else:
url = f"{url_prefix}.{suffixes[0]}"
- print(
+ shell.logger.error(
f"Unable to download the requested reference sequence data from Ensembl at: {url}.",
- file=sys.stderr,
)
+else:
+ failed = set(suffixes) - successes
+ if failed:
+ shell.logger.error(
+ f"Failed to download the following chromosomes: {', '.join(failed)}"
+ )
-print(
+shell.logger.error(
"Please check whether above URL is currently available (might be a temporal server issue). "
"Apart from that, did you check that this combination of species, build, and release is actually provided?",
- file=sys.stderr,
)
Committable suggestion skipped: line range outside the PR's diff.
Currently, only the first chromosome from the list is downloaded.
QC
snakemake-wrappers
.While the contributions guidelines are more extensive, please particularly ensure that:
test.py
was updated to call any added or updated example rules in aSnakefile
input:
andoutput:
file paths in the rules can be chosen arbitrarilyinput:
oroutput:
)tempfile.gettempdir()
points tometa.yaml
contains a link to the documentation of the respective tool or command underurl:
Summary by CodeRabbit
New Features
Bug Fixes