-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: parse file format from extension (ignore compression) #41
base: master
Are you sure you want to change the base?
Conversation
📝 Walkthrough📝 WalkthroughWalkthroughThe changes in the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Snakemake
User->>Snakemake: Call get_format(path)
Snakemake->>Snakemake: Extract suffix using pathlib
Snakemake->>Snakemake: Check for specific extensions
alt Match found
Snakemake->>User: Return corresponding format
else No match
Snakemake->>User: Return lowercased extension
end
Warning Rate limit exceeded@fgvieira has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 19 minutes and 29 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
snakemake_wrapper_utils/snakemake.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake_wrapper_utils/snakemake.py (1)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
🪛 Ruff (0.8.0)
snakemake_wrapper_utils/snakemake.py
26-26: SyntaxError: Expected a statement
26-27: SyntaxError: Expected a statement
🔇 Additional comments (2)
snakemake_wrapper_utils/snakemake.py (2)
4-4
: LGTM!
The docstring improvement adds proper punctuation.
29-43
: Consider these improvements to the format detection implementation
While the core logic is sound, there are several potential improvements:
- Move the
pathlib
import to the top of the file:
+from pathlib import Path
+
def get_mem(snakemake, out_unit="MiB"):
...
def get_format(path):
"""Get file format from extension, ignoring common compressions."""
- from pathlib import Path
- Consider making the format mappings more maintainable:
+# Define format mappings at module level
+FORMAT_MAPPINGS = {
+ 'fastq': ('.fq', '.fastq'),
+ 'fasta': ('.fa', '.fas', '.fna', '.fasta'),
+}
+
def get_format(path):
"""Get file format from extension, ignoring common compressions."""
exts = Path(path).suffixes
if exts[-1] in (".gz", ".bgz", ".bz2"):
ext = exts[-2]
else:
ext = exts[-1]
- if ext in (".fq", ".fastq"):
- return "fastq"
- elif ext in (".fa", ".fas", ".fna", ".fasta"):
- return "fasta"
- else:
- return ext.lower()
+ for fmt, extensions in FORMAT_MAPPINGS.items():
+ if ext in extensions:
+ return fmt
+ return ext.lower()
Let's verify if there are any other file format mappings we should consider:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
snakemake_wrapper_utils/snakemake.py (2)
38-43
: Consider using a mapping dictionary for format definitionsThe current if-elif chain could be replaced with a more maintainable mapping structure.
Consider refactoring to:
+ FORMAT_MAPPING = { + ".fq": "fastq", + ".fastq": "fastq", + ".fa": "fasta", + ".fas": "fasta", + ".fna": "fasta", + ".fasta": "fasta" + } - if ext in (".fq", ".fastq"): - return "fastq" - elif ext in (".fa", ".fas", ".fna", ".fasta"): - return "fasta" - else: - return ext.lower() + return FORMAT_MAPPING.get(ext, ext.lower())
31-31
: Move import statement to top of fileThe
pathlib
import should be at the module level rather than inside the function.Move the import statement to the top of the file with other imports.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
snakemake_wrapper_utils/snakemake.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake_wrapper_utils/snakemake.py (1)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
🪛 Ruff (0.8.0)
snakemake_wrapper_utils/snakemake.py
26-26: SyntaxError: Expected a statement
26-27: SyntaxError: Expected a statement
🔇 Additional comments (3)
snakemake_wrapper_utils/snakemake.py (3)
4-4
: LGTM!
The docstring improvement enhances readability.
25-26
:
Fix syntax error in return statement
There's an invalid colon after the return statement which would cause a syntax error.
Apply this fix:
- return arg in cmd.replace("=", " ").split(" "):
+ return arg in cmd.replace("=", " ").split(" ")
🧰 Tools
🪛 Ruff (0.8.0)
26-26: SyntaxError: Expected a statement
29-43
:
Add input validation and improve robustness
The function needs input validation and could be more robust in handling edge cases.
Apply these improvements:
def get_format(path):
"""Get file format from extension, ignoring compression."""
from pathlib import Path
+ if not path:
+ raise ValueError("Path cannot be empty")
exts = Path(path).suffixes
+ if not exts:
+ raise ValueError("Path must have an extension")
+
+ compression_exts = {".gz", ".bgz", ".bz2"}
- if exts[-1] in (".gz", ".bgz", ".bz2"):
+ if exts and exts[-1] in compression_exts:
ext = exts[-2] if len(exts) > 1 else ""
else:
- ext = exts[-1]
+ ext = exts[-1] if exts else ""
Likely invalid or redundant comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
snakemake_wrapper_utils/snakemake.py (1)
38-43
: Consider improving maintainabilityThe file format mappings could be more maintainable by using constants or a mapping dictionary.
+# At module level +FASTQ_EXTENSIONS = frozenset((".fq", ".fastq")) +FASTA_EXTENSIONS = frozenset((".fa", ".fas", ".fna", ".fasta")) + def get_format(path): """Get file format from extension, ignoring compression.""" from pathlib import Path exts = Path(path).suffixes if exts[-1] in (".gz", ".bgz", ".bz2"): ext = exts[-2] else: ext = exts[-1] - if ext in (".fq", ".fastq"): + if ext in FASTQ_EXTENSIONS: return "fastq" - elif ext in (".fa", ".fas", ".fna", ".fasta"): + elif ext in FASTA_EXTENSIONS: return "fasta" else: return ext.lower()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
snakemake_wrapper_utils/snakemake.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake_wrapper_utils/snakemake.py (1)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
🪛 Ruff (0.8.0)
snakemake_wrapper_utils/snakemake.py
33-36: Use ternary operator ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
instead of if
-else
-block
Replace if
-else
-block with ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
(SIM108)
🔇 Additional comments (4)
snakemake_wrapper_utils/snakemake.py (4)
4-4
: LGTM!
The docstring formatting improvement is appropriate.
25-26
: LGTM!
The simplified return statement and docstring update improve code clarity while maintaining the same functionality.
29-43
: Implementation aligns well with PR objectives
The function successfully handles file format detection while ignoring compression extensions, which matches the PR objectives perfectly.
🧰 Tools
🪛 Ruff (0.8.0)
33-36: Use ternary operator ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
instead of if
-else
-block
Replace if
-else
-block with ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
(SIM108)
31-36
: 🛠️ Refactor suggestion
Add input validation and handle edge cases
The function should validate its input and handle edge cases gracefully.
def get_format(path):
"""Get file format from extension, ignoring compression."""
from pathlib import Path
+ if not path:
+ raise ValueError("Path cannot be empty")
exts = Path(path).suffixes
+ if not exts:
+ raise ValueError("Path must have an extension")
+
if exts[-1] in (".gz", ".bgz", ".bz2"):
🧰 Tools
🪛 Ruff (0.8.0)
33-36: Use ternary operator ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
instead of if
-else
-block
Replace if
-else
-block with ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
(SIM108)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
snakemake_wrapper_utils/snakemake.py (1)
41-46
: Consider making format mappings more maintainableThe format mappings are hardcoded in the function. Consider extracting them to a configuration dictionary for better maintainability and extensibility.
+FILE_FORMAT_MAPPINGS = { + "fastq": (".fq", ".fastq"), + "fasta": (".fa", ".fas", ".fna", ".fasta"), +} def get_format(path): """Get file format from extension, ignoring common compressions.""" if not path: raise ValueError("Path cannot be empty") exts = Path(path).suffixes if not exts: raise ValueError("Path must have an extension") if exts[-1] in (".gz", ".bgz", ".bz2"): ext = exts[-2] else: ext = exts[-1] - if ext in (".fq", ".fastq"): - return "fastq" - elif ext in (".fa", ".fas", ".fna", ".fasta"): - return "fasta" - else: - return ext.lstrip(".").lower() + for format_name, extensions in FILE_FORMAT_MAPPINGS.items(): + if ext in extensions: + return format_name + return ext.lstrip(".").lower()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
snakemake_wrapper_utils/snakemake.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake_wrapper_utils/snakemake.py (1)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
🪛 Ruff (0.8.0)
snakemake_wrapper_utils/snakemake.py
33-33: Undefined name Path
(F821)
36-39: Use ternary operator ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
instead of if
-else
-block
Replace if
-else
-block with ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
(SIM108)
🔇 Additional comments (3)
snakemake_wrapper_utils/snakemake.py (3)
4-4
: LGTM!
The docstring update improves consistency by adding proper punctuation.
25-26
: LGTM!
The changes improve documentation consistency and code simplicity while maintaining the same functionality.
36-39
: Verify compression extension handling
Let's verify that all possible compression extensions in the codebase are handled.
✅ Verification successful
Compression extensions are appropriately handled for the codebase's needs
The code correctly handles the common compression extensions (.gz, .bgz, .bz2) found in the codebase. From the search results, these are the only compression formats actually used in the codebase, particularly in the context of VCF files (.vcf.gz) and other bioinformatics data. The get_format
function appropriately strips these compression extensions to determine the underlying file format, which is exactly what's needed for the file types being processed (FASTQ, FASTA, and other bioinformatics formats).
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for file extensions that might indicate compression
rg -g '!*.py' '\.(gz|bz2|zip|xz|zst|lz4|bgz)' --type-add 'workflow:*.{smk,snakefile,workflow}' -t workflow
Length of output: 257
Script:
#!/bin/bash
# Let's search for compression-related patterns in Python files
rg '\.(gz|bz2|zip|xz|zst|lz4|bgz)' -t python
# Also search for file extension handling patterns
ast-grep --pattern 'if $_ in ($$$):'
Length of output: 116
Script:
#!/bin/bash
# Search for compression-related patterns in Python files
rg '\.(gz|bz2|zip|xz|zst|lz4|bgz)' --type py
# Let's also look at the entire file content to understand the context better
cat snakemake_wrapper_utils/snakemake.py
Length of output: 1664
🧰 Tools
🪛 Ruff (0.8.0)
36-39: Use ternary operator ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
instead of if
-else
-block
Replace if
-else
-block with ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
(SIM108)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
snakemake_wrapper_utils/snakemake.py (1)
29-47
: Consider making the format mappings more maintainable and extensibleWhile the implementation is functionally correct, consider making the format mappings more maintainable by using class-level dictionaries or configuration. This would make it easier to add new formats in the future.
Example approach:
class FileFormat: COMPRESSION_EXTENSIONS = {".gz", ".bgz", ".bz2"} FORMAT_MAPPINGS = { "fastq": {".fq", ".fastq"}, "fasta": {".fa", ".fas", ".fna", ".fasta"}, } @classmethod def get_format(cls, path): """Get file format from extension, ignoring common compressions.""" if not path: raise ValueError("Path cannot be empty") exts = Path(path).suffixes if not exts: raise ValueError("Path must have an extension") ext = exts[-2] if exts[-1] in cls.COMPRESSION_EXTENSIONS else exts[-1] for format_name, extensions in cls.FORMAT_MAPPINGS.items(): if ext in extensions: return format_name return ext.lstrip(".").lower()🧰 Tools
🪛 Ruff (0.8.0)
37-40: Use ternary operator
ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
instead ofif
-else
-blockReplace
if
-else
-block withext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
(SIM108)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
snakemake_wrapper_utils/snakemake.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake_wrapper_utils/snakemake.py (1)
Pattern **/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self
argument of methods.
Do not suggest type annotation of the cls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return
statement.
🪛 Ruff (0.8.0)
snakemake_wrapper_utils/snakemake.py
37-40: Use ternary operator ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
instead of if
-else
-block
Replace if
-else
-block with ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
(SIM108)
🔇 Additional comments (3)
snakemake_wrapper_utils/snakemake.py (3)
4-4
: LGTM!
The docstring formatting improvement enhances consistency.
25-26
: LGTM!
The changes improve both documentation consistency and code conciseness while maintaining the same functionality.
37-40
: LGTM!
While a ternary operator could be used here, the current if-else block is more readable and maintainable.
🧰 Tools
🪛 Ruff (0.8.0)
37-40: Use ternary operator ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
instead of if
-else
-block
Replace if
-else
-block with ext = exts[-2] if exts[-1] in (".gz", ".bgz", ".bz2") else exts[-1]
(SIM108)
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
is_arg
function for improved readability.