feat: support remote snakefile via gh/gl shorthand#4161
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds remote Snakefile support and hosted-source shorthand ( Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI as CLI / SnakemakeApi
participant Resolver as resolve_snakefile
participant Inferrer as infer_source_file
participant Provider as HostingProviderFile
participant Remote as Remote Host
participant FS as Local Filesystem
User->>CLI: request workflow(snakefile = "gh:owner/repo@ref[:path]" / URL / Path)
CLI->>Resolver: resolve_snakefile(snakefile)
Resolver->>Inferrer: infer_source_file(input)
Inferrer->>Provider: parse shorthand or recognize URL
alt shorthand or remote URL
Provider->>Remote: construct raw content URL (encode ref/path) and GET
Remote-->>Provider: Snakefile content or 200
Provider-->>Resolver: SourceFile (URI)
else local Path
Inferrer->>FS: verify/resolve local Path
FS-->>Resolver: Path
end
Resolver-->>CLI: resolved snakefile (Path | URI)
CLI->>User: execute workflow / build DAG with resolved includes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/snakemake/cli.py (1)
546-554:⚠️ Potential issue | 🟡 MinorMention URLs and
gh:/gl:shorthand in--snakefilehelp.The parser now accepts non-local Snakefile strings, but the help still says “FILE”/“snakefile” only, making the new CLI feature hard to discover.
Proposed help text update
help=( - "The workflow definition in form of a snakefile. " + "The workflow definition as a local Snakefile, HTTP(S) URL, " + "or hosted-source shorthand like gh:owner/repo@ref:path. " "Usually, you should not need to specify this. "🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/snakemake/cli.py` around lines 546 - 554, Update the --snakefile argument help (the help string built around SNAKEFILE_CHOICES in src/snakemake/cli.py) to mention that the argument accepts non-local sources: URLs (http/https), GitHub/GitLab shorthand (gh:owner/repo[:path] and gl:owner/repo[:path]), and local file paths; keep the existing sentence about default search order using SNAKEFILE_CHOICES but append a short example of the allowed non-local forms so users discover the feature when running --help for --snakefile.
🧹 Nitpick comments (2)
tests/test_api.py (1)
87-91: Consider also asserting the full expected output set matches.Current check only verifies each expected file exists and has the right MD5, but does not catch extra unexpected outputs. Optional; the existing
test_remote_snakefile_multiple_includesintests/tests.pyalready covers equivalence viarun(...), so this is nice-to-have.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_api.py` around lines 87 - 91, The test currently iterates get_expected_files(expected_results) and checks existence and MD5 but doesn't fail on extra files; after validating each expected file (using get_expected_files, workdir, expected_results, md5sum), compute the actual set of relative file paths under workdir (e.g., via workdir.rglob and relative_to(workdir) for files) and assert that set(actual_relpaths) == set(get_expected_files(expected_results)) so any unexpected outputs cause the test to fail.docs/executing/cli.rst (1)
40-40: Minor: "snakedeploy" link text / positioning.The inline
snakedeploylink inside a.. note::is fine, but the phrasing "usually more convenient" reads slightly awkward next to "it will only retrieve the snakefiles and everything directly referred by them". Consider "…and anything directly referenced by them…" for readability. Non-blocking nit.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/executing/cli.rst` at line 40, Update the note sentence that currently reads "will only retrieve the snakefiles and everything directly referred by them, allowing an ad-hoc run." to improve readability by replacing "everything directly referred by them" with "anything directly referenced by them" and keep the inline snakedeploy link attached to the callout "use snakedeploy" (i.e., ensure the link remains on the phrase "snakedeploy" following "use") so the sentence becomes clearer while preserving the link target.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/snakefiles/deployment.rst`:
- Around line 89-92: The documentation incorrectly implies that adding an
explicit path enables persistent caching; instead update the paragraph that
discusses the module dna_seq (tag v2.0.1) and the shorthand examples
(gh:snakemake-workflows/dna-seq-gatk-variant-calling@v2.0.1 and
gh:snakemake-workflows/dna-seq-gatk-variant-calling@v2.0.1:workflow/Snakefile)
to state that cacheability is determined by using an immutable reference (a tag
or commit), not by including an explicit path, and clarify that both shorthand
forms shown are cacheable because they use the tag v2.0.1 while the presence or
absence of the path does not affect caching.
In `@docs/snakefiles/modularization.rst`:
- Around line 399-402: Add a short paragraph after the hosted source syntax
examples explaining that using the provider shorthand (e.g.,
gh:owner/repo@ref:path/to/Snakefile or gl:group/project@ref:path/to/Snakefile)
is preferred when refs (branch or tag names) may contain slashes, because
attempting to parse full hosted/raw URLs (including
GitHub/raw.githubusercontent.com URLs) to infer a GithubFile is fundamentally
ambiguous when refs contain “/”; document that full URLs remain generic and can
be ambiguous and therefore recommend the provider-prefixed shorthand as the
unambiguous form.
In `@src/snakemake/gui.py`:
- Around line 74-77: The direct call to is_local_file(args.snakefile) can raise
for URI-like inputs and crash GUI registration; wrap the call in a try/except
and on any exception treat the value as a remote path (i.e., set
app.extensions["snakefilepath"] = args.snakefile). Specifically, modify the
block that sets app.extensions["snakefilepath"] to call
is_local_file(args.snakefile) inside a try, use os.path.abspath(...) when it
returns True, and in the except branch (or when it returns False) assign the
original args.snakefile so remote/URI values are preserved; reference
is_local_file and the app.extensions["snakefilepath"] assignment to locate the
code.
In `@src/snakemake/sourcecache.py`:
- Around line 852-855: The cache key for hosted repositories only uses self.repo
and can collide across different providers/hosts; update the hosted_repo cache
to include provider and host so clones are unique per (provider, host, repo).
Locate the hosted_repo method/attribute and change the cache dictionary key
generation (where self.repo is currently used) to include self.provider (or
provider) and self.host (e.g., use a tuple like (self.provider, self.host,
self.repo) or a concatenated string) so that GithubFile and GitlabFile instances
with different hosts/providers do not share the same entry; ensure any lookups
and inserts use the same new composite key.
- Around line 662-668: The cache key collision comes from encoding the ref in
GithubFile.get_path_or_uri() (turning "perf/autobump" into "perf%2Fautobump")
while SourceFile.get_cache_path() later unquotes the URL path, making
"perf%2Fautobump" and "perf:autobump" collapse; fix by preserving encoded ref
when constructing cache keys instead of unquoting, i.e. change
SourceFile.get_cache_path() to use the already-encoded ref from
GithubFile.get_path_or_uri() (or re-encode the ref using quote(safe="") before
building the cache path) so slashes in refs remain %2F and distinct; update any
unquote(...) call that currently normalizes the URL path to avoid reverting
encoded slashes.
In `@tests/common.py`:
- Around line 46-47: The log_message method in tests/common.py currently uses
the parameter name "format" which shadows the built-in and triggers Ruff A002;
rename that parameter (e.g., to "fmt" or "message_format") in the def
log_message(self, ...) signature and update all usages inside the log_message
function (and any local references) to the new name so calls and behavior remain
identical while avoiding the builtin shadowing.
---
Outside diff comments:
In `@src/snakemake/cli.py`:
- Around line 546-554: Update the --snakefile argument help (the help string
built around SNAKEFILE_CHOICES in src/snakemake/cli.py) to mention that the
argument accepts non-local sources: URLs (http/https), GitHub/GitLab shorthand
(gh:owner/repo[:path] and gl:owner/repo[:path]), and local file paths; keep the
existing sentence about default search order using SNAKEFILE_CHOICES but append
a short example of the allowed non-local forms so users discover the feature
when running --help for --snakefile.
---
Nitpick comments:
In `@docs/executing/cli.rst`:
- Line 40: Update the note sentence that currently reads "will only retrieve the
snakefiles and everything directly referred by them, allowing an ad-hoc run." to
improve readability by replacing "everything directly referred by them" with
"anything directly referenced by them" and keep the inline snakedeploy link
attached to the callout "use snakedeploy" (i.e., ensure the link remains on the
phrase "snakedeploy" following "use") so the sentence becomes clearer while
preserving the link target.
In `@tests/test_api.py`:
- Around line 87-91: The test currently iterates
get_expected_files(expected_results) and checks existence and MD5 but doesn't
fail on extra files; after validating each expected file (using
get_expected_files, workdir, expected_results, md5sum), compute the actual set
of relative file paths under workdir (e.g., via workdir.rglob and
relative_to(workdir) for files) and assert that set(actual_relpaths) ==
set(get_expected_files(expected_results)) so any unexpected outputs cause the
test to fail.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9017739d-5007-44eb-8eb5-64d456dadcae
📒 Files selected for processing (11)
docs/executing/cli.rstdocs/snakefiles/deployment.rstdocs/snakefiles/modularization.rstsrc/snakemake/api.pysrc/snakemake/cli.pysrc/snakemake/gui.pysrc/snakemake/sourcecache.pytests/common.pytests/test_api.pytests/test_sourcecache.pytests/tests.py
There was a problem hiding this comment.
♻️ Duplicate comments (2)
src/snakemake/sourcecache.py (2)
714-717:⚠️ Potential issue | 🟠 MajorInclude provider and host in the hosted-repo cache key.
These shorthand constructors can now produce the same
repoon different providers/hosts, butHostingProviderFile.hosted_repostill caches byself.repoonly. The second source can reuse the first clone and read from the wrong remote.Proposed fix
-from typing import TYPE_CHECKING, Any, ClassVar, Dict, List, Optional +from typing import TYPE_CHECKING, Any, ClassVar, Dict, List, Optional, Tuple @@ - _hosted_repos: ClassVar[Dict[str, HostedGitRepo]] = {} + _hosted_repos: ClassVar[Dict[Tuple[type, str, str], HostedGitRepo]] = {} @@ `@property` def hosted_repo(self) -> HostedGitRepo: + assert self.host is not None + cache_key = (self.__class__, self.host, self.repo) try: - return self._hosted_repos[self.repo] + return self._hosted_repos[cache_key] except KeyError: - assert self.host is not None if self.host.startswith("https://") or self.host.startswith("http://"): raise WorkflowError( @@ # Ensure that multiple threads don't concurrently create the same instance with self._lock: - hosted_repo = HostedGitRepo( - self.repo, self.cache_path, self.auth, self.host - ) - self._hosted_repos[self.repo] = hosted_repo - return hosted_repo + try: + return self._hosted_repos[cache_key] + except KeyError: + hosted_repo = HostedGitRepo( + self.repo, self.cache_path, self.auth, self.host + ) + self._hosted_repos[cache_key] = hosted_repo + return hosted_repo🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/snakemake/sourcecache.py` around lines 714 - 717, The hosted-repo cache is keyed only by self.repo, so shorthand constructors (GithubFile/GitlabFile) that differ by provider or host can collide; update the caching key used by hosted_repo to include provider and host (e.g., use a tuple like (self.repo, self.provider, self.host) or equivalent) and ensure GithubFile and GitlabFile instances expose the provider/host attributes used in that key so clones for different providers/hosts are cached separately.
636-642:⚠️ Potential issue | 🟠 MajorKeep slashed refs distinct in hosted-file cache paths.
refis encoded here, butSourceFile.get_cache_path()later unquotes the URI path. That can makegh:o/r@perf/autobump:path/to/Snakefileshare a cache entry withgh:o/r@perf:autobump/path/to/Snakefile.Proposed fix
-from urllib.parse import unquote +from urllib.parse import quote, unquote @@ class HostingProviderFile(SourceFile): @@ def get_filename(self): return os.path.basename(self.path) + def get_cache_path(self): + assert self.host is not None + return os.path.join( + "hosted-git", + self.__class__.__name__.lower(), + self.host, + quote(self.repo, safe="/"), + quote(self.ref, safe=""), + quote(self.path, safe="/"), + ) + @@ def get_path_or_uri(self, secret_free: bool) -> str: - from urllib.parse import quote - auth = f":{self.token}@" if self.token and not secret_free else ""Verification:
#!/bin/bash # Demonstrate how the current quote-then-unquote cache-key path collapses # a slashed ref into the same path as a shorter ref plus deeper file path. python - <<'PY' from urllib.parse import quote, unquote import os def current_cache_suffix(ref, path): uri_path = f"raw.githubusercontent.com/o/r/{quote(ref, safe='')}/{quote(path, safe='/')}" return os.path.join("https", unquote(uri_path.lstrip("/"))) a = current_cache_suffix("perf/autobump", "path/to/Snakefile") b = current_cache_suffix("perf", "autobump/path/to/Snakefile") print(a) print(b) assert a == b, "Expected current cache-key derivation to collide" PY🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/snakemake/sourcecache.py` around lines 636 - 642, The cache collision happens because ref is percent-encoded (quote(self.ref, safe="")) then later unquoted, turning %2F back into '/' and collapsing distinct refs into the same cache path; change the encoding so slashes in the ref remain literal path separators by using quote(self.ref, safe="/") instead of safe="" in the code that builds the hosted-file URL (the block that sets ref = quote(self.ref, ...) and path = quote(self.path, ...)); this preserves slashed refs as distinct path segments when SourceFile.get_cache_path() later unquotes the URI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/snakemake/sourcecache.py`:
- Around line 714-717: The hosted-repo cache is keyed only by self.repo, so
shorthand constructors (GithubFile/GitlabFile) that differ by provider or host
can collide; update the caching key used by hosted_repo to include provider and
host (e.g., use a tuple like (self.repo, self.provider, self.host) or
equivalent) and ensure GithubFile and GitlabFile instances expose the
provider/host attributes used in that key so clones for different
providers/hosts are cached separately.
- Around line 636-642: The cache collision happens because ref is
percent-encoded (quote(self.ref, safe="")) then later unquoted, turning %2F back
into '/' and collapsing distinct refs into the same cache path; change the
encoding so slashes in the ref remain literal path separators by using
quote(self.ref, safe="/") instead of safe="" in the code that builds the
hosted-file URL (the block that sets ref = quote(self.ref, ...) and path =
quote(self.path, ...)); this preserves slashed refs as distinct path segments
when SourceFile.get_cache_path() later unquotes the URI.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 32591b33-601f-4b4e-aaa7-f30a683209f1
📒 Files selected for processing (1)
src/snakemake/sourcecache.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/snakemake/sourcecache.py`:
- Around line 696-701: In the shorthand parsing block (variables first_colon,
first_slash, host, remainder), ensure an empty host (e.g., "gh::owner/repo") is
not accepted: change the logic so after finding first_colon you extract
candidate = remainder[:first_colon] and only set host = candidate and advance
remainder = remainder[first_colon+1:] if candidate is non-empty; if candidate is
empty, treat it as no custom host (set host = None) and do not consume the colon
so the shorthand remains valid against the default host handling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f9167ee6-e584-43fd-b048-abd73a58222b
📒 Files selected for processing (7)
docs/snakefiles/deployment.rstdocs/snakefiles/modularization.rstsrc/snakemake/gui.pysrc/snakemake/sourcecache.pytests/common.pytests/test_api.pytests/test_sourcecache.py
✅ Files skipped from review due to trivial changes (2)
- docs/snakefiles/modularization.rst
- docs/snakefiles/deployment.rst
🚧 Files skipped from review as they are similar to previous changes (3)
- src/snakemake/gui.py
- tests/common.py
- tests/test_sourcecache.py
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/snakemake/sourcecache.py (1)
733-734: Trivial wrapper.
_infer_hosting_provider_filecurrently just forwards to_infer_hosting_provider_file_shorthand. If you don't plan to add other inference branches in this PR, you can inline the call at line 749 and drop this function.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/snakemake/sourcecache.py` around lines 733 - 734, The function _infer_hosting_provider_file is a trivial passthrough to _infer_hosting_provider_file_shorthand; remove the wrapper and replace its call sites with direct calls to _infer_hosting_provider_file_shorthand(path_or_uri) (preserving the Optional[HostingProviderFile] return expectation), then delete the _infer_hosting_provider_file definition to avoid dead indirection.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/snakemake/sourcecache.py`:
- Around line 705-730: The parser currently allows an empty ref (e.g.,
gh:owner/repo@:path) because ref is set to "" when ref_cand is empty; update the
shorthand parsing around ref_cand/ref (after ref_path.rpartition(":")) to reject
an empty ref by returning None when sep is truthy but ref_cand == "" (or
equivalently when ref == ""); this prevents creating GithubFile/GitlabFile with
branch="" and lets _check_git_args see a missing branch instead of an empty
string.
- Around line 751-754: The except block that calls is_local_file(path_or_uri)
currently swallows NotImplementedError, ImportError, and ValueError and silently
treats the path as non-local; update the handling in sourcecache.py so that when
is_local_file raises one of these exceptions you either (a) log the exception at
debug level (including the exception type and message) before setting
is_local=False to preserve diagnostics, or (b) add a clear comment above the
try-except explaining that malformed/unsupported URIs intentionally fall back to
GenericSourceFile; reference is_local_file, path_or_uri, and the fallback to
GenericSourceFile in your change so reviewers can find the location easily.
---
Nitpick comments:
In `@src/snakemake/sourcecache.py`:
- Around line 733-734: The function _infer_hosting_provider_file is a trivial
passthrough to _infer_hosting_provider_file_shorthand; remove the wrapper and
replace its call sites with direct calls to
_infer_hosting_provider_file_shorthand(path_or_uri) (preserving the
Optional[HostingProviderFile] return expectation), then delete the
_infer_hosting_provider_file definition to avoid dead indirection.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5aeab04e-21c6-4e49-ba3d-9abe6994eed9
📒 Files selected for processing (2)
src/snakemake/sourcecache.pytests/test_sourcecache.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_sourcecache.py
johanneskoester
left a comment
There was a problem hiding this comment.
Hi Clemens! Thanks a lot, this is very nice. Please find my comments below.
| # Unsupported or malformed URIs should fall through to GenericSourceFile below. | ||
| try: | ||
| is_local = is_local_file(path_or_uri) | ||
| except (NotImplementedError, ImportError, ValueError): |
There was a problem hiding this comment.
Why would these exceptions occur in is_local_file? It worries me that we have to keep this in mind everytime we use is_local_file (this PR does it above as well). Ideally, is_local_file should not throw exceptions other than stuff caused by bugs, which in turn should not need to be handled.
There was a problem hiding this comment.
Agreed. I changed is_local_file() so it behaves as a direct predicate for normal path/URI inputs instead of depending on smart_open.parse_uri(). Remote or unsupported schemes such as gh:, gl:, and s3:// now return False without callers needing to catch transport/parser exceptions.
| if isinstance(path, str): | ||
| try: | ||
| is_local = is_local_file(path) | ||
| except (NotImplementedError, ImportError, ValueError): |
There was a problem hiding this comment.
see my comment on the other try-except around is_local_file below.
There was a problem hiding this comment.
Addressed via the same helper change. resolve_snakefile() now calls is_local_file() directly and preserves remote strings unchanged.
| app.extensions["snakefilepath"] = os.path.abspath(args.snakefile) | ||
| try: | ||
| is_local = is_local_file(args.snakefile) | ||
| except (NotImplementedError, ImportError, ValueError): |
There was a problem hiding this comment.
Addressed via the same helper change. The GUI path handling now calls is_local_file() directly and keeps remote snakefile values unchanged.
| self._hosted_repos[self.repo] = hosted_repo | ||
| return hosted_repo | ||
| try: | ||
| return self._hosted_repos[cache_key] |
There was a problem hiding this comment.
Why the second try here to get the object from the cache dict? Didn't we do that before in line 515?
There was a problem hiding this comment.
I rewrote this to make the intent clearer. The second lookup is still there, now in the form of .get() inside the lock, to handle the case where another thread populates the cache after the initial fast-path miss but before this thread acquires _lock. That preserves one HostedGitRepo instance per cache key. If you prefer removing the inner check entirely, let me know.
| def _infer_hosting_provider_file(path_or_uri: str) -> Optional[HostingProviderFile]: | ||
| return _infer_hosting_provider_file_shorthand(path_or_uri) |
There was a problem hiding this comment.
Why this function if it just delegates?
There was a problem hiding this comment.
You're right. I removed the wrapper and now call _infer_hosting_provider_file_shorthand() directly.
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
b633509 to
b042c7c
Compare
I had the impression that #4077 was too big to be reviewed and that the preference was to initially only support the
gh/glshorthand. I've therefore extracted this part out so that it can be reviewed more easily and eventually be merged. I think that almost everything has been reviewed already, but please take another look.This PR adds support for using remote Snakefiles via the explicit hosted-source shorthand syntax:
gh:owner/repo@ref:path/to/Snakefilegl:group/project@ref:path/to/SnakefileThe trailing path remains optional and defaults to
workflow/Snakefile.The explicit shorthand form should be unambiguous and avoids the review concerns around parsing full hosted URLs whose refs may contain
/. Generic remote Snakefile execution via plain HTTP/HTTPS URLs still remains supported; this PR only avoids the provider-specific inference from full GitHub/GitLab URLs.QC
docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).AI-assistance disclosure
I used AI assistance for:
Summary by CodeRabbit
New Features
Improvements
Documentation
Tests