-
Notifications
You must be signed in to change notification settings - Fork 1
fix: allow multiple references in mehari server run #749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThe changes update the handling of reference genome paths in the server's run module. The Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Server
participant Annotator
User->>Server: Start with Args (multiple reference paths)
Server->>Server: Validate reference count vs genome releases
Server->>Server: For each genome release, match reference path by name
Server->>Annotator: Build ConsequenceAnnotator with matched reference path
Server->>Server: Log loaded sources
Server-->>User: Complete setup or return error
Possibly related PRs
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/server/run/mod.rs (2)
264-288
: Consider a more robust mechanism for matching references to genome releasesThe current implementation uses substring matching on filenames, which could lead to incorrect matches if filenames contain both "GRCh37" and "GRCh38". Additionally, if multiple references match the same genome release, only the first match is used.
A more robust approach might be to:
- Use a more explicit mapping syntax in the CLI, e.g.,
--reference GRCh38=path/to/reference.fna
- Or implement a fallback pattern-matching approach with stricter rules
let reference = { args.reference .iter() .filter_map(|reference_path| { + // First try for exact matches in format like "GRCh38.fna" or "GRCh38_reference.fna" + let filename = reference_path.file_name().map(|f| f.to_string_lossy().to_lowercase()); + let matches_pattern = filename.as_ref().map_or(false, |name| { + name == genome_release.name().to_lowercase() || + name.starts_with(&format!("{}.", genome_release.name().to_lowercase())) || + name.starts_with(&format!("{}_", genome_release.name().to_lowercase())) + }); + + if matches_pattern { + tracing::info!( + "Found reference genome for {:?}: {}", + genome_release, + reference_path.display() + ); + return Some(reference_path.clone()); + } + // Fall back to substring matching if reference_path .file_name() .map(|f| { f.to_string_lossy() .to_lowercase() .contains(&genome_release.name().to_lowercase()) }) .unwrap_or(false) { tracing::info!( "Found reference genome for {:?}: {}", genome_release, reference_path.display() ); Some(reference_path.clone()) } else { None } }) .next() };
294-323
: Consider adding a check for ambiguous reference matchesThe current implementation doesn't warn users if a reference file might match multiple genome releases.
Before this section, you could add code to check if any reference matches multiple genome releases and log a warning:
// Check for ambiguous reference matches let mut ambiguous_refs = Vec::new(); for ref_path in &args.reference { let matches = GenomeRelease::value_variants() .iter() .filter(|gr| { ref_path .file_name() .map(|f| { f.to_string_lossy() .to_lowercase() .contains(&gr.name().to_lowercase()) }) .unwrap_or(false) }) .count(); if matches > 1 { ambiguous_refs.push(ref_path); } } if !ambiguous_refs.is_empty() { tracing::warn!( "The following reference paths match multiple genome releases and may be ambiguous: {:?}", ambiguous_refs ); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/server/run/mod.rs
(6 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/server/run/mod.rs (3)
src/common/mod.rs (1)
new
(151-156)src/annotate/seqvars/mod.rs (4)
new
(1390-1392)new
(1677-1679)new
(1822-1824)new
(1886-1888)src/annotate/strucvars/csq.rs (1)
new
(453-460)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Schema
- GitHub Check: Testing
- GitHub Check: Linting
- GitHub Check: build-and-push-image
🔇 Additional comments (4)
src/server/run/mod.rs (4)
120-125
: Clear and helpful documentation for the multiple reference paths.The updated documentation helpfully explains that reference paths must contain "GRCh37" or "GRCh38" for correct matching, which is crucial information for users.
290-292
: Good warning message for unmatched referencesThis warning is helpful to users when they provide references but none match a particular genome release.
379-380
: Helpful log message for loaded sourcesAdding this log message helps users understand which sources were successfully loaded.
256-259
: 🛠️ Refactor suggestionConsider allowing partial specification of references
The current implementation requires either zero references or exactly the number of supported genome releases (currently 2 - GRCh37 and GRCh38). This prevents users from specifying a reference for just one genome release, which might be a common use case.
-if !(0..=GenomeRelease::value_variants().len()).contains(&args.reference.len()) { - return Err(anyhow!("Supplied number of reference genomes does not match the number of supported genome releases ({:?}).", - GenomeRelease::value_variants())); -}You could remove this validation entirely and rely on the warnings when references are provided but none match a particular genome release (lines 290-292).
Likely an incorrect or invalid review comment.
Support multiple references in
mehari server run
, e.g.Note that this relies on the strings "GRCh38" or "GRCh37" to be present in the file path (the check is done on lowercase strings) at the moment, i.e. there is no sophisticated logic for matching the --transcripts to the --reference (and also no cli based matching, e.g. something along the lines of
--transcripts GRCh38=foo.bin.zst --transcripts GRCH37=bar.bin.zst --reference GRCh38=baz.fasta --reference GRCh37=etc
).Resolves #716
Summary by CodeRabbit
New Features
Bug Fixes
Documentation