@FelixMoelder FelixMoelder commented Oct 31, 2025

In its current configuration, the workflow writes a large number of files into single folders (see #397). This mainly affects the results/calls, results/candidate-calls and results/final-calls directories.
To reduce the number of files per folder, the results/calls directory will contain several new subdirectories, one for each processing step of the BCF files.
In addition, there will be a subdirectory for each group, because scatteritems still lead to a lot of files.
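
To make the motivation concrete, the following minimal sketch counts how many files land in a single directory under a flat layout versus a nested one. The group names, caller list, scatter count, the flat template and the .bcf extension are all assumptions chosen for illustration; only the nested {caller}/{group}/{group}.{scatteritem} shape is taken from this PR.

```python
from collections import Counter
from itertools import product
from pathlib import PurePosixPath

# Hypothetical values; the real workflow derives these from its config and sample sheet.
GROUPS = [f"group{i}" for i in range(1, 51)]
CALLERS = ["freebayes", "delly"]
SCATTERITEMS = [f"{i}-of-16" for i in range(1, 17)]

# Assumed flat template (illustrative only) and a nested template in the style of this PR.
FLAT = "results/calls/{group}.{caller}.{scatteritem}.bcf"
NESTED = "results/calls/varlociraptor/{caller}/{group}/{group}.{scatteritem}.bcf"


def files_per_directory(template):
    """Count how many files each directory receives for a given path template."""
    counts = Counter()
    for group, caller, item in product(GROUPS, CALLERS, SCATTERITEMS):
        path = PurePosixPath(
            template.format(group=group, caller=caller, scatteritem=item)
        )
        counts[str(path.parent)] += 1
    return counts


flat_counts = files_per_directory(FLAT)
nested_counts = files_per_directory(NESTED)

print(max(flat_counts.values()))    # 1600 files in one directory
print(max(nested_counts.values()))  # 16 files per directory
print(len(nested_counts))           # 100 directories
```

With these assumed numbers, the nested layout trades one directory of 1600 files for 100 directories of 16 files each.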

Summary by CodeRabbit

  • Chores
    • Reorganized workflow outputs, inputs and logs into a deeper group- and caller-scoped directory layout (affects where results and logs are written/read).
    • Updated Free Disk Space GitHub Action to v1.3.1.
    • Updated SRA data-retrieval tool wrapper to a newer version.
  • Documentation
    • Added top-level workflow header comments describing produced file organization.

coderabbitai bot commented Oct 31, 2025

Walkthrough

Restructured Snakemake file and log paths to nest {caller} and {group} directories across many rules, added two helper functions, bumped the Free Disk Space GitHub Action version, and updated the sra-tools/fasterq-dump wrapper. No algorithmic or control-flow changes.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| GitHub Actions | .github/workflows/main.yml | Bumped Free Disk Space Action from v1.3.0 → v1.3.1. |
| Wrapper update | workflow/rules/trimming.smk | Updated sra-tools/fasterq-dump wrapper from v5.0.2 → v7.6.0. |
| Path refactor: annotation | workflow/rules/annotation.smk | Updated input/output/log/benchmark templates to use nested {caller} and {group} directories and adjusted annotated/db/dgidb output locations. |
| Path refactor: calling | workflow/rules/calling.smk | Reworked varlociraptor and related rule paths to nest {caller}/{group}; adjusted corresponding logs and benchmarks. |
| Path refactor: candidate calling | workflow/rules/candidate_calling.smk | Moved candidate outputs into results/calls/candidates/{caller}/{group}/...; updated freebayes, delly, fix_delly_calls, filter_offtarget_variants, scatter_candidates inputs/outputs/logs. |
| Path refactor: fusion calling | workflow/rules/fusion_calling.smk | Moved Arriba VCF/BCF outputs into results/calls/candidates/arriba/{sample}/{sample}.* and group concat into results/calls/candidates/arriba/{group}/{group}.bcf. |
| Path refactor: common helpers | workflow/rules/common.smk | Adjusted many path templates to new nested layout; added get_annotate_dgidb_input(wildcards) and get_final_selected_annotation() utility functions and updated callers. |
| Path refactor: filtering & final outputs | workflow/rules/filtering.smk, workflow/rules/mutational_burden.smk, workflow/rules/population.smk | Redirected inputs/outputs/logs to nested results/calls/... and results/final-calls/{group}/...; updated rule input signatures where patterns changed. |
| Path refactor: tables & maf | workflow/rules/datavzrd.smk, workflow/rules/maf.smk, workflow/rules/table.smk | Added nested {group} directory to table/MAF/fusions input and output templates (e.g., results/tables/{group}/{group}.*, results/maf/{group}/{group}.*). |
| Path refactor: testcase / observations & docs | workflow/rules/testcase.smk, workflow/Snakefile | Reordered gather_observations path to results/observations/{caller}/{group}/... and added explanatory header comments in Snakefile; logs adjusted accordingly. |

Sequence Diagram(s)

(omitted — changes are path/template reorganizations; no control‑flow or feature additions to visualize)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay extra attention to workflow/rules/common.smk (new helper functions and their usages).
  • Verify producer/consumer path consistency for {caller} and {group} across calling, candidate, filtering, and final rules.
  • Check log/benchmark paths for CI/monitoring expectations and validate the trimming wrapper bump.
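
One way to approach the producer/consumer check is sketched below: each output template is turned into a regular expression, and every input template is tested against those expressions after filling in example wildcard values. This is an illustrative helper, not part of the workflow. The function names are hypothetical, and the "stale" consumer path is an invented old-style path used only to show a mismatch. The producer path is the Arriba group path listed in the table above.

```python
import re


def template_to_regex(template):
    """Convert a '{wildcard}' template to a regex.

    A wildcard that appears more than once must match the same value each time.
    """
    seen = set()
    pattern = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", template):
        pattern += re.escape(template[pos:m.start()])
        name = m.group(1)
        if name in seen:
            pattern += f"(?P={name})"  # back-reference to the first occurrence
        else:
            pattern += f"(?P<{name}>.+)"
            seen.add(name)
        pos = m.end()
    pattern += re.escape(template[pos:])
    return re.compile(pattern)


def unmatched_inputs(outputs, inputs, example):
    """Return input templates that no output template can produce."""
    regexes = [template_to_regex(t) for t in outputs]
    missing = []
    for template in inputs:
        concrete = template.format(**example)
        if not any(r.fullmatch(concrete) for r in regexes):
            missing.append(template)
    return missing


producers = ["results/calls/candidates/arriba/{group}/{group}.bcf"]
consumers = [
    "results/calls/candidates/arriba/{group}/{group}.bcf",
    "results/candidate-calls/{group}.arriba.bcf",  # hypothetical stale path
]

print(unmatched_inputs(producers, consumers, {"group": "g1"}))
# ['results/candidate-calls/{group}.arriba.bcf']
```

A Snakemake dry run covers the same ground for the configured targets. A template-level check like this is only a quick complement for spotting consumers that still point at the old layout.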

Possibly related PRs

Suggested reviewers

  • johanneskoester
  • dlaehnemann

Poem

🐇
I tunneled folders, neat and light,
Caller and group snug out of sight.
Paths now nest like rows of cheer,
Logs and files hop close and near. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'feat: split output files into subdirectories' accurately reflects the main changes in the PR, which extensively reorganize workflow output paths by introducing nested group-level and caller-specific subdirectories across multiple rule files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5fc05e and 85de9c9.

📒 Files selected for processing (5)
  • .github/workflows/main.yml (1 hunks)
  • workflow/rules/annotation.smk (5 hunks)
  • workflow/rules/common.smk (17 hunks)
  • workflow/rules/datavzrd.smk (3 hunks)
  • workflow/rules/fusion_calling.smk (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/main.yml
  • workflow/rules/datavzrd.smk
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-11-25T12:34:09.018Z
Learnt from: johanneskoester
Repo: snakemake-workflows/dna-seq-varlociraptor PR: 338
File: workflow/rules/candidate_calling.smk:86-87
Timestamp: 2024-11-25T12:34:09.018Z
Learning: In this workflow, the grouping of variant calling rules is intentional, and not all related variant calling rules need to be grouped under "calling".

Applied to files:

  • workflow/rules/fusion_calling.smk
  • workflow/rules/common.smk
  • workflow/rules/annotation.smk
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: test primers
  • GitHub Check: test no candidate filtering
  • GitHub Check: test testcase generation
  • GitHub Check: test target regions, multiple BEDs
  • GitHub Check: test local input
  • GitHub Check: test sra download
  • GitHub Check: test target regions
🔇 Additional comments (26)
workflow/rules/fusion_calling.smk (4)

71-71: LGTM! Path restructuring is consistent.

The output path now nests files under {sample} subdirectories, which aligns with the PR objectives to reduce file clutter.


84-86: LGTM! Input/output paths are correctly chained.

The sort rule correctly references the convert_fusions output path, and the nested directory structure is consistent.


93-93: LGTM! Log path follows the new directory structure.

The log path correctly mirrors the nested {sample}/{sample} pattern used for outputs.


105-105: LGTM! Group-level concatenation path is consistent.

The output path uses the {group}/{group} nesting pattern, consistent with the sample-level nesting above.

workflow/rules/annotation.smk (6)

3-10: LGTM! Candidate annotation paths are consistent.

The input and output paths correctly use the nested {caller}/{group}/{group}.{scatteritem} structure, maintaining consistency across the annotation pipeline.


17-19: LGTM! Log and benchmark paths follow the new structure.

The paths correctly use {group}/{group}.{caller}.{scatteritem} nesting for logs and benchmarks.


29-38: LGTM! Variant annotation pipeline paths are consistent.

The input from varlociraptor and outputs to vep_annotated both use the {group}/{group} nesting pattern consistently.


58-62: LGTM! VCF annotation paths maintain the prefix wildcard.

The paths correctly use the {prefix} wildcard, allowing flexible matching of the nested directory structures introduced by this PR.
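
The reason a single {prefix} wildcard can absorb the new nesting is that Snakemake wildcards match the regex .+ unless constrained, so a value may contain slashes. The sketch below emulates that matching with Python's re module. The template and the example paths are hypothetical, since the actual templates in annotation.smk are not quoted in this review.

```python
import re

# Hypothetical template; the real templates in annotation.smk are not shown here.
TEMPLATE = "results/calls/{prefix}.db-annotated.bcf"

# Emulate an unconstrained wildcard: it matches ".+", which includes "/".
head, tail = TEMPLATE.split("{prefix}")
PATTERN = re.compile(re.escape(head) + r"(?P<prefix>.+)" + re.escape(tail))


def match_prefix(path):
    """Return the value the {prefix} wildcard would take, or None if no match."""
    m = PATTERN.fullmatch(path)
    return m.group("prefix") if m else None


flat = match_prefix("results/calls/g1.freebayes.db-annotated.bcf")
nested = match_prefix("results/calls/vep_annotated/g1/g1.db-annotated.bcf")

print(flat)    # g1.freebayes
print(nested)  # vep_annotated/g1/g1
```

The same flexibility means an unconstrained wildcard can match more than intended, which is worth keeping in mind when several rules share a similar suffix.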


79-83: LGTM! DGIdb annotation uses the new helper function.

The use of get_annotate_dgidb_input abstracts the conditional logic for selecting between db_annotated and vep_annotated paths.


109-111: LGTM! Final gathering step uses nested paths.

The output and log paths correctly use {group}/{group} nesting for the final annotated calls.

workflow/rules/common.smk (16)

185-207: LGTM! Final output paths correctly use group-level nesting.

The paths now use results/final-calls/{group}/{group}.{event}.{calling_type} pattern, which is consistent with the PR's goal of organizing files into subdirectories.


212-221: LGTM! MAF output paths follow the same nesting pattern.

The MAF outputs correctly use the {group}/{group} nesting structure, consistent with the final-calls paths.


226-248: LGTM! Table output paths use consistent nesting.

Both TSV and XLSX table outputs correctly use the {group}/{group} nesting pattern.


261-286: LGTM! Filtered and control FDR paths are consistent.

The filtered paths now include {group}/{event}/{group} nesting, and the control FDR input paths correctly reference the new structure.


596-599: LGTM! Arriba candidates path matches fusion_calling.smk.

The path results/calls/candidates/arriba/{sample}/{sample}.bcf is consistent with the changes in fusion_calling.smk (lines 86, 105).


618-633: LGTM! Observation paths use caller/group/sample nesting.

The nested structure results/observations/{caller}/{group}/{sample} provides clear organization of observation files by caller and group.


707-712: LGTM! Scattered calls path uses appropriate nesting.

The path results/calls/varlociraptor/{caller}/{group}/{group}.{scatteritem} correctly organizes scattered calls by caller and group.


714-721: LGTM! Helper function correctly abstracts annotation path selection.

The get_annotate_dgidb_input function cleanly encapsulates the logic for selecting between db_annotated and vep_annotated paths based on configuration.


724-730: LGTM! Annotation selection logic is clear and correct.

The get_final_selected_annotation function correctly implements the precedence: vep_annotated (default) → db_annotated (if VCF annotations active) → dgidb_annotated (if DGIdb active).
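
The selection logic described for the two helpers can be sketched as follows. This is not the code from common.smk: the real functions read the workflow configuration (and get_annotate_dgidb_input takes wildcards), whereas the function names and boolean parameters here are hypothetical stand-ins used to show the described precedence in isolation.

```python
def select_final_annotation(vcf_annotations_active: bool, dgidb_active: bool) -> str:
    """Pick the last annotation stage that is enabled.

    Precedence as described in the review: vep_annotated by default,
    db_annotated if VCF annotations are active, dgidb_annotated if DGIdb is active.
    """
    annotation = "vep_annotated"
    if vcf_annotations_active:
        annotation = "db_annotated"
    if dgidb_active:
        annotation = "dgidb_annotated"
    return annotation


def select_dgidb_input(vcf_annotations_active: bool) -> str:
    """DGIdb annotation consumes db_annotated if it exists, else vep_annotated."""
    return "db_annotated" if vcf_annotations_active else "vep_annotated"


print(select_final_annotation(False, False))  # vep_annotated
print(select_final_annotation(True, False))   # db_annotated
print(select_final_annotation(True, True))    # dgidb_annotated
print(select_dgidb_input(False))              # vep_annotated
```

Keeping the stage choice in one place means downstream path builders only need the returned directory name, rather than repeating the configuration checks.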


735-745: LGTM! Annotated BCF path construction uses the new helper.

The function correctly calls get_final_selected_annotation() for variants and falls back to varlociraptor for fusions, constructing paths with the {group}/{group} nesting.


748-759: LGTM! Gather function uses consistent annotation selection.

The get_gather_annotated_calls_input function correctly uses get_final_selected_annotation() and constructs paths with proper nesting.


762-767: LGTM! Candidate calls paths handle filtered and unfiltered cases.

The function correctly returns paths with {caller}/{group}/{group} nesting, supporting both filtered and unfiltered candidates.


808-813: LGTM! Merge calls input uses group/event nesting.

The path results/calls/fdr-controlled/{group}/{event}/{group} provides clear organization of FDR-controlled calls.


878-887: LGTM! Fixed candidate calls paths are consistent.

The function handles both Delly's special case (no_bnds) and the general case with consistent {group}/{group} nesting.


1510-1520: LGTM! Datavzrd data paths use group-level nesting.

The table paths correctly use results/tables/{group}/{group}.{event} nesting, consistent with other table outputs.


1526-1529: LGTM! Oncoprint input paths are consistent.

The table paths for oncoprint input correctly use the {group}/{group} nesting pattern.

@FelixMoelder marked this pull request as ready for review on November 10, 2025, 08:45
@FelixMoelder changed the title from "feat: divide output files into subdirectories" to "feat: split output files into subdirectories" on Nov 10, 2025
@johanneskoester johanneskoester left a comment

Can you add a description of the naming philosophy to the Snakefile? As a comment? Basically what can be found where.
