feat: split output files into subdirectories #398

FelixMoelder · 2025-10-31T11:08:34Z

In its current configuration, the workflow writes a large number of files into single folders (see #397). This mainly affects the results/calls, results/candidate-calls and results/final-calls directories.
To reduce the number of files per folder, the results/calls directory will contain a new subdirectory for each processing step of the BCF files.
In addition, there will be a subdirectory for each group, as scatteritems still lead to a lot of files.
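
To illustrate the effect, here is a minimal sketch that counts files per directory for a flat and a nested template. The group names, caller names, number of scatteritems, the flat template and the `.bcf` extension are all assumptions made for illustration; only the nested directory pattern follows the layout described in this PR.

```python
from collections import Counter
from itertools import product
from pathlib import PurePosixPath

# Hypothetical wildcard values, chosen only for illustration.
groups = ["groupA", "groupB", "groupC"]
callers = ["freebayes", "delly"]
scatteritems = [f"{i}-of-4" for i in range(1, 5)]

# Assumed flat template (the real pre-PR file names may differ).
flat = "results/calls/{group}.{caller}.{scatteritem}.bcf"
# Nested pattern as described in this PR; the .bcf extension is an assumption.
nested = "results/calls/varlociraptor/{caller}/{group}/{group}.{scatteritem}.bcf"


def files_per_directory(template):
    """Count how many rendered files end up in each parent directory."""
    counts = Counter()
    for group, caller, item in product(groups, callers, scatteritems):
        path = PurePosixPath(
            template.format(group=group, caller=caller, scatteritem=item)
        )
        counts[str(path.parent)] += 1
    return counts


flat_counts = files_per_directory(flat)
nested_counts = files_per_directory(nested)
print("flat:  ", len(flat_counts), "directory, max files:", max(flat_counts.values()))
print("nested:", len(nested_counts), "directories, max files:", max(nested_counts.values()))
```

With these toy values, the flat layout puts every file into one directory, while the nested layout only keeps the scatteritems of one caller and group together.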

Summary by CodeRabbit

Chores
- Reorganized workflow outputs, inputs and logs into a deeper group- and caller-scoped directory layout (affects where results and logs are written/read).
- Updated Free Disk Space GitHub Action to v1.3.1.
- Updated SRA data-retrieval tool wrapper to a newer version.
Documentation
- Added top-level workflow header comments describing produced file organization.

coderabbitai · 2025-10-31T11:08:44Z

Walkthrough

Restructured Snakemake file and log paths to nest {caller} and {group} directories across many rules, added two helper functions, bumped the Free Disk Space GitHub Action version, and updated the sra-tools/fasterq-dump wrapper. No algorithmic or control-flow changes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| GitHub Actions: `.github/workflows/main.yml` | Bumped Free Disk Space Action from `v1.3.0` → `v1.3.1`. |
| Wrapper update: `workflow/rules/trimming.smk` | Updated sra-tools/fasterq-dump wrapper from `v5.0.2` → `v7.6.0`. |
| Path refactor, annotation: `workflow/rules/annotation.smk` | Updated input/output/log/benchmark templates to use nested `{caller}` and `{group}` directories and adjusted annotated/db/dgidb output locations. |
| Path refactor, calling: `workflow/rules/calling.smk` | Reworked varlociraptor and related rule paths to nest `{caller}/{group}`; adjusted corresponding logs and benchmarks. |
| Path refactor, candidate calling: `workflow/rules/candidate_calling.smk` | Moved candidate outputs into `results/calls/candidates/{caller}/{group}/...`; updated freebayes, delly, fix_delly_calls, filter_offtarget_variants, scatter_candidates inputs/outputs/logs. |
| Path refactor, fusion calling: `workflow/rules/fusion_calling.smk` | Moved Arriba VCF/BCF outputs into `results/calls/candidates/arriba/{sample}/{sample}.*` and group concat into `results/calls/candidates/arriba/{group}/{group}.bcf`. |
| Path refactor, common helpers: `workflow/rules/common.smk` | Adjusted many path templates to new nested layout; added `get_annotate_dgidb_input(wildcards)` and `get_final_selected_annotation()` utility functions and updated callers. |
| Path refactor, filtering & final outputs: `workflow/rules/filtering.smk`, `workflow/rules/mutational_burden.smk`, `workflow/rules/population.smk` | Redirected inputs/outputs/logs to nested `results/calls/...` and `results/final-calls/{group}/...`; updated rule input signatures where patterns changed. |
| Path refactor, tables & maf: `workflow/rules/datavzrd.smk`, `workflow/rules/maf.smk`, `workflow/rules/table.smk` | Added nested `{group}` directory to table/MAF/fusions input and output templates (e.g., `results/tables/{group}/{group}.`, `results/maf/{group}/{group}.`). |
| Path refactor, testcase / observations & docs: `workflow/rules/testcase.smk`, `workflow/Snakefile` | Reordered gather_observations path to `results/observations/{caller}/{group}/...` and added explanatory header comments in `Snakefile`; logs adjusted accordingly. |

Sequence Diagram(s)

(omitted — changes are path/template reorganizations; no control‑flow or feature additions to visualize)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pay extra attention to workflow/rules/common.smk (new helper functions and their usages).
Verify producer/consumer path consistency for {caller} and {group} across calling, candidate, filtering, and final rules.
Check log/benchmark paths for CI/monitoring expectations and validate the trimming wrapper bump.
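
One way to approach the producer/consumer check above is a small script that compares input templates against output templates. This is only a sketch under stated assumptions: the rule name `varlociraptor_call`, the assignment of templates to rules, the file extensions and the stale flat path are hypothetical, and a plain string comparison ignores input functions and wildcard constraints.

```python
# Hypothetical output templates, keyed by producing rule.
producers = {
    "scatter_candidates": "results/calls/candidates/{caller}/{group}/{group}.{scatteritem}.bcf",
    "varlociraptor_call": "results/calls/varlociraptor/{caller}/{group}/{group}.{scatteritem}.bcf",
}

# Hypothetical input templates, keyed by consuming rule.
consumers = {
    "annotate_candidate_variants": "results/calls/candidates/{caller}/{group}/{group}.{scatteritem}.bcf",
    # A stale flat path that was not migrated to the nested layout.
    "annotate_variants": "results/calls/{group}.{caller}.{scatteritem}.bcf",
}


def find_unmatched_inputs(producers, consumers):
    """Return consumer templates that no producer emits verbatim."""
    produced = set(producers.values())
    return {rule: tpl for rule, tpl in consumers.items() if tpl not in produced}


unmatched = find_unmatched_inputs(producers, consumers)
for rule, template in unmatched.items():
    print(f"{rule}: no producer for {template}")
```

In practice a Snakemake dry run remains the authoritative check; a script like this only helps to spot leftover flat paths quickly.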

Possibly related PRs

perf: update vep wrappers #401 — touches annotation rules (annotate_candidate_variants / annotate_variants); related to annotation path/layout changes.
perf: group calling jobs per patient/sample-group such that they are submitted to the same cluster/cloud nodes in order to save I/O #338 — modifies calling and trimming wrappers; related to calling-path restructures and wrapper updates.
fix: fix excluding events from the report #332 — adjusts common path-building helpers; related to common.smk helper-function changes.

Suggested reviewers

johanneskoester
dlaehnemann

Poem

🐇
I tunneled folders, neat and light,
Caller and group snug out of sight.
Paths now nest like rows of cheer,
Logs and files hop close and near. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'feat: split output files into subdirectories' accurately reflects the main changes in the PR, which extensively reorganize workflow output paths by introducing nested group-level and caller-specific subdirectories across multiple rule files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5fc05e and 85de9c9.

📒 Files selected for processing (5)

.github/workflows/main.yml (1 hunks)
workflow/rules/annotation.smk (5 hunks)
workflow/rules/common.smk (17 hunks)
workflow/rules/datavzrd.smk (3 hunks)
workflow/rules/fusion_calling.smk (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

.github/workflows/main.yml
workflow/rules/datavzrd.smk

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2024-11-25T12:34:09.018Z

Learnt from: johanneskoester
Repo: snakemake-workflows/dna-seq-varlociraptor PR: 338
File: workflow/rules/candidate_calling.smk:86-87
Timestamp: 2024-11-25T12:34:09.018Z
Learning: In this workflow, the grouping of variant calling rules is intentional, and not all related variant calling rules need to be grouped under "calling".

Applied to files:

workflow/rules/fusion_calling.smk
workflow/rules/common.smk
workflow/rules/annotation.smk

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: test primers
GitHub Check: test no candidate filtering
GitHub Check: test testcase generation
GitHub Check: test target regions, multiple BEDs
GitHub Check: test local input
GitHub Check: test sra download
GitHub Check: test target regions

🔇 Additional comments (26)

workflow/rules/fusion_calling.smk (4)

71-71: LGTM! Path restructuring is consistent.

The output path now nests files under {sample} subdirectories, which aligns with the PR objectives to reduce file clutter.

84-86: LGTM! Input/output paths are correctly chained.

The sort rule correctly references the convert_fusions output path, and the nested directory structure is consistent.

93-93: LGTM! Log path follows the new directory structure.

The log path correctly mirrors the nested {sample}/{sample} pattern used for outputs.

105-105: LGTM! Group-level concatenation path is consistent.

The output path uses the {group}/{group} nesting pattern, consistent with the sample-level nesting above.

workflow/rules/annotation.smk (6)

3-10: LGTM! Candidate annotation paths are consistent.

The input and output paths correctly use the nested {caller}/{group}/{group}.{scatteritem} structure, maintaining consistency across the annotation pipeline.

17-19: LGTM! Log and benchmark paths follow the new structure.

The paths correctly use {group}/{group}.{caller}.{scatteritem} nesting for logs and benchmarks.

29-38: LGTM! Variant annotation pipeline paths are consistent.

The input from varlociraptor and outputs to vep_annotated both use the {group}/{group} nesting pattern consistently.

58-62: LGTM! VCF annotation paths maintain the prefix wildcard.

The paths correctly use the {prefix} wildcard, allowing flexible matching of the nested directory structures introduced by this PR.
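
As a rough illustration of why a `{prefix}` wildcard keeps working after the move: Snakemake wildcards match the regular expression `.+` by default, so they can span directory separators. The sketch below mimics that matching with Python's `re` module. The template `{prefix}.db-annotated.bcf` and the example path are assumptions, and the helper only handles templates in which each wildcard name occurs once.

```python
import re


def template_to_regex(template):
    """Translate a wildcard template into a regex, using `.+` per wildcard.

    Simplification: each wildcard name may appear only once in the template.
    """
    pattern = ""
    # re.split with a capturing group yields literals at even indices
    # and wildcard names at odd indices.
    for index, part in enumerate(re.split(r"\{(\w+)\}", template)):
        if index % 2 == 0:
            pattern += re.escape(part)
        else:
            pattern += f"(?P<{part}>.+)"
    return re.compile(pattern + "$")


# Hypothetical template and nested target path.
regex = template_to_regex("{prefix}.db-annotated.bcf")
match = regex.match("results/calls/vep_annotated/groupA/groupA.1-of-4.db-annotated.bcf")
print(match.group("prefix"))
```

The prefix absorbs the whole nested directory part, which is why rules written against `{prefix}` need no change when directories are added.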

79-83: LGTM! DGIdb annotation uses the new helper function.

The use of get_annotate_dgidb_input abstracts the conditional logic for selecting between db_annotated and vep_annotated paths.

109-111: LGTM! Final gathering step uses nested paths.

The output and log paths correctly use {group}/{group} nesting for the final annotated calls.

workflow/rules/common.smk (16)

185-207: LGTM! Final output paths correctly use group-level nesting.

The paths now use results/final-calls/{group}/{group}.{event}.{calling_type} pattern, which is consistent with the PR's goal of organizing files into subdirectories.

212-221: LGTM! MAF output paths follow the same nesting pattern.

The MAF outputs correctly use the {group}/{group} nesting structure, consistent with the final-calls paths.

226-248: LGTM! Table output paths use consistent nesting.

Both TSV and XLSX table outputs correctly use the {group}/{group} nesting pattern.

261-286: LGTM! Filtered and control FDR paths are consistent.

The filtered paths now include {group}/{event}/{group} nesting, and the control FDR input paths correctly reference the new structure.

596-599: LGTM! Arriba candidates path matches fusion_calling.smk.

The path results/calls/candidates/arriba/{sample}/{sample}.bcf is consistent with the changes in fusion_calling.smk (lines 86, 105).

618-633: LGTM! Observation paths use caller/group/sample nesting.

The nested structure results/observations/{caller}/{group}/{sample} provides clear organization of observation files by caller and group.

707-712: LGTM! Scattered calls path uses appropriate nesting.

The path results/calls/varlociraptor/{caller}/{group}/{group}.{scatteritem} correctly organizes scattered calls by caller and group.

714-721: LGTM! Helper function correctly abstracts annotation path selection.

The get_annotate_dgidb_input function cleanly encapsulates the logic for selecting between db_annotated and vep_annotated paths based on configuration.

724-730: LGTM! Annotation selection logic is clear and correct.

The get_final_selected_annotation function correctly implements the precedence: vep_annotated (default) → db_annotated (if VCF annotations active) → dgidb_annotated (if DGIdb active).
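
The behaviour described for the two helpers can be sketched as follows. This is not the code from `common.smk`: the config keys (`annotations`, `vcfs`, `dgidb`, `activate`), the explicit `config` parameter and the returned path template are assumptions for illustration. In the workflow, `get_annotate_dgidb_input` takes `wildcards` and `get_final_selected_annotation` takes no arguments.

```python
def get_final_selected_annotation(config):
    """Pick the last active annotation stage; later stages take precedence."""
    annotations = config["annotations"]
    selection = "vep_annotated"  # default
    if annotations["vcfs"]["activate"]:
        selection = "db_annotated"
    if annotations["dgidb"]["activate"]:
        selection = "dgidb_annotated"
    return selection


def get_annotate_dgidb_input(config):
    """Choose the DGIdb input: db_annotated if VCF annotation is active, else vep_annotated."""
    source = (
        "db_annotated"
        if config["annotations"]["vcfs"]["activate"]
        else "vep_annotated"
    )
    # Hypothetical template; doubled braces keep the wildcards unexpanded.
    return f"results/calls/{source}/{{group}}/{{group}}.{{scatteritem}}.bcf"


# Hypothetical configuration with only VCF annotations enabled.
example_config = {"annotations": {"vcfs": {"activate": True}, "dgidb": {"activate": False}}}
print(get_final_selected_annotation(example_config))
print(get_annotate_dgidb_input(example_config))
```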

735-745: LGTM! Annotated BCF path construction uses the new helper.

The function correctly calls get_final_selected_annotation() for variants and falls back to varlociraptor for fusions, constructing paths with the {group}/{group} nesting.

748-759: LGTM! Gather function uses consistent annotation selection.

The get_gather_annotated_calls_input function correctly uses get_final_selected_annotation() and constructs paths with proper nesting.

762-767: LGTM! Candidate calls paths handle filtered and unfiltered cases.

The function correctly returns paths with {caller}/{group}/{group} nesting, supporting both filtered and unfiltered candidates.

808-813: LGTM! Merge calls input uses group/event nesting.

The path results/calls/fdr-controlled/{group}/{event}/{group} provides clear organization of FDR-controlled calls.

878-887: LGTM! Fixed candidate calls paths are consistent.

The function handles both Delly's special case (no_bnds) and the general case with consistent {group}/{group} nesting.

1510-1520: LGTM! Datavzrd data paths use group-level nesting.

The table paths correctly use results/tables/{group}/{group}.{event} nesting, consistent with other table outputs.

1526-1529: LGTM! Oncoprint input paths are consistent.

The table paths for oncoprint input correctly use the {group}/{group} nesting pattern.


johanneskoester

Can you add a description of the naming philosophy to the Snakefile? As a comment? Basically what can be found where.

Added comments to clarify file structure for BCF files.

FelixMoelder added 2 commits October 31, 2025 10:44

feat: branch output dirs

dcba13c

adjust input

b3be189

FelixMoelder added 9 commits October 31, 2025 11:30

add missing brace

5e21287

fix maf input

df70702

update sra tools

fb461c4

Update free-disk-space action to version 1.3.1

63dd5c0

Merge branch 'master' into feat/branch_output

93d5653

redirect calling, annotation and filtering output

2b4eacb

Merge branch 'feat/branch_output' of github.com:snakemake-workflows/dna-seq-varlociraptor into feat/branch_output

5809f0b

change log path

8f0a5fd

fix testcase input

1c875db

FelixMoelder marked this pull request as ready for review November 10, 2025 08:45

FelixMoelder requested a review from johanneskoester November 10, 2025 08:45

split up filtering results

57e3e4c

FelixMoelder changed the title ~~feat: divide output files into subdirectories~~ feat: split output files into subdirectories Nov 10, 2025

johanneskoester requested changes Nov 10, 2025

FelixMoelder added 2 commits November 17, 2025 09:05

Document BCF file structure in Snakefile

e5fc05e

Added comments to clarify file structure for BCF files.

Merge branch 'master' into feat/branch_output

85de9c9

FelixMoelder requested a review from johanneskoester November 19, 2025 08:01
