Commit 0d2c92a

feat: automate inference of index name (#1169)
### Description

Added automated inference of the index name, and an option to call `MarkDuplicatesWithMateCigar` instead of the default `MarkDuplicates`, since both tools are quite similar. These changes render the wrapper `bio/picard/markduplicateswithmatecigar` redundant.

### QC

* [x] I confirm that:

For all wrappers added by this PR,

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed arbitrarily,
* either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tool's purpose or name (e.g., `map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071),
* wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool,
* temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works),
* the `meta.yaml` contains a link to the documentation of the respective tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with [black](https://black.readthedocs.io),
* Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults), as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays.

---------

Co-authored-by: David Laehnemann <[email protected]>
1 parent 00b9b1c commit 0d2c92a

14 files changed: +89 -110 lines changed

Binary file not shown.
Binary file not shown.

bio/picard/markduplicates/environment.yaml

Lines changed: 2 additions & 2 deletions
@@ -3,6 +3,6 @@ channels:
   - bioconda
   - nodefaults
 dependencies:
-  - picard =2.27.4
+  - picard =3.0.0
   - samtools =1.16.1
-  - snakemake-wrapper-utils =0.5.2
+  - snakemake-wrapper-utils =0.5.3

bio/picard/markduplicates/meta.yaml

Lines changed: 8 additions & 4 deletions
@@ -1,15 +1,19 @@
 name: picard MarkDuplicates
 description: |
-  Mark PCR and optical duplicates with picard tools. For more information about MarkDuplicates see `picard documentation <https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates>`_.
+  Mark PCR and optical duplicates with picard tools.
+url: https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
 authors:
   - Johannes Köster
   - Christopher Schröder
+  - Filipe G. Vieira
 input:
   - bam/cram file(s)
 output:
   - bam/cram file with marked or removed duplicates
+params:
+  - java_opts: allows for additional arguments to be passed to the java compiler, e.g. "-XX:ParallelGCThreads=10" (not for `-XmX` or `-Djava.io.tmpdir`, since they are handled automatically).
+  - extra: allows for additional program arguments.
+  - embed_ref: allows to embed the fasta reference into the cram
+  - withmatecigar: allows to run `MarkDuplicatesWithMateCigar <https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicatesWithMateCigar>`_ instead.
 notes: |
-  * The `java_opts` param allows for additional arguments to be passed to the java compiler, e.g. "-XX:ParallelGCThreads=10" (not for `-XmX` or `-Djava.io.tmpdir`, since they are handled automatically).
-  * The `extra` param allows for additional program arguments.
   * `--TMP_DIR` is automatically set by `resources.tmpdir`
-  * For more information see, https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
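The new `withmatecigar` param decides which Picard tool the wrapper calls. A minimal sketch of that choice, assuming a plain `dict` stands in for `snakemake.params` (the helper name `select_tool` is hypothetical and does not appear in the wrapper):

```python
def select_tool(params: dict) -> str:
    """Return the Picard tool name implied by the rule's params.

    `params` is a plain dict used here as a stand-in for snakemake.params;
    a missing `withmatecigar` key is treated as False.
    """
    if params.get("withmatecigar", False):
        return "MarkDuplicatesWithMateCigar"
    return "MarkDuplicates"


if __name__ == "__main__":
    for params in ({}, {"withmatecigar": False}, {"withmatecigar": True}):
        print(params, "->", select_tool(params))
```

Because the param defaults to `False`, existing rules that never set it keep calling `MarkDuplicates`.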
Lines changed: 32 additions & 19 deletions
@@ -1,14 +1,14 @@
-rule mark_duplicates:
+rule markduplicates_bam:
     input:
         bams="mapped/{sample}.bam",
     # optional to specify a list of BAMs; this has the same effect
     # of marking duplicates on separate read groups for a sample
     # and then merging
     output:
-        bam="dedup/{sample}.bam",
-        metrics="dedup/{sample}.metrics.txt",
+        bam="dedup_bam/{sample}.bam",
+        metrics="dedup_bam/{sample}.metrics.txt",
     log:
-        "logs/picard/dedup/{sample}.log",
+        "logs/dedup_bam/{sample}.log",
     params:
         extra="--REMOVE_DUPLICATES true",
     # optional specification of memory usage of the JVM that snakemake will respect with global
@@ -21,26 +21,39 @@ rule mark_duplicates:
         "master/bio/picard/markduplicates"


-rule mark_duplicates_cram:
+use rule markduplicates_bam as markduplicateswithmatecigar_bam with:
+    output:
+        bam="dedup_bam/{sample}.matecigar.bam",
+        idx="dedup_bam/{sample}.mcigar.bai",
+        metrics="dedup_bam/{sample}.matecigar.metrics.txt",
+    log:
+        "logs/dedup_bam/{sample}.matecigar.log",
+    params:
+        withmatecigar=True,
+        extra="--REMOVE_DUPLICATES true",
+
+
+use rule markduplicates_bam as markduplicates_sam with:
+    output:
+        bam="dedup_sam/{sample}.sam",
+        metrics="dedup_sam/{sample}.metrics.txt",
+    log:
+        "logs/dedup_sam/{sample}.log",
+    params:
+        extra="--REMOVE_DUPLICATES true",
+
+
+use rule markduplicates_bam as markduplicates_cram with:
     input:
         bams="mapped/{sample}.bam",
         ref="ref/genome.fasta",
-    # optional to specify a list of BAMs; this has the same effect
-    # of marking duplicates on separate read groups for a sample
-    # and then merging
     output:
-        bam="dedup/{sample}.cram",
-        metrics="dedup/{sample}.metrics.txt",
+        bam="dedup_cram/{sample}.cram",
+        idx="dedup_cram/{sample}.cram.crai",
+        metrics="dedup_cram/{sample}.metrics.txt",
     log:
-        "logs/picard/dedup/{sample}.log",
+        "logs/dedup_cram/{sample}.log",
     params:
         extra="--REMOVE_DUPLICATES true",
         embed_ref=True,  # set true if the fasta reference should be embedded into the cram
-    # optional specification of memory usage of the JVM that snakemake will respect with global
-    # resource restrictions (https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#resources)
-    # and which can be used to request RAM during cluster job submission as `{resources.mem_mb}`:
-    # https://snakemake.readthedocs.io/en/latest/executing/cluster.html#job-properties
-    resources:
-        mem_mb=1024,
-    wrapper:
-        "master/bio/picard/markduplicates"
+        withmatecigar=False,
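The test rules above cover four output shapes: BAM without an index, BAM with an `idx` output, SAM, and CRAM. A rough sketch of how the wrapper treats each one, under some assumptions: `guess_format` is a hypothetical extension-based stand-in for `infer_out_format` (whose internals are not shown in this commit), `plan_output` is a hypothetical helper, and `samtools_opts` is passed in as an opaque string because the value returned by `get_samtools_opts` is not shown either.

```python
from pathlib import Path


def guess_format(path: str) -> str:
    """Hypothetical stand-in for infer_out_format: map the extension to a format."""
    formats = {".sam": "SAM", ".bam": "BAM", ".cram": "CRAM"}
    return formats.get(Path(path).suffix.lower(), "UNKNOWN")


def plan_output(bam: str, idx=None, embed_ref: bool = False, samtools_opts: str = "") -> dict:
    """Describe what the wrapper would do for a given `bam` (and optional `idx`) output."""
    fmt = guess_format(bam)
    plan = {"format": fmt, "picard_output": bam, "extra": "", "convert": ""}
    if fmt == "CRAM":
        # Picard writes to stdout and samtools view performs the CRAM conversion.
        plan["picard_output"] = "/dev/stdout"
        if embed_ref:
            samtools_opts += ",embed_ref"
        plan["convert"] = f" | samtools view {samtools_opts}"
    elif fmt == "BAM" and idx:
        # An index is only requested from Picard when the rule declares one.
        plan["extra"] = " --CREATE_INDEX"
    return plan


if __name__ == "__main__":
    print(plan_output("dedup_bam/s1.bam"))
    print(plan_output("dedup_bam/s1.matecigar.bam", idx="dedup_bam/s1.mcigar.bai"))
    print(plan_output("dedup_sam/s1.sam"))
    # "<samtools opts>" is a placeholder, not the real option string.
    print(plan_output("dedup_cram/s1.cram", embed_ref=True, samtools_opts="<samtools opts>"))
```

Only the CRAM case pipes through `samtools view`; SAM and index-less BAM outputs are written by Picard directly, with no extra flags.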

bio/picard/markduplicates/wrapper.py

Lines changed: 32 additions & 12 deletions
@@ -5,40 +5,60 @@


 import tempfile
+from pathlib import Path
 from snakemake.shell import shell
 from snakemake_wrapper_utils.java import get_java_opts
+from snakemake_wrapper_utils.samtools import get_samtools_opts, infer_out_format

-log = snakemake.log_fmt_shell()

+log = snakemake.log_fmt_shell()
 extra = snakemake.params.get("extra", "")
 # the --SORTING_COLLECTION_SIZE_RATIO default of 0.25 might
 # indicate a JVM memory overhead of that proportion
 java_opts = get_java_opts(snakemake, java_mem_overhead_factor=0.3)
+samtools_opts = get_samtools_opts(snakemake)
+
+
+tool = "MarkDuplicates"
+if snakemake.params.get("withmatecigar", False):
+    tool = "MarkDuplicatesWithMateCigar"
+

 bams = snakemake.input.bams
 if isinstance(bams, str):
     bams = [bams]
 bams = list(map("--INPUT {}".format, bams))

-if snakemake.output.bam.endswith(".cram"):
+
+output = snakemake.output.bam
+output_fmt = infer_out_format(output)
+convert = ""
+if output_fmt == "CRAM":
     output = "/dev/stdout"
-    if snakemake.params.embed_ref:
-        view_options = "-O cram,embed_ref"
-    else:
-        view_options = "-O cram"
-    convert = f" | samtools view -@ {snakemake.threads} {view_options} --reference {snakemake.input.ref} -o {snakemake.output.bam}"
-else:
-    output = snakemake.output.bam
-    convert = ""
+
+    # NOTE: output format inference should be done by snakemake-wrapper-utils. Keeping it here for backwards compatibility.
+    if snakemake.params.get("embed_ref", False):
+        samtools_opts += ",embed_ref"
+
+    convert = f" | samtools view {samtools_opts}"
+elif output_fmt == "BAM" and snakemake.output.get("idx"):
+    extra += " --CREATE_INDEX"
+

 with tempfile.TemporaryDirectory() as tmpdir:
     shell(
-        "(picard MarkDuplicates"  # Tool and its subcommand
+        "(picard {tool}"  # Tool and its subcommand
         " {java_opts}"  # Automatic java option
         " {extra}"  # User defined parmeters
         " {bams}"  # Input bam(s)
         " --TMP_DIR {tmpdir}"
         " --OUTPUT {output}"  # Output bam
         " --METRICS_FILE {snakemake.output.metrics}"  # Output metrics
-        " {convert} ) {log}"  # Logging
+        " {convert}) {log}"  # Logging
     )
+
+
+output_prefix = Path(snakemake.output.bam).with_suffix("")
+if snakemake.output.get("idx"):
+    if output_fmt == "BAM" and snakemake.output.idx != str(output_prefix) + ".bai":
+        shell("mv {output_prefix}.bai {snakemake.output.idx}")
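The final block of the wrapper renames the index when the rule asks for a name other than the one the `mv` step expects Picard to have written, namely the BAM path with its last suffix replaced by `.bai`. A minimal sketch of that decision as a pure function, assuming plain strings stand in for `snakemake.output.bam` and `snakemake.output.idx` (the helper name `index_to_move` is hypothetical):

```python
from pathlib import Path
from typing import Optional


def index_to_move(bam: str, idx: Optional[str], fmt: str) -> Optional[str]:
    """Return the index path that would need renaming to `idx`, or None.

    Mirrors the wrapper's check: only BAM outputs with a declared `idx`
    are considered, and a move is needed only when `idx` differs from
    "<bam without its last suffix>.bai".
    """
    if not idx or fmt != "BAM":
        return None
    default_idx = str(Path(bam).with_suffix("")) + ".bai"
    return default_idx if idx != default_idx else None


if __name__ == "__main__":
    # Same shape as the markduplicateswithmatecigar_bam test rule: a move is needed.
    print(index_to_move("s1.matecigar.bam", "s1.mcigar.bai", "BAM"))
    # The requested name already matches the default: nothing to move.
    print(index_to_move("s1.bam", "s1.bai", "BAM"))
```

Note that `with_suffix("")` strips only the last suffix, so `s1.matecigar.bam` yields the prefix `s1.matecigar`, which is why the test rule's `.mcigar.bai` output exercises the rename path.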

bio/picard/markduplicateswithmatecigar/environment.yaml

Lines changed: 0 additions & 7 deletions
This file was deleted.

bio/picard/markduplicateswithmatecigar/meta.yaml

Lines changed: 0 additions & 15 deletions
This file was deleted.

bio/picard/markduplicateswithmatecigar/test/Snakefile

Lines changed: 0 additions & 18 deletions
This file was deleted.
