Commit 2348055

Merge pull request #1 from MPUSP/dev
feat: complete minimal workflow as template
2 parents 4d7b3a2 + 776b97e commit 2348055

20 files changed: +692, -11 lines changed

.github/workflows/main.yml

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ jobs:
         with:
           directory: .test
           snakefile: workflow/Snakefile
-          args: "--use-conda --show-failed-logs --cores 3 --conda-cleanup-pkgs cache --all-temp"
+          args: "--use-conda --show-failed-logs --cores 3 --conda-cleanup-pkgs cache"
 
       - name: Test report
         uses: snakemake/[email protected]

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ resources/**
 logs/**
 .snakemake
 .snakemake/**
+.test/results/*

.test/config/config.yml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+samplesheet: "config/samples.tsv"
+
+get_genome:
+  database: "ncbi"
+  assembly: "GCF_000006785.2"
+  fasta: Null
+  gff: Null
+  gff_source_type:
+    [
+      "RefSeq": "gene",
+      "RefSeq": "pseudogene",
+      "RefSeq": "CDS",
+      "Protein Homology": "CDS",
+    ]
+
+simulate_reads:
+  read_length: 100
+  read_number: 100000
+  random_freq: 0.01
+
+cutadapt:
+  threep_adapter: "-a ATCGTAGATCGG"
+  fivep_adapter: "-A GATGGCGATAGG"
+  default: ["-q 10 ", "-m 25 ", "-M 100", "--overlap=5"]
+
+multiqc:
+  config: "config/multiqc_config.yml"
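
The `cutadapt` section stores each option as a ready-made command-line fragment, and some of the `default` entries carry trailing spaces. The commit does not show how the workflow rule combines these values, so the following is only a minimal sketch of one way to flatten them into a clean argument list. The function name `build_cutadapt_args` is hypothetical; the option strings are copied from the config above.

```python
import shlex

# Values copied from the config in this commit. How the workflow's rule
# actually combines them is an assumption made for illustration.
config = {
    "cutadapt": {
        "threep_adapter": "-a ATCGTAGATCGG",
        "fivep_adapter": "-A GATGGCGATAGG",
        "default": ["-q 10 ", "-m 25 ", "-M 100", "--overlap=5"],
    }
}


def build_cutadapt_args(cfg):
    """Flatten the adapter and default options into one argument list."""
    section = cfg["cutadapt"]
    fragments = [
        section["threep_adapter"],
        section["fivep_adapter"],
        *section["default"],
    ]
    args = []
    for fragment in fragments:
        # shlex.split tokenizes each fragment and discards the trailing spaces
        args.extend(shlex.split(fragment))
    return args


cutadapt_args = build_cutadapt_args(config)
print(" ".join(cutadapt_args))
```

Splitting each fragment into tokens means the trailing spaces in entries such as `"-q 10 "` no longer matter, and the resulting list can be passed to a process launcher without shell quoting.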

.test/config/multiqc_config.yml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+remove_sections:
+  - samtools-stats

.test/config/samples.tsv

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+sample condition replicate read1 read2
+sample1 wild_type 1 sample1.bwa.read1.fastq.gz sample1.bwa.read2.fastq.gz
+sample2 wild_type 2 sample2.bwa.read1.fastq.gz sample2.bwa.read2.fastq.gz
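
The first step of the workflow parses this sample sheet. The parsing code is not part of the diff shown here, so the following is a minimal sketch using only the standard library. It assumes the columns are tab-separated (suggested by the `.tsv` extension) and that sample names must be unique; `parse_samples` and `REQUIRED_COLUMNS` are hypothetical names.

```python
import csv
import io

# The three rows from .test/config/samples.tsv, written with explicit tabs.
# Tab separation is an assumption based on the file extension.
SAMPLE_SHEET = (
    "sample\tcondition\treplicate\tread1\tread2\n"
    "sample1\twild_type\t1\tsample1.bwa.read1.fastq.gz\tsample1.bwa.read2.fastq.gz\n"
    "sample2\twild_type\t2\tsample2.bwa.read1.fastq.gz\tsample2.bwa.read2.fastq.gz\n"
)

REQUIRED_COLUMNS = ("sample", "condition", "replicate", "read1", "read2")


def parse_samples(handle):
    """Return a dict mapping each sample name to its row of metadata."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [col for col in REQUIRED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"sample sheet is missing columns: {missing}")
    samples = {}
    for row in reader:
        if row["sample"] in samples:
            raise ValueError(f"duplicate sample name: {row['sample']}")
        samples[row["sample"]] = row
    return samples


samples = parse_samples(io.StringIO(SAMPLE_SHEET))
for name, row in samples.items():
    print(name, row["condition"], row["replicate"], row["read1"], row["read2"])
```

All values are read as strings, so `replicate` would need an explicit `int(...)` conversion wherever a number is required.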

README.md

Lines changed: 106 additions & 5 deletions
@@ -1,21 +1,122 @@
 # Snakemake workflow: `<name>`
 
-[![Snakemake](https://img.shields.io/badge/snakemake-≥6.3.0-brightgreen.svg)](https://snakemake.github.io)
-[![GitHub actions status](https://github.com/<owner>/<repo>/workflows/Tests/badge.svg?branch=main)](https://github.com/<owner>/<repo>/actions?query=branch%3Amain+workflow%3ATests)
-
+[![Snakemake](https://img.shields.io/badge/snakemake-≥8.0.0-brightgreen.svg)](https://snakemake.github.io)
+[![GitHub actions status](https://github.com/MPUSP/snakemake-workflow-template/actions/workflows/main.yml/badge.svg?branch=main)](https://github.com/MPUSP/snakemake-workflow-template/actions/workflows/main.yml)
+[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
+[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1D355C.svg?labelColor=000000)](https://sylabs.io/docs/)
+[![workflow catalog](https://img.shields.io/badge/Snakemake%20workflow%20catalog-darkgreen)](https://snakemake.github.io/snakemake-workflow-catalog)
 
 A Snakemake workflow for `<description>`
 
+- [Snakemake workflow: `<name>`](#snakemake-workflow-name)
+  - [Usage](#usage)
+  - [Workflow overview](#workflow-overview)
+  - [Running the workflow](#running-the-workflow)
+    - [Input data](#input-data)
+    - [Execution](#execution)
+    - [Parameters](#parameters)
+  - [Authors](#authors)
+  - [References](#references)
+  - [TODO](#todo)
 
 ## Usage
 
 The usage of this workflow is described in the [Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog/?usage=<owner>%2F<repo>).
 
-If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) <repo>sitory and its DOI (see above).
+If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository or its DOI.
+
+## Workflow overview
+
+This workflow is a best-practice workflow for `<detailed description>`.
+The workflow is built using [snakemake](https://snakemake.readthedocs.io/en/stable/) and consists of the following steps:
+
+1. Parse sample sheet containing sample meta data (`python`)
+2. Simulate short read sequencing data on the fly (`dwgsim`)
+3. Check quality of input read data (`FastQC`)
+4. Trim adapters from input data (`cutadapt`)
+5. Collect statistics from tool output (`MultiQC`)
+
+## Running the workflow
+
+### Input data
+
+This template workflow creates artificial sequencing data in `*.fastq.gz` format. It does not contain actual input data. The simulated input files are nevertheless created based on a mandatory table linked in the `config.yml` file (default: `.test/samples.tsv`). The sample sheet has the following layout:
+
+| sample | condition | replicate | read1 | read2 |
+| ------- | --------- | --------- | -------------------------- | -------------------------- |
+| sample1 | wild_type | 1 | sample1.bwa.read1.fastq.gz | sample1.bwa.read2.fastq.gz |
+| sample2 | wild_type | 2 | sample2.bwa.read1.fastq.gz | sample2.bwa.read2.fastq.gz |
+
+
+### Execution
+
+To run the workflow from command line, change the working directory.
+
+```bash
+cd path/to/snakemake-workflow-name
+```
+
+Adjust options in the default config file `config/config.yml`.
+Before running the entire workflow, you can perform a dry run using:
+
+```bash
+snakemake --dry-run
+```
+
+To run the complete workflow with test files using **conda**, execute the following command. The definition of the number of compute cores is mandatory.
+
+```bash
+snakemake --cores 3 --sdm conda --directory .test
+```
+
+To run the workflow with **singularity** / **apptainer**, add a link to a container registry in the `Snakefile`, for example:
+`container: "oras://ghcr.io/<user>/<repository>:<version>"` for Github's container registry. Run the workflow with:
+
+```bash
+snakemake --cores 3 --sdm conda apptainer --directory .test
+```
+
+### Parameters
+
+This table lists all parameters that can be used to run the workflow.
+
+| parameter | type | details | default |
+| ------------------ | ---- | --------------------------------------- | --------------------------------------------- |
+| **samplesheet** | | | |
+| path | str | path to samplesheet, mandatory | "config/samples.tsv" |
+| **get_genome** | | | |
+| database | str | one of `manual`, `ncbi` | `ncbi` |
+| assembly | str | RefSeq ID | `GCF_000006785.2` |
+| fasta | str | optional path to fasta file | Null |
+| gff | str | optional path to gff file | Null |
+| gff_source_type | str | list of name/value pairs for GFF source | see config file |
+| **simulate_reads** | | | |
+| read_length | num | length of target reads in bp | 100 |
+| read_number | num | number of total reads to be simulated | 100000 |
+| random_freq | num | frequency of random read sequences | 0.01 |
+| **cutadapt** | | | |
+| threep_adapter | str | sequence of the 3' adapter | `-a ATCGTAGATCGG` |
+| fivep_adapter | str | sequence of the 5' adapter | `-A GATGGCGATAGG` |
+| default | str | additional options passed to `cutadapt` | [`-q 10 `, `-m 25 `, `-M 100`, `--overlap=5`] |
+| **multiqc** | | | |
+| config | str | path to multiQC config | `config/multiqc_config.yml` |
+
+## Authors
+
+- Firstname Lastname
+  - Affiliation
+  - ORCID profile
+  - home page
+
+## References
+
+> Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. *Sustainable data analysis with Snakemake*. F1000Research, 10:33, 10, 33, **2021**. https://doi.org/10.12688/f1000research.29032.2.
 
-# TODO
+## TODO
 
 * Replace `<owner>` and `<repo>` everywhere in the template (also under .github/workflows) with the correct `<repo>` name and owning user or organization.
 * Replace `<name>` with the workflow name (can be the same as `<repo>`).
 * Replace `<description>` with a description of what the workflow does.
+* Update the workflow description, parameters, running options, authors and references in the `README.md`
+* Update the `README.md` badges. Add or remove badges for `conda`/`singularity`/`apptainer` usage depending on the workflow's capability
 * The workflow will occur in the snakemake-workflow-catalog once it has been made public. Then the link under "Usage" will point to the usage instructions if `<owner>` and `<repo>` were correctly set.
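
The TODO list asks for the `<owner>`, `<repo>`, `<name>` and `<description>` placeholders to be replaced throughout the template. The template ships no tooling for this, so the following is only a sketch of how the substitution could be scripted. `fill_placeholders` and the values `example-org` and `example-workflow` are made up for illustration; the URL is the catalog link from the "Usage" section.

```python
import re

# Matches only the four placeholders named in the TODO list.
PLACEHOLDER = re.compile(r"<(owner|repo|name|description)>")


def fill_placeholders(text, values):
    """Replace known placeholders; leave any without a value untouched."""

    def substitute(match):
        key = match.group(1)
        return values.get(key, match.group(0))

    return PLACEHOLDER.sub(substitute, text)


# Hypothetical owner and repository names, used only for this example.
values = {"owner": "example-org", "repo": "example-workflow"}
template = "https://snakemake.github.io/snakemake-workflow-catalog/?usage=<owner>%2F<repo>"

filled = fill_placeholders(template, values)
print(filled)
print(fill_placeholders("# Snakemake workflow: `<name>`", values))
```

Placeholders without a supplied value, such as `<name>` in the second call, are left in place so that a later pass (or a manual review) can still find them.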

config/README.md

Lines changed: 82 additions & 2 deletions
@@ -1,2 +1,82 @@
-Describe how to configure the workflow (using config.yaml and maybe additional files).
-All of them need to be present with example entries inside of the config folder.
+## Workflow overview
+
+This workflow is a best-practice workflow for `<detailed description>`.
+The workflow is built using [snakemake](https://snakemake.readthedocs.io/en/stable/) and consists of the following steps:
+
+1. Parse sample sheet containing sample meta data (`python`)
+2. Simulate short read sequencing data on the fly (`dwgsim`)
+3. Check quality of input read data (`FastQC`)
+4. Trim adapters from input data (`cutadapt`)
+5. Collect statistics from tool output (`MultiQC`)
+
+## Running the workflow
+
+### Input data
+
+This template workflow creates artificial sequencing data in `*.fastq.gz` format. It does not contain actual input data. The simulated input files are nevertheless created based on a mandatory table linked in the `config.yml` file (default: `.test/samples.tsv`). The sample sheet has the following layout:
+
+| sample | condition | replicate | read1 | read2 |
+| ------- | --------- | --------- | -------------------------- | -------------------------- |
+| sample1 | wild_type | 1 | sample1.bwa.read1.fastq.gz | sample1.bwa.read2.fastq.gz |
+| sample2 | wild_type | 2 | sample2.bwa.read1.fastq.gz | sample2.bwa.read2.fastq.gz |
+
+
+### Execution
+
+To run the workflow from command line, change the working directory.
+
+```bash
+cd path/to/snakemake-workflow-name
+```
+
+Adjust options in the default config file `config/config.yml`.
+Before running the entire workflow, you can perform a dry run using:
+
+```bash
+snakemake --dry-run
+```
+
+To run the complete workflow with test files using **conda**, execute the following command. The definition of the number of compute cores is mandatory.
+
+```bash
+snakemake --cores 3 --sdm conda --directory .test
+```
+
+To run the workflow with **singularity** / **apptainer**, add a link to a container registry in the `Snakefile`, for example:
+`container: "oras://ghcr.io/<user>/<repository>:<version>"` for Github's container registry. Run the workflow with:
+
+```bash
+snakemake --cores 3 --sdm conda apptainer --directory .test
+```
+
+### Parameters
+
+This table lists all parameters that can be used to run the workflow.
+
+| parameter | type | details | default |
+| ------------------ | ---- | --------------------------------------- | --------------------------------------------- |
+| **samplesheet** | | | |
+| path | str | path to samplesheet, mandatory | "config/samples.tsv" |
+| **get_genome** | | | |
+| database | str | one of `manual`, `ncbi` | `ncbi` |
+| assembly | str | RefSeq ID | `GCF_000006785.2` |
+| fasta | str | optional path to fasta file | Null |
+| gff | str | optional path to gff file | Null |
+| gff_source_type | str | list of name/value pairs for GFF source | see config file |
+| **simulate_reads** | | | |
+| read_length | num | length of target reads in bp | 100 |
+| read_number | num | number of total reads to be simulated | 100000 |
+| random_freq | num | frequency of random read sequences | 0.01 |
+| **cutadapt** | | | |
+| threep_adapter | str | sequence of the 3' adapter | `-a ATCGTAGATCGG` |
+| fivep_adapter | str | sequence of the 5' adapter | `-A GATGGCGATAGG` |
+| default | str | additional options passed to `cutadapt` | [`-q 10 `, `-m 25 `, `-M 100`, `--overlap=5`] |
+| **multiqc** | | | |
+| config | str | path to multiQC config | `config/multiqc_config.yml` |
+
+## TODO
+
+* Replace `<owner>` and `<repo>` everywhere in the template (also under .github/workflows) with the correct `<repo>` name and owning user or organization.
+* Replace `<name>` with the workflow name (can be the same as `<repo>`).
+* Replace `<description>` with a description of what the workflow does.
+* Update the workflow parameters and running options

config/config.yml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+samplesheet: ".test/config/samples.tsv"
+
+get_genome:
+  database: "ncbi"
+  assembly: "GCF_000006785.2"
+  fasta: Null
+  gff: Null
+  gff_source_type:
+    [
+      "RefSeq": "gene",
+      "RefSeq": "pseudogene",
+      "RefSeq": "CDS",
+      "Protein Homology": "CDS",
+    ]
+
+simulate_reads:
+  read_length: 100
+  read_number: 100000
+  random_freq: 0.01
+
+cutadapt:
+  threep_adapter: "-a ATCGTAGATCGG"
+  fivep_adapter: "-A GATGGCGATAGG"
+  default: ["-q 10 ", "-m 25 ", "-M 100", "--overlap=5"]
+
+multiqc:
+  config: "config/multiqc_config.yml"

config/multiqc_config.yml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+remove_sections:
+  - samtools-stats

config/schemas/config.schema.yml

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+$schema: "http://json-schema.org/draft-07/schema#"
+description: an entry in the sample sheet
+properties:
+  samplesheet:
+    type: string
+    description: sample name/identifier
+
+  get_genome:
+    properties:
+      database:
+        type: ["string", "null"]
+      assembly:
+        type: ["string", "null"]
+      fasta:
+        type: ["string", "null"]
+      gff:
+        type: ["string", "null"]
+      gff_source_type:
+        type: array
+
+  simulate_reads:
+    properties:
+      read_length:
+        type: number
+      read_number:
+        type: number
+      random_freq:
+        type: number
+
+  cutadapt:
+    properties:
+      threep_adapter:
+        type: string
+      fivep_adapter:
+        type: string
+      default:
+        type: array
+
+  multiqc:
+    properties:
+      config:
+        type: string
+
+required: ["samplesheet", "get_genome", "simulate_reads", "cutadapt", "multiqc"]
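
Snakemake workflows commonly apply such a schema with `snakemake.utils.validate`; whether this workflow does so is not visible in the files shown here. To illustrate what the schema enforces, the following standard-library sketch transcribes its type rules by hand and checks a config dictionary against them. It is not a JSON Schema implementation, and `EXPECTED`, `matches` and `check_config` are hypothetical names.

```python
# Expected Python types per key, transcribed by hand from config.schema.yml.
NUMBER = (int, float)
NULLABLE_STRING = (str, type(None))

EXPECTED = {
    "samplesheet": str,
    "get_genome": {
        "database": NULLABLE_STRING,
        "assembly": NULLABLE_STRING,
        "fasta": NULLABLE_STRING,
        "gff": NULLABLE_STRING,
        "gff_source_type": list,
    },
    "simulate_reads": {
        "read_length": NUMBER,
        "read_number": NUMBER,
        "random_freq": NUMBER,
    },
    "cutadapt": {"threep_adapter": str, "fivep_adapter": str, "default": list},
    "multiqc": {"config": str},
}


def matches(value, expected):
    """Type test that, like JSON Schema, does not count booleans as numbers."""
    if isinstance(value, bool):
        return False
    return isinstance(value, expected)


def check_config(cfg):
    """Return a list of problems; an empty list means the config passed."""
    errors = []
    for key, expected in EXPECTED.items():
        if key not in cfg:
            # every top-level key is listed under `required` in the schema
            errors.append(f"missing required key: {key}")
        elif isinstance(expected, dict):
            section = cfg[key]
            if not isinstance(section, dict):
                continue  # the schema only constrains sections that are mappings
            for name, types in expected.items():
                # nested keys are optional, but typed when present
                if name in section and not matches(section[name], types):
                    errors.append(
                        f"{key}.{name}: unexpected type {type(section[name]).__name__}"
                    )
        elif not matches(cfg[key], expected):
            errors.append(f"{key}: unexpected type {type(cfg[key]).__name__}")
    return errors


# The committed config/config.yml, written out as the dict a YAML parser would yield.
config = {
    "samplesheet": ".test/config/samples.tsv",
    "get_genome": {
        "database": "ncbi",
        "assembly": "GCF_000006785.2",
        "fasta": None,
        "gff": None,
        "gff_source_type": [
            {"RefSeq": "gene"},
            {"RefSeq": "pseudogene"},
            {"RefSeq": "CDS"},
            {"Protein Homology": "CDS"},
        ],
    },
    "simulate_reads": {"read_length": 100, "read_number": 100000, "random_freq": 0.01},
    "cutadapt": {
        "threep_adapter": "-a ATCGTAGATCGG",
        "fivep_adapter": "-A GATGGCGATAGG",
        "default": ["-q 10 ", "-m 25 ", "-M 100", "--overlap=5"],
    },
    "multiqc": {"config": "config/multiqc_config.yml"},
}

# A deliberately broken variant: quoted number and a missing section.
broken = {**config, "simulate_reads": {"read_length": "100"}}
del broken["multiqc"]

print(check_config(config))
print(check_config(broken))
```

Because the schema lists only the five top-level sections under `required`, nested keys such as `read_number` may be absent without causing an error; only their types are checked when they are present.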
