-
Notifications
You must be signed in to change notification settings - Fork 208
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat!: generalize away condition (#66)
* consistent ordering is ensured right after data import here, so this comment is obsolete * apply R lints * refactor variable names * apply more lints * overhaul config setup and make tests more complex * make deseq-init.R configurable via new config setup * apply lints to deseq2.R * actually use config.diffexp.model formula if specified * extend doc comments in config files * make deseq2.R contrast definitions use config.yaml specifications * fix rule align (star) parameters and make gtf an actual input * make samples.tsv and units.tsv complex enough for test case * add trimming explanation to config.yaml * overhaul config/README.md for snakemake workflow catalog explanations * apply lints to plot-pca.R * overhaul PCA plots (plots for all relevant variables, further one requestable, one separate plot per variable) * fix typos in .test/config_complex/config.yaml * fix syntax for complex contrasts * provide more useful error message for typos * add missing comma * add example adapters and strandedness entries to .test/config_basic/units.tsv to actually test the respective code * final check of config/README.md * snakefmt * fix schema for pca: labels: * use config_complex workflow for linting * fix formatting that is only caught in CI * fix cutadapt wrapper params from others: to extra: (originally suggest by @kilpert: f816550) * update actions, hopefully getting newest snakefmt to avoid spurious errors * add basic wildcard_constraints for safer sample and unit names (originally suggested by @kilpert: 5431e08) * snakefmt * fix copy-pasta oversight * fix config/README.md and units.tsv strandedness to read `none` * catch case of no batch_effects ("") in config
- Loading branch information
1 parent
896007e
commit e103c1c
Showing
23 changed files
with
453 additions
and
123 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,12 +12,12 @@ jobs: | |
runs-on: ubuntu-latest | ||
steps: | ||
- name: Checkout with submodules | ||
uses: actions/checkout@v2 | ||
uses: actions/checkout@v3 | ||
with: | ||
submodules: recursive | ||
fetch-depth: 0 | ||
- name: Formatting | ||
uses: github/super-linter@v4 | ||
uses: github/super-linter@v5 | ||
env: | ||
VALIDATE_ALL_CODEBASE: false | ||
DEFAULT_BRANCH: master | ||
|
@@ -26,13 +26,13 @@ jobs: | |
linting: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- uses: actions/checkout@v2 | ||
- uses: actions/checkout@v3 | ||
- name: Linting | ||
uses: snakemake/[email protected] | ||
with: | ||
directory: .test | ||
snakefile: workflow/Snakefile | ||
args: "--lint" | ||
args: "--configfile .test/config_complex/config.yaml --lint" | ||
|
||
run-workflow: | ||
runs-on: ubuntu-latest | ||
|
@@ -41,18 +41,30 @@ jobs: | |
- formatting | ||
steps: | ||
- name: Checkout repository with submodules | ||
uses: actions/checkout@v2 | ||
uses: actions/checkout@v3 | ||
with: | ||
submodules: recursive | ||
- name: Test workflow | ||
- name: Test workflow (basic model, no batch_effects) | ||
uses: snakemake/[email protected] | ||
with: | ||
directory: .test | ||
snakefile: workflow/Snakefile | ||
args: "--use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache" | ||
- name: Test report | ||
args: "--configfile .test/config_basic/config.yaml --use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache" | ||
- name: Test report (basic model, no batch_effects) | ||
uses: snakemake/[email protected] | ||
with: | ||
directory: .test | ||
snakefile: workflow/Snakefile | ||
args: "--report report.zip" | ||
args: "--configfile .test/config_basic/config.yaml --report report.zip" | ||
- name: Test workflow (multiple variables_of_interest, include batch_effects) | ||
uses: snakemake/[email protected] | ||
with: | ||
directory: .test | ||
snakefile: workflow/Snakefile | ||
args: "--configfile .test/config_complex/config.yaml --use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache" | ||
- name: Test report (multiple variables_of_interest, include batch_effects) | ||
uses: snakemake/[email protected] | ||
with: | ||
directory: .test | ||
snakefile: workflow/Snakefile | ||
args: "--configfile .test/config_complex/config.yaml --report report.zip" |
This file was deleted.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
# path or URL to sample sheet (TSV format, columns: sample, condition, ...) | ||
samples: config_basic/samples.tsv | ||
# path or URL to sequencing unit sheet (TSV format, columns: sample, unit, fq1, fq2) | ||
# Units are technical replicates (e.g. lanes, or resequencing of the same biological | ||
# sample). | ||
units: config_basic/units.tsv | ||
|
||
|
||
ref: | ||
# Ensembl species name | ||
species: saccharomyces_cerevisiae | ||
# Ensembl release | ||
release: 100 | ||
# Genome build | ||
build: R64-1-1 | ||
|
||
|
||
trimming: | ||
# If you activate trimming by setting this to `True`, you will have to | ||
# specify the respective cutadapt adapter trimming flag for each unit | ||
# in the `units.tsv` file's `adapters` column | ||
activate: True | ||
|
||
mergeReads: | ||
activate: False | ||
|
||
pca: | ||
activate: True | ||
# Per default, a separate PCA plot is generated for each of the | ||
# `variables_of_interest` and the `batch_effects`, coloring according to | ||
# that variables groups. | ||
# If you want PCA plots for further columns in the samples.tsv sheet, you | ||
# can request them under labels as a list, for example: | ||
# - relatively_uninteresting_variable_X | ||
# - possible_batch_effect_Y | ||
labels: | ||
- condition | ||
|
||
diffexp: | ||
# variables where you are interested in whether they have | ||
# an effect on expression levels | ||
variables_of_interest: | ||
condition: | ||
# any fold change will be relative to this factor level | ||
base_level: untreated | ||
batch_effects: "" | ||
# contrasts for the deseq2 results method to determine fold changes | ||
contrasts: | ||
treated-vs-untreated: | ||
# must be one of the variables_of_interest | ||
variable_of_interest: condition | ||
level_of_interest: treated | ||
# The default model includes all interactions among variables_of_interest | ||
# and batch_effects added on. For the example above this implicitly is: | ||
# model: ~condition | ||
# For the default model to be used, simply specify an empty `model: ""` | ||
# With more variables_of_interest or batch_effects, you could introduce different | ||
# assumptions into your model, by specicifying a different model here. | ||
model: ~condition | ||
|
||
params: | ||
cutadapt-pe: "" | ||
cutadapt-se: "" | ||
star: "" |
File renamed without changes.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
sample_name unit_name fq1 fq2 sra adapters strandedness | ||
A1 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq | ||
B1 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq | ||
A2 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq | ||
B2 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq | ||
A1 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA yes | ||
B1 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA none | ||
A2 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA none | ||
B2 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA reverse |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
# path or URL to sample sheet (TSV format, columns: sample, condition, ...) | ||
samples: config_complex/samples.tsv | ||
# path or URL to sequencing unit sheet (TSV format, columns: sample, unit, fq1, fq2) | ||
# Units are technical replicates (e.g. lanes, or resequencing of the same biological | ||
# sample). | ||
units: config_complex/units.tsv | ||
|
||
|
||
ref: | ||
# Ensembl species name | ||
species: saccharomyces_cerevisiae | ||
# Ensembl release | ||
release: 100 | ||
# Genome build | ||
build: R64-1-1 | ||
|
||
|
||
trimming: | ||
# If you activate trimming by setting this to `True`, you will have to | ||
# specify the respective cutadapt adapter trimming flag for each unit | ||
# in the `units.tsv` file's `adapters` column | ||
activate: False | ||
|
||
mergeReads: | ||
activate: False | ||
|
||
pca: | ||
activate: True | ||
# Per default, a separate PCA plot is generated for each of the | ||
# `variables_of_interest` and the `batch_effects`, coloring according to | ||
# that variables groups. | ||
# If you want PCA plots for further columns in the samples.tsv sheet, you | ||
# can request them under labels as a list, for example: | ||
# - relatively_uninteresting_variable_X | ||
# - possible_batch_effect_Y | ||
labels: | ||
# columns of sample sheet to use for PCA | ||
- jointly_handled | ||
|
||
diffexp: | ||
# variables where you are interested in whether they have | ||
# an effect on expression levels | ||
variables_of_interest: | ||
treatment_1: | ||
# any fold change will be relative to this factor level | ||
base_level: untreated | ||
treatment_2: | ||
# any fold change will be relative to this factor level | ||
base_level: untreated | ||
batch_effects: | ||
- jointly_handled | ||
# contrasts for the deseq2 results method to determine fold changes | ||
contrasts: | ||
treatment_1_alone: | ||
# must be one of the variables_of_interest | ||
variable_of_interest: treatment_1 | ||
# the variable's level to test against the base_level | ||
level_of_interest: treated | ||
treatment_2_alone: | ||
# must be one of the variables_of_interest | ||
variable_of_interest: treatment_2 | ||
# the variable's level to test against the base_level | ||
level_of_interest: treated | ||
# Must be a valid expression for option two in the contrasts description | ||
# of ?results in the DESeq2 package. For a more detailed intro, also see: | ||
# https://github.com/tavareshugo/tutorial_DESeq2_contrasts/blob/main/DESeq2_contrasts.md | ||
both_treatments: 'list(c("treatment_1_treated_vs_untreated", "treatment_2_treated_vs_untreated", "treatment_1treated.treatment_2treated"))' | ||
# The default model includes all interactions among variables_of_interest, | ||
# and batch_effects added on. For the example above this implicitly is: | ||
# model: ~jointly_handled + treatment_1 * treatment_2 | ||
# For the default model to be used, simply specify an empty `model: ""` below. | ||
# If you want to introduce different assumptions into your model, you can | ||
# specify a different model to use, for example skipping the interaction: | ||
# model: ~jointly_handled + treatment_1 + treatment_2 | ||
model: "" | ||
|
||
params: | ||
cutadapt-pe: "" | ||
cutadapt-se: "" | ||
star: "" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
sample_name treatment_1 treatment_2 jointly_handled | ||
A1 treated treated 1 | ||
A2 treated treated 2 | ||
A3 treated untreated 1 | ||
A4 treated untreated 2 | ||
B1 untreated treated 1 | ||
B2 untreated treated 2 | ||
B3 untreated untreated 1 | ||
B4 untreated untreated 2 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
sample_name unit_name fq1 fq2 sra adapters strandedness | ||
A1 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq | ||
A2 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq | ||
A3 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq | ||
A4 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq | ||
B1 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq | ||
B2 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq | ||
B3 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq | ||
B4 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,16 +1,54 @@ | ||
# General settings | ||
To configure this workflow, modify ``config/config.yaml`` according to your needs, following the explanations provided in the file. | ||
# General configuration | ||
|
||
# Sample and unit sheet | ||
To configure this workflow, modify `config/config.yaml` according to your needs, following the explanations provided in the file. | ||
|
||
* Add samples to `config/samples.tsv`. For each sample, the columns `sample_name`, and `condition` have to be defined. The `condition` (healthy/tumor, before Treatment / after Treatment) will be used as contrast for the DEG analysis in DESeq2. To include other relevant variables such as batches, add a new column to the sheet. | ||
* For each sample, add one or more sequencing units (runs, lanes or replicates) to the unit sheet `config/units.tsv`. By activating or deactivating `mergeReads` in the `config/config.yaml`, you can decide wether to merge replicates or run them individually. For each unit, define adapters, and either one (column `fq1`) or two (columns `fq1`, `fq2`) FASTQ files (these can point to anywhere in your system). Alternatively, you can define an SRA (sequence read archive) accession (starting with e.g. ERR or SRR) by using a column `sra`. In the latter case, the pipeline will automatically download the corresponding paired end reads from SRA. If both local files and SRA accession are available, the local files will be preferred. | ||
To choose the correct geneCounts produced by STAR, you can define the strandedness of a unit. STAR produces counts for unstranded ('None' - default), forward oriented ('yes') and reverse oriented ('reverse') protocols. | ||
## `DESeq2` differential expression analysis configuration | ||
|
||
To successfully run the differential expression analysis, you will need to tell DESeq2 which sample annotations to use (annotations are columns in the `samples.tsv` file described below). | ||
This is done in the `config.yaml` file with the entries under `diffexp:`. | ||
The comments for the entries should give all the necessary infos and linkouts. | ||
But if in doubt, please also consult the [`DESeq2` manual](https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html). | ||
|
||
# Sample and unit setup | ||
|
||
The sample and unit setup is specified via tab-separated tabular files (`.tsv`). | ||
Missing values can be specified by empty columns or by writing `NA`. | ||
|
||
# DESeq scenario | ||
## sample sheet | ||
|
||
The default sample sheet is `config/samples.tsv` (as configured in `config/config.yaml`). | ||
Each sample refers to an actual physical sample, and replicates (both biological and technical) may be specified as separate samples. | ||
For each sample, you will always have to specify a `sample_name`. | ||
In addition, all `variables_of_interest` and `batch_effects` specified in the `config/config.yaml` under the `diffexp:` entry, will have to have corresponding columns in the `config/samples.tsv`. | ||
Finally, the sample sheet can contain any number of additional columns. | ||
So if in doubt about whether you might at some point need some metadata you already have at hand, just put it into the sample sheet already---your future self will thank you. | ||
|
||
## unit sheet | ||
|
||
The default unit sheet is `config/units.tsv` (as configured in `config/config.yaml`). | ||
For each sample, add one or more sequencing units (for example if you have several runs or lanes per sample). | ||
|
||
### `.fastq` file source | ||
|
||
For each unit, you will have to define a source for your `.fastq` files. | ||
This can be done via the columns `fq1`, `fq2` and `sra`, with either of: | ||
1. A single `.fastq` file for single-end reads (`fq1` column only; `fq2` and `sra` columns present, but empty). | ||
The entry can be any path on your system, but we suggest something like a `raw/` data directory within your analysis directory. | ||
2. Two `.fastq` files for paired-end reads (columns `fq1` and `fq2`; column `sra` present, but empty). | ||
As for the `fq1` column, the `fq2` column can also point to anywhere on your system. | ||
3. A sequence read archive (SRA) accession number (`sra` column only; `fq1` and `fq2` columns present, but empty). | ||
The workflow will automatically download the corresponding `.fastq` data (currently assumed to be paired-end). | ||
The accession numbers usually start with SRR or ERR and you can find accession numbers for studies of interest with the [SRA Run Selector](https://trace.ncbi.nlm.nih.gov/Traces/study/). | ||
If both local files and an SRA accession are specified for the same unit, the local files will be used. | ||
|
||
### adapter trimming | ||
|
||
If you set `trimming: activate:` in the `config/config.yaml` to `True`, you will have to provide at least one `cutadapt` adapter argument for each unit in the `adapters` column of the `units.tsv` file. | ||
You will need to find out the adapters used in the sequencing protocol that generated a unit: from your sequencing provider, or for published data from the study's metadata (or its authors). | ||
Then, enter the adapter sequences into the `adapters` column of that unit, preceded by the [correct `cutadapt` adapter argument](https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types). | ||
|
||
To initialize the DEG analysis, you need to define a model in the `config/config.yaml`. The model can include all variables introduced as columns in `config/samples.tsv`. | ||
* The standard model is `~condition` - to include a batch variable, write `~batch + condition`. | ||
### strandedness of library preparation protocol | ||
|
||
To get the correct `geneCounts` from `STAR` output, you can provide information on the strandedness of the library preparation protocol used for a unit. | ||
`STAR` can produce counts for unstranded (`none` - this is the default), forward oriented (`yes`) and reverse oriented (`reverse`) protocols. | ||
Enter the respective value into a `strandedness` column in the `units.tsv` file. |
Oops, something went wrong.