Skip to content

Commit

Permalink
feat!: generalize away condition (#66)
Browse files Browse the repository at this point in the history
* consistent ordering is ensured right after data import here, so this comment is obsolete

* apply R lints

* refactor variable names

* apply more lints

* overhaul config setup and make tests more complex

* make deseq-init.R configurable via new config setup

* apply lints to deseq2.R

* actually use config.diffexp.model formula if specified

* extend doc comments in config files

* make deseq2.R contrast definitions use config.yaml specifications

* fix rule align (star) parameters and make gtf an actual input

* make samples.tsv and units.tsv complex enough for test case

* add trimming explanation to config.yaml

* overhaul config/README.md for snakemake workflow catalog explanations

* apply lints to plot-pca.R

* overhaul PCA plots (plots for all relevant variables, further one requestable, one separate plot per variable)

* fix typos in .test/config_complex/config.yaml

* fix syntax for complex contrasts

* provide more useful error message for typos

* add missing comma

* add example adapters and strandedness entries to .test/config_basic/units.tsv to actually test the respective code

* final check of config/README.md

* snakefmt

* fix schema for pca: labels:

* use config_complex workflow for linting

* fix formatting that is only caught in CI

* fix cutadapt wrapper params from others: to extra: (originally suggest by @kilpert: f816550)

* update actions, hopefully getting newest snakefmt to avoid spurious errors

* add basic wildcard_constraints for safer sample and unit names (originally suggested by @kilpert: 5431e08)

* snakefmt

* fix copy-pasta oversight

* fix config/README.md and units.tsv strandedness to read `none`

* catch case of no batch_effects ("") in config
  • Loading branch information
dlaehnemann authored Apr 29, 2023
1 parent 896007e commit e103c1c
Show file tree
Hide file tree
Showing 23 changed files with 453 additions and 123 deletions.
30 changes: 21 additions & 9 deletions .github/workflows/main.yml
Original file line number Diff line number Diff line change
Expand Up @@ -12,12 +12,12 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout with submodules
uses: actions/checkout@v2
uses: actions/checkout@v3
with:
submodules: recursive
fetch-depth: 0
- name: Formatting
uses: github/super-linter@v4
uses: github/super-linter@v5
env:
VALIDATE_ALL_CODEBASE: false
DEFAULT_BRANCH: master
Expand All @@ -26,13 +26,13 @@ jobs:
linting:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- name: Linting
uses: snakemake/[email protected]
with:
directory: .test
snakefile: workflow/Snakefile
args: "--lint"
args: "--configfile .test/config_complex/config.yaml --lint"

run-workflow:
runs-on: ubuntu-latest
Expand All @@ -41,18 +41,30 @@ jobs:
- formatting
steps:
- name: Checkout repository with submodules
uses: actions/checkout@v2
uses: actions/checkout@v3
with:
submodules: recursive
- name: Test workflow
- name: Test workflow (basic model, no batch_effects)
uses: snakemake/[email protected]
with:
directory: .test
snakefile: workflow/Snakefile
args: "--use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache"
- name: Test report
args: "--configfile .test/config_basic/config.yaml --use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache"
- name: Test report (basic model, no batch_effects)
uses: snakemake/[email protected]
with:
directory: .test
snakefile: workflow/Snakefile
args: "--report report.zip"
args: "--configfile .test/config_basic/config.yaml --report report.zip"
- name: Test workflow (multiple variables_of_interest, include batch_effects)
uses: snakemake/[email protected]
with:
directory: .test
snakefile: workflow/Snakefile
args: "--configfile .test/config_complex/config.yaml --use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache"
- name: Test report (multiple variables_of_interest, include batch_effects)
uses: snakemake/[email protected]
with:
directory: .test
snakefile: workflow/Snakefile
args: "--configfile .test/config_complex/config.yaml --report report.zip"
41 changes: 0 additions & 41 deletions .test/config/config.yaml

This file was deleted.

64 changes: 64 additions & 0 deletions .test/config_basic/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
# path or URL to sample sheet (TSV format, columns: sample, condition, ...)
samples: config_basic/samples.tsv
# path or URL to sequencing unit sheet (TSV format, columns: sample, unit, fq1, fq2)
# Units are technical replicates (e.g. lanes, or resequencing of the same biological
# sample).
units: config_basic/units.tsv


ref:
# Ensembl species name
species: saccharomyces_cerevisiae
# Ensembl release
release: 100
# Genome build
build: R64-1-1


trimming:
# If you activate trimming by setting this to `True`, you will have to
# specify the respective cutadapt adapter trimming flag for each unit
# in the `units.tsv` file's `adapters` column
activate: True

mergeReads:
activate: False

pca:
activate: True
# Per default, a separate PCA plot is generated for each of the
# `variables_of_interest` and the `batch_effects`, coloring according to
# that variables groups.
# If you want PCA plots for further columns in the samples.tsv sheet, you
# can request them under labels as a list, for example:
# - relatively_uninteresting_variable_X
# - possible_batch_effect_Y
labels:
- condition

diffexp:
# variables where you are interested in whether they have
# an effect on expression levels
variables_of_interest:
condition:
# any fold change will be relative to this factor level
base_level: untreated
batch_effects: ""
# contrasts for the deseq2 results method to determine fold changes
contrasts:
treated-vs-untreated:
# must be one of the variables_of_interest
variable_of_interest: condition
level_of_interest: treated
# The default model includes all interactions among variables_of_interest
# and batch_effects added on. For the example above this implicitly is:
# model: ~condition
# For the default model to be used, simply specify an empty `model: ""`
# With more variables_of_interest or batch_effects, you could introduce different
# assumptions into your model, by specicifying a different model here.
model: ~condition

params:
cutadapt-pe: ""
cutadapt-se: ""
star: ""
File renamed without changes.
8 changes: 4 additions & 4 deletions .test/config/units.tsv → .test/config_basic/units.tsv
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
sample_name unit_name fq1 fq2 sra adapters strandedness
A1 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq
B1 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq
A2 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq
B2 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq
A1 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA yes
B1 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA none
A2 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA none
B2 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA reverse
80 changes: 80 additions & 0 deletions .test/config_complex/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
# path or URL to sample sheet (TSV format, columns: sample, condition, ...)
samples: config_complex/samples.tsv
# path or URL to sequencing unit sheet (TSV format, columns: sample, unit, fq1, fq2)
# Units are technical replicates (e.g. lanes, or resequencing of the same biological
# sample).
units: config_complex/units.tsv


ref:
# Ensembl species name
species: saccharomyces_cerevisiae
# Ensembl release
release: 100
# Genome build
build: R64-1-1


trimming:
# If you activate trimming by setting this to `True`, you will have to
# specify the respective cutadapt adapter trimming flag for each unit
# in the `units.tsv` file's `adapters` column
activate: False

mergeReads:
activate: False

pca:
activate: True
# Per default, a separate PCA plot is generated for each of the
# `variables_of_interest` and the `batch_effects`, coloring according to
# that variables groups.
# If you want PCA plots for further columns in the samples.tsv sheet, you
# can request them under labels as a list, for example:
# - relatively_uninteresting_variable_X
# - possible_batch_effect_Y
labels:
# columns of sample sheet to use for PCA
- jointly_handled

diffexp:
# variables where you are interested in whether they have
# an effect on expression levels
variables_of_interest:
treatment_1:
# any fold change will be relative to this factor level
base_level: untreated
treatment_2:
# any fold change will be relative to this factor level
base_level: untreated
batch_effects:
- jointly_handled
# contrasts for the deseq2 results method to determine fold changes
contrasts:
treatment_1_alone:
# must be one of the variables_of_interest
variable_of_interest: treatment_1
# the variable's level to test against the base_level
level_of_interest: treated
treatment_2_alone:
# must be one of the variables_of_interest
variable_of_interest: treatment_2
# the variable's level to test against the base_level
level_of_interest: treated
# Must be a valid expression for option two in the contrasts description
# of ?results in the DESeq2 package. For a more detailed intro, also see:
# https://github.com/tavareshugo/tutorial_DESeq2_contrasts/blob/main/DESeq2_contrasts.md
both_treatments: 'list(c("treatment_1_treated_vs_untreated", "treatment_2_treated_vs_untreated", "treatment_1treated.treatment_2treated"))'
# The default model includes all interactions among variables_of_interest,
# and batch_effects added on. For the example above this implicitly is:
# model: ~jointly_handled + treatment_1 * treatment_2
# For the default model to be used, simply specify an empty `model: ""` below.
# If you want to introduce different assumptions into your model, you can
# specify a different model to use, for example skipping the interaction:
# model: ~jointly_handled + treatment_1 + treatment_2
model: ""

params:
cutadapt-pe: ""
cutadapt-se: ""
star: ""
9 changes: 9 additions & 0 deletions .test/config_complex/samples.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
sample_name treatment_1 treatment_2 jointly_handled
A1 treated treated 1
A2 treated treated 2
A3 treated untreated 1
A4 treated untreated 2
B1 untreated treated 1
B2 untreated treated 2
B3 untreated untreated 1
B4 untreated untreated 2
9 changes: 9 additions & 0 deletions .test/config_complex/units.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
sample_name unit_name fq1 fq2 sra adapters strandedness
A1 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq
A2 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq
A3 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq
A4 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq
B1 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq
B2 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq
B3 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq
B4 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq
56 changes: 47 additions & 9 deletions config/README.md
Original file line number Diff line number Diff line change
@@ -1,16 +1,54 @@
# General settings
To configure this workflow, modify ``config/config.yaml`` according to your needs, following the explanations provided in the file.
# General configuration

# Sample and unit sheet
To configure this workflow, modify `config/config.yaml` according to your needs, following the explanations provided in the file.

* Add samples to `config/samples.tsv`. For each sample, the columns `sample_name`, and `condition` have to be defined. The `condition` (healthy/tumor, before Treatment / after Treatment) will be used as contrast for the DEG analysis in DESeq2. To include other relevant variables such as batches, add a new column to the sheet.
* For each sample, add one or more sequencing units (runs, lanes or replicates) to the unit sheet `config/units.tsv`. By activating or deactivating `mergeReads` in the `config/config.yaml`, you can decide wether to merge replicates or run them individually. For each unit, define adapters, and either one (column `fq1`) or two (columns `fq1`, `fq2`) FASTQ files (these can point to anywhere in your system). Alternatively, you can define an SRA (sequence read archive) accession (starting with e.g. ERR or SRR) by using a column `sra`. In the latter case, the pipeline will automatically download the corresponding paired end reads from SRA. If both local files and SRA accession are available, the local files will be preferred.
To choose the correct geneCounts produced by STAR, you can define the strandedness of a unit. STAR produces counts for unstranded ('None' - default), forward oriented ('yes') and reverse oriented ('reverse') protocols.
## `DESeq2` differential expression analysis configuration

To successfully run the differential expression analysis, you will need to tell DESeq2 which sample annotations to use (annotations are columns in the `samples.tsv` file described below).
This is done in the `config.yaml` file with the entries under `diffexp:`.
The comments for the entries should give all the necessary infos and linkouts.
But if in doubt, please also consult the [`DESeq2` manual](https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).

# Sample and unit setup

The sample and unit setup is specified via tab-separated tabular files (`.tsv`).
Missing values can be specified by empty columns or by writing `NA`.

# DESeq scenario
## sample sheet

The default sample sheet is `config/samples.tsv` (as configured in `config/config.yaml`).
Each sample refers to an actual physical sample, and replicates (both biological and technical) may be specified as separate samples.
For each sample, you will always have to specify a `sample_name`.
In addition, all `variables_of_interest` and `batch_effects` specified in the `config/config.yaml` under the `diffexp:` entry, will have to have corresponding columns in the `config/samples.tsv`.
Finally, the sample sheet can contain any number of additional columns.
So if in doubt about whether you might at some point need some metadata you already have at hand, just put it into the sample sheet already---your future self will thank you.

## unit sheet

The default unit sheet is `config/units.tsv` (as configured in `config/config.yaml`).
For each sample, add one or more sequencing units (for example if you have several runs or lanes per sample).

### `.fastq` file source

For each unit, you will have to define a source for your `.fastq` files.
This can be done via the columns `fq1`, `fq2` and `sra`, with either of:
1. A single `.fastq` file for single-end reads (`fq1` column only; `fq2` and `sra` columns present, but empty).
The entry can be any path on your system, but we suggest something like a `raw/` data directory within your analysis directory.
2. Two `.fastq` files for paired-end reads (columns `fq1` and `fq2`; column `sra` present, but empty).
As for the `fq1` column, the `fq2` column can also point to anywhere on your system.
3. A sequence read archive (SRA) accession number (`sra` column only; `fq1` and `fq2` columns present, but empty).
The workflow will automatically download the corresponding `.fastq` data (currently assumed to be paired-end).
The accession numbers usually start with SRR or ERR and you can find accession numbers for studies of interest with the [SRA Run Selector](https://trace.ncbi.nlm.nih.gov/Traces/study/).
If both local files and an SRA accession are specified for the same unit, the local files will be used.

### adapter trimming

If you set `trimming: activate:` in the `config/config.yaml` to `True`, you will have to provide at least one `cutadapt` adapter argument for each unit in the `adapters` column of the `units.tsv` file.
You will need to find out the adapters used in the sequencing protocol that generated a unit: from your sequencing provider, or for published data from the study's metadata (or its authors).
Then, enter the adapter sequences into the `adapters` column of that unit, preceded by the [correct `cutadapt` adapter argument](https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types).

To initialize the DEG analysis, you need to define a model in the `config/config.yaml`. The model can include all variables introduced as columns in `config/samples.tsv`.
* The standard model is `~condition` - to include a batch variable, write `~batch + condition`.
### strandedness of library preparation protocol

To get the correct `geneCounts` from `STAR` output, you can provide information on the strandedness of the library preparation protocol used for a unit.
`STAR` can produce counts for unstranded (`none` - this is the default), forward oriented (`yes`) and reverse oriented (`reverse`) protocols.
Enter the respective value into a `strandedness` column in the `units.tsv` file.
Loading

0 comments on commit e103c1c

Please sign in to comment.