feat!: generalize away condition (#66)

* consistent ordering is ensured right after data import here, so this comment is obsolete * apply R lints * refactor variable names * apply more lints * overhaul config setup and make tests more complex * make deseq-init.R configurable via new config setup * apply lints to deseq2.R * actually use config.diffexp.model formula if specified * extend doc comments in config files * make deseq2.R contrast definitions use config.yaml specifications * fix rule align (star) parameters and make gtf an actual input * make samples.tsv and units.tsv complex enough for test case * add trimming explanation to config.yaml * overhaul config/README.md for snakemake workflow catalog explanations * apply lints to plot-pca.R * overhaul PCA plots (plots for all relevant variables, further one requestable, one separate plot per variable) * fix typos in .test/config_complex/config.yaml * fix syntax for complex contrasts * provide more useful error message for typos * add missing comma * add example adapters and strandedness entries to .test/config_basic/units.tsv to actually test the respective code * final check of config/README.md * snakefmt * fix schema for pca: labels: * use config_complex workflow for linting * fix formatting that is only caught in CI * fix cutadapt wrapper params from others: to extra: (originally suggest by @kilpert: f816550) * update actions, hopefully getting newest snakefmt to avoid spurious errors * add basic wildcard_constraints for safer sample and unit names (originally suggested by @kilpert: 5431e08) * snakefmt * fix copy-pasta oversight * fix config/README.md and units.tsv strandedness to read `none` * catch case of no batch_effects ("") in config
snakemake-workflows · Apr 29, 2023 · e103c1c · e103c1c
1 parent 896007e
commit e103c1c
Show file tree

Hide file tree

Showing 23 changed files with 453 additions and 123 deletions.
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -12,12 +12,12 @@ jobs:
     runs-on: ubuntu-latest
     steps:
     - name: Checkout with submodules
-      uses: actions/checkout@v2
+      uses: actions/checkout@v3
       with:
         submodules: recursive
         fetch-depth: 0
     - name: Formatting
-      uses: github/super-linter@v4
+      uses: github/super-linter@v5
       env:
         VALIDATE_ALL_CODEBASE: false
         DEFAULT_BRANCH: master
@@ -26,13 +26,13 @@ jobs:
   linting:
     runs-on: ubuntu-latest
     steps:
-    - uses: actions/checkout@v2
+    - uses: actions/checkout@v3
     - name: Linting
       uses: snakemake/[email protected]
       with:
         directory: .test
         snakefile: workflow/Snakefile
-        args: "--lint"
+        args: "--configfile .test/config_complex/config.yaml --lint"
 
   run-workflow:
     runs-on: ubuntu-latest
@@ -41,18 +41,30 @@ jobs:
       - formatting
     steps:
     - name: Checkout repository with submodules
-      uses: actions/checkout@v2
+      uses: actions/checkout@v3
       with:
         submodules: recursive
-    - name: Test workflow
+    - name: Test workflow (basic model, no batch_effects)
       uses: snakemake/[email protected]
       with:
         directory: .test
         snakefile: workflow/Snakefile
-        args: "--use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache"
-    - name: Test report
+        args: "--configfile .test/config_basic/config.yaml --use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache"
+    - name: Test report (basic model, no batch_effects)
       uses: snakemake/[email protected]
       with:
         directory: .test
         snakefile: workflow/Snakefile
-        args: "--report report.zip"
+        args: "--configfile .test/config_basic/config.yaml --report report.zip"
+    - name: Test workflow (multiple variables_of_interest, include batch_effects)
+      uses: snakemake/[email protected]
+      with:
+        directory: .test
+        snakefile: workflow/Snakefile
+        args: "--configfile .test/config_complex/config.yaml --use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache"
+    - name: Test report (multiple variables_of_interest, include batch_effects)
+      uses: snakemake/[email protected]
+      with:
+        directory: .test
+        snakefile: workflow/Snakefile
+        args: "--configfile .test/config_complex/config.yaml --report report.zip"
diff --git a/.test/config/config.yaml b/.test/config/config.yaml
diff --git a/.test/config_basic/config.yaml b/.test/config_basic/config.yaml
@@ -0,0 +1,64 @@
+# path or URL to sample sheet (TSV format, columns: sample, condition, ...)
+samples: config_basic/samples.tsv
+# path or URL to sequencing unit sheet (TSV format, columns: sample, unit, fq1, fq2)
+# Units are technical replicates (e.g. lanes, or resequencing of the same biological
+# sample).
+units: config_basic/units.tsv
+
+
+ref:
+  # Ensembl species name
+  species: saccharomyces_cerevisiae
+  # Ensembl release
+  release: 100
+  # Genome build
+  build: R64-1-1
+
+
+trimming:
+  # If you activate trimming by setting this to `True`, you will have to
+  # specify the respective cutadapt adapter trimming flag for each unit
+  # in the `units.tsv` file's `adapters` column
+  activate: True
+
+mergeReads:
+  activate: False
+
+pca:
+  activate: True
+  # Per default, a separate PCA plot is generated for each of the
+  # `variables_of_interest` and the `batch_effects`, coloring according to
+  # that variables groups.
+  # If you want PCA plots for further columns in the samples.tsv sheet, you
+  # can request them under labels as a list, for example:
+  # - relatively_uninteresting_variable_X
+  # - possible_batch_effect_Y
+  labels:
+    - condition
+
+diffexp:
+  # variables where you are interested in whether they have
+  # an effect on expression levels
+  variables_of_interest:
+    condition:
+      # any fold change will be relative to this factor level
+      base_level: untreated
+  batch_effects: ""
+  # contrasts for the deseq2 results method to determine fold changes
+  contrasts:
+    treated-vs-untreated:
+      # must be one of the variables_of_interest
+      variable_of_interest: condition
+      level_of_interest: treated
+  # The default model includes all interactions among variables_of_interest
+  # and batch_effects added on. For the example above this implicitly is:
+  # model: ~condition
+  # For the default model to be used, simply specify an empty `model: ""`
+  # With more variables_of_interest or batch_effects, you could introduce different
+  # assumptions into your model, by specicifying a different model here.
+  model: ~condition
+
+params:
+  cutadapt-pe: ""
+  cutadapt-se: ""
+  star: ""
diff --git a/.test/config/samples.tsv → .test/config_basic/samples.tsv b/.test/config/samples.tsv → .test/config_basic/samples.tsv
diff --git a/.test/config/units.tsv → .test/config_basic/units.tsv b/.test/config/units.tsv → .test/config_basic/units.tsv
@@ -1,5 +1,5 @@
 sample_name	unit_name	fq1	fq2	sra	adapters	strandedness
-A1	1	ngs-test-data/reads/a.scerevisiae.1.fq	ngs-test-data/reads/a.scerevisiae.2.fq			
-B1	1	ngs-test-data/reads/c.scerevisiae.1.fq	ngs-test-data/reads/c.scerevisiae.2.fq			
-A2	1	ngs-test-data/reads/c.scerevisiae.1.fq	ngs-test-data/reads/c.scerevisiae.2.fq			
-B2	1	ngs-test-data/reads/b.scerevisiae.1.fq	ngs-test-data/reads/b.scerevisiae.2.fq			
+A1	1	ngs-test-data/reads/a.scerevisiae.1.fq	ngs-test-data/reads/a.scerevisiae.2.fq		-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	yes
+B1	1	ngs-test-data/reads/c.scerevisiae.1.fq	ngs-test-data/reads/c.scerevisiae.2.fq		-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	none
+A2	1	ngs-test-data/reads/c.scerevisiae.1.fq	ngs-test-data/reads/c.scerevisiae.2.fq		-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	none
+B2	1	ngs-test-data/reads/b.scerevisiae.1.fq	ngs-test-data/reads/b.scerevisiae.2.fq		-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	reverse
diff --git a/.test/config_complex/config.yaml b/.test/config_complex/config.yaml
@@ -0,0 +1,80 @@
+# path or URL to sample sheet (TSV format, columns: sample, condition, ...)
+samples: config_complex/samples.tsv
+# path or URL to sequencing unit sheet (TSV format, columns: sample, unit, fq1, fq2)
+# Units are technical replicates (e.g. lanes, or resequencing of the same biological
+# sample).
+units: config_complex/units.tsv
+
+
+ref:
+  # Ensembl species name
+  species: saccharomyces_cerevisiae
+  # Ensembl release
+  release: 100
+  # Genome build
+  build: R64-1-1
+
+
+trimming:
+  # If you activate trimming by setting this to `True`, you will have to
+  # specify the respective cutadapt adapter trimming flag for each unit
+  # in the `units.tsv` file's `adapters` column
+  activate: False
+
+mergeReads:
+  activate: False
+
+pca:
+  activate: True
+  # Per default, a separate PCA plot is generated for each of the
+  # `variables_of_interest` and the `batch_effects`, coloring according to
+  # that variables groups.
+  # If you want PCA plots for further columns in the samples.tsv sheet, you
+  # can request them under labels as a list, for example:
+  # - relatively_uninteresting_variable_X
+  # - possible_batch_effect_Y
+  labels:
+    # columns of sample sheet to use for PCA
+    - jointly_handled
+
+diffexp:
+  # variables where you are interested in whether they have
+  # an effect on expression levels
+  variables_of_interest:
+    treatment_1:
+      # any fold change will be relative to this factor level
+      base_level: untreated
+    treatment_2:
+      # any fold change will be relative to this factor level
+      base_level: untreated
+  batch_effects:
+    - jointly_handled
+  # contrasts for the deseq2 results method to determine fold changes
+  contrasts:
+    treatment_1_alone:
+      # must be one of the variables_of_interest
+      variable_of_interest: treatment_1
+      # the variable's level to test against the base_level
+      level_of_interest: treated
+    treatment_2_alone:
+      # must be one of the variables_of_interest
+      variable_of_interest: treatment_2
+      # the variable's level to test against the base_level
+      level_of_interest: treated
+    # Must be a valid expression for option two in the contrasts description
+    # of ?results in the DESeq2 package. For a more detailed intro, also see:
+    # https://github.com/tavareshugo/tutorial_DESeq2_contrasts/blob/main/DESeq2_contrasts.md
+    both_treatments: 'list(c("treatment_1_treated_vs_untreated", "treatment_2_treated_vs_untreated", "treatment_1treated.treatment_2treated"))'
+  # The default model includes all interactions among variables_of_interest,
+  # and batch_effects added on. For the example above this implicitly is:
+  # model: ~jointly_handled + treatment_1 * treatment_2
+  # For the default model to be used, simply specify an empty `model: ""` below.
+  # If you want to introduce different assumptions into your model, you can
+  # specify a different model to use, for example skipping the interaction:
+  # model: ~jointly_handled + treatment_1 + treatment_2
+  model: ""
+
+params:
+  cutadapt-pe: ""
+  cutadapt-se: ""
+  star: ""
diff --git a/.test/config_complex/samples.tsv b/.test/config_complex/samples.tsv
@@ -0,0 +1,9 @@
+sample_name	treatment_1	treatment_2	jointly_handled
+A1	treated	treated	1
+A2	treated	treated	2
+A3	treated	untreated	1
+A4	treated	untreated	2
+B1	untreated	treated	1
+B2	untreated	treated	2
+B3	untreated	untreated	1
+B4	untreated	untreated	2
diff --git a/.test/config_complex/units.tsv b/.test/config_complex/units.tsv
@@ -0,0 +1,9 @@
+sample_name	unit_name	fq1	fq2	sra	adapters	strandedness
+A1	1	ngs-test-data/reads/a.scerevisiae.1.fq	ngs-test-data/reads/a.scerevisiae.2.fq			
+A2	1	ngs-test-data/reads/a.scerevisiae.1.fq	ngs-test-data/reads/a.scerevisiae.2.fq			
+A3	1	ngs-test-data/reads/c.scerevisiae.1.fq	ngs-test-data/reads/c.scerevisiae.2.fq			
+A4	1	ngs-test-data/reads/c.scerevisiae.1.fq	ngs-test-data/reads/c.scerevisiae.2.fq			
+B1	1	ngs-test-data/reads/c.scerevisiae.1.fq	ngs-test-data/reads/c.scerevisiae.2.fq			
+B2	1	ngs-test-data/reads/b.scerevisiae.1.fq	ngs-test-data/reads/b.scerevisiae.2.fq			
+B3	1	ngs-test-data/reads/b.scerevisiae.1.fq	ngs-test-data/reads/b.scerevisiae.2.fq			
+B4	1	ngs-test-data/reads/c.scerevisiae.1.fq	ngs-test-data/reads/c.scerevisiae.2.fq			
diff --git a/config/README.md b/config/README.md
@@ -1,16 +1,54 @@
-# General settings
-To configure this workflow, modify ``config/config.yaml`` according to your needs, following the explanations provided in the file.
+# General configuration
 
-# Sample and unit sheet
+To configure this workflow, modify `config/config.yaml` according to your needs, following the explanations provided in the file.
 
-* Add samples to `config/samples.tsv`. For each sample, the columns `sample_name`, and `condition` have to be defined. The `condition` (healthy/tumor, before Treatment / after Treatment) will be used as contrast for the DEG analysis in DESeq2. To include other relevant variables such as batches, add a new column to the sheet.
-* For each sample, add one or more sequencing units (runs, lanes or replicates) to the unit sheet `config/units.tsv`. By activating or deactivating `mergeReads` in the `config/config.yaml`, you can decide wether to merge replicates or run them individually. For each unit, define adapters, and either one (column `fq1`) or two (columns `fq1`, `fq2`) FASTQ files (these can point to anywhere in your system). Alternatively, you can define an SRA (sequence read archive) accession (starting with e.g. ERR or SRR) by using a column `sra`. In the latter case, the pipeline will automatically download the corresponding paired end reads from SRA. If both local files and SRA accession are available, the local files will be preferred.
-To choose the correct geneCounts produced by STAR, you can define the strandedness of a unit. STAR produces counts for unstranded ('None' - default), forward oriented ('yes') and reverse oriented ('reverse') protocols.  
+## `DESeq2` differential expression analysis configuration
 
+To successfully run the differential expression analysis, you will need to tell DESeq2 which sample annotations to use (annotations are columns in the `samples.tsv` file described below).
+This is done in the `config.yaml` file with the entries under `diffexp:`.
+The comments for the entries should give all the necessary infos and linkouts.
+But if in doubt, please also consult the [`DESeq2` manual](https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
+
+# Sample and unit setup
+
+The sample and unit setup is specified via tab-separated tabular files (`.tsv`).
 Missing values can be specified by empty columns or by writing `NA`.
 
-# DESeq scenario
+## sample sheet
+
+The default sample sheet is `config/samples.tsv` (as configured in `config/config.yaml`).
+Each sample refers to an actual physical sample, and replicates (both biological and technical) may be specified as separate samples.
+For each sample, you will always have to specify a `sample_name`.
+In addition, all `variables_of_interest` and `batch_effects` specified in the `config/config.yaml` under the `diffexp:` entry, will have to have corresponding columns in the `config/samples.tsv`.
+Finally, the sample sheet can contain any number of additional columns.
+So if in doubt about whether you might at some point need some metadata you already have at hand, just put it into the sample sheet already---your future self will thank you.
+
+## unit sheet
+
+The default unit sheet is `config/units.tsv` (as configured in `config/config.yaml`).
+For each sample, add one or more sequencing units (for example if you have several runs or lanes per sample).
+
+### `.fastq` file source
+
+For each unit, you will have to define a source for your `.fastq` files.
+This can be done via the columns `fq1`, `fq2` and `sra`, with either of:
+1. A single `.fastq` file for single-end reads (`fq1` column only; `fq2` and `sra` columns present, but empty).
+  The entry can be any path on your system, but we suggest something like a `raw/` data directory within your analysis directory.
+2. Two `.fastq` files for paired-end reads (columns `fq1` and `fq2`; column `sra` present, but empty).
+  As for the `fq1` column, the `fq2` column can also point to anywhere on your system.
+3. A sequence read archive (SRA) accession number (`sra` column only; `fq1` and `fq2` columns present, but empty).
+  The workflow will automatically download the corresponding `.fastq` data (currently assumed to be paired-end).
+  The accession numbers usually start with SRR or ERR and you can find accession numbers for studies of interest with the [SRA Run Selector](https://trace.ncbi.nlm.nih.gov/Traces/study/).
+If both local files and an SRA accession are specified for the same unit, the local files will be used.
+
+### adapter trimming
+
+If you set `trimming: activate:` in the `config/config.yaml` to `True`, you will have to provide at least one `cutadapt` adapter argument for each unit in the `adapters` column of the `units.tsv` file.
+You will need to find out the adapters used in the sequencing protocol that generated a unit: from your sequencing provider, or for published data from the study's metadata (or its authors).
+Then, enter the adapter sequences into the `adapters` column of that unit, preceded by the [correct `cutadapt` adapter argument](https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types).
 
-To initialize the DEG analysis, you need to define a model in the `config/config.yaml`. The model can include all variables introduced as columns in `config/samples.tsv`.
-* The standard model is `~condition` - to include a batch variable, write `~batch + condition`.
+### strandedness of library preparation protocol
 
+To get the correct `geneCounts` from `STAR` output, you can provide information on the strandedness of the library preparation protocol used for a unit.
+`STAR` can produce counts for unstranded (`none` - this is the default), forward oriented (`yes`) and reverse oriented (`reverse`) protocols.  
+Enter the respective value into a `strandedness` column in the `units.tsv` file.