diff --git a/.gitignore b/.gitignore index 23a8550..0cf8249 100644 --- a/.gitignore +++ b/.gitignore @@ -8,6 +8,10 @@ resources/** !.gitignore !.gitattributes !.editorconfig +**.conda +**.condarc +**.cache +**.java # Custom additions Notes.md diff --git a/README.md b/README.md index 107f29f..a01ba01 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # snakemake-bacterial-riboseq ![Platform](https://img.shields.io/badge/platform-all-green) -[![Snakemake](https://img.shields.io/badge/snakemake-≥7.0.0-brightgreen.svg)](https://snakemake.github.io) +[![Snakemake](https://img.shields.io/badge/snakemake-≥8.0.0-brightgreen.svg)](https://snakemake.github.io) [![Tests](https://github.com/MPUSP/snakemake-bacterial-riboseq/actions/workflows/main.yml/badge.svg)](https://github.com/MPUSP/snakemake-bacterial-riboseq/actions/workflows/main.yml) [![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/) [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1D355C.svg?labelColor=000000)](https://sylabs.io/docs/) @@ -33,26 +33,27 @@ If you use this workflow in a paper, don't forget to give credits to the authors ## Workflow overview + ----------- +--- This workflow is a best-practice workflow for the analysis of ribosome footprint sequencing (Ribo-Seq) data. The workflow is built using [snakemake](https://snakemake.readthedocs.io/en/stable/) and consists of the following steps: - 1. Obtain genome database in `fasta` and `gff` format (`python`, [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/)) - 1. Using automatic download from NCBI with a `RefSeq` ID - 2. Using user-supplied files - 2. Check quality of input sequencing data (`FastQC`) - 3. Cut adapters and filter by length and/or sequencing quality score (`cutadapt`) - 4. Deduplicate reads by unique molecular identifier (UMI, `umi_tools`) - 5. Map reads to the reference genome (`STAR aligner`) - 6. Sort and index for aligned seq data (`samtools`) - 7. Filter reads by feature type (`bedtools`) - 8. Generate summary report for all processing steps (`MultiQC`) - 9. Shift ribo-seq reads according to the ribosome's P-site alignment (`R`, `ORFik`) - 10. Calculate basic gene-wise statistics such as RPKM (`R`, `ORFik`) - 11. Return report as HTML and PDF files (`R markdown`, `weasyprint`) +1. Obtain genome database in `fasta` and `gff` format (`python`, [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/)) + 1. Using automatic download from NCBI with a `RefSeq` ID + 2. Using user-supplied files +2. Check quality of input sequencing data (`FastQC`) +3. Cut adapters and filter by length and/or sequencing quality score (`cutadapt`) +4. Deduplicate reads by unique molecular identifier (UMI, `umi_tools`) +5. Map reads to the reference genome (`STAR aligner`) +6. Sort and index for aligned seq data (`samtools`) +7. Filter reads by feature type (`bedtools`) +8. Generate summary report for all processing steps (`MultiQC`) +9. Shift ribo-seq reads according to the ribosome's P-site alignment (`R`, `ORFik`) +10. Calculate basic gene-wise statistics such as RPKM (`R`, `ORFik`) +11. Return report as HTML and PDF files (`R markdown`, `weasyprint`) If you want to contribute, report issues, or suggest features, please get in touch on [github](https://github.com/MPUSP/snakemake-bacterial-riboseq). @@ -67,18 +68,7 @@ cd snakemake-bacterial-riboseq **Step 2: Install dependencies** -It is recommended to install snakemake and run the workflow with `conda`, `mamba` or `micromamba`. - -```bash -# download Miniconda3 installer -wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh -# install Conda (respond by 'yes') -bash miniconda.sh -# update Conda -conda update -y conda -# install Mamba -conda install -n base -c conda-forge -y mamba -``` +It is recommended to install snakemake and run the workflow with `conda` or `mamba`. [Miniforge](https://conda-forge.org/download/) is the preferred conda-forge installer and includes `conda`, `mamba` and their dependencies. **Step 3: Create snakemake environment** @@ -94,7 +84,6 @@ conda activate snakemake-bacterial-riboseq All other dependencies for the workflow are **automatically pulled as `conda` environments** by snakemake, when running the workflow with the `--use-conda` parameter (recommended). - ## Running the workflow ### Input data @@ -122,7 +111,7 @@ Ribosome footprint sequencing data in `*.fastq.gz` format. The currently support Some configuration parameters of the pipeline may be specific for your data and library preparation protocol. The options should be adjusted in the `config.yml` file. For example: - Minimum and maximum read length after adapter removal (see option `cutadapt: default`). Here, the test data has a minimum read length of 15 + 7 = 22 (2 nt on 5'end + 5 nt on 3'end), and a maximum of 45 + 7 = 52. -- Unique molecular identifiers (UMIs). For example, the protocol by [McGlincy & Ingolia, 2017](https://doi.org/10.1016/J.YMETH.2017.05.028) creates a UMI that is located on both the 5'-end (2 nt) and the 3'-end (5 nt). These UMIs are extracted with `umi_tools` (see options `umi_extraction: method` and `pattern`). +- Unique molecular identifiers (UMIs). For example, the protocol by [McGlincy & Ingolia, 2017](https://doi.org/10.1016/J.YMETH.2017.05.028) creates a UMI that is located on both the 5'-end (2 nt) and the 3'-end (5 nt). These UMIs are extracted with `umi_tools` (see options `umi_extraction: method` and `pattern`). Example configuration files for different sequencing protocols can be found in `resources/protocols/`. @@ -144,13 +133,13 @@ snakemake --dry-run To run the complete workflow with test files using **`conda`**, execute the following command. The definition of the number of compute cores is mandatory. ```bash -snakemake --cores 10 --use-conda --directory .test +snakemake --cores 10 --sdm conda --directory .test ``` -To run the workflow with **singularity**, use: +To run the workflow with **singularity** / **apptainer**, use: ```bash -snakemake --cores 10 --use-singularity --use-conda --directory .test +snakemake --cores 10 --sdm conda apptainer --directory .test ``` ### Parameters @@ -218,11 +207,11 @@ This table lists all parameters that can be used to run the workflow. - ORCID profile: https://orcid.org/0000-0002-3913-153X - github page: https://github.com/m-jahn - Visit the MPUSP github page at https://github.com/MPUSP for more info on this workflow and other projects. ## References - Essential tools are linked in the top section of this document - The sequencing library preparation is based on the publication: + > McGlincy, N. J., & Ingolia, N. T. _Transcriptome-wide measurement of translation by ribosome profiling_. Methods, 126, 112–129, **2017**. https://doi.org/10.1016/J.YMETH.2017.05.028. diff --git a/workflow/Snakefile b/workflow/Snakefile index ab52982..f1551e0 100644 --- a/workflow/Snakefile +++ b/workflow/Snakefile @@ -8,13 +8,14 @@ # may apply. # # ----------------------------------------------------- # +conda: + "envs/global.yml" + import os import pandas as pd from datetime import date from snakemake.utils import min_version -# min_version("7.0") - __author__ = "Rina Ahmed-Begrich, Michael Jahn" __year__ = str(date.today()).split("-")[0] diff --git a/workflow/envs/global.yml b/workflow/envs/global.yml new file mode 100644 index 0000000..8c334a1 --- /dev/null +++ b/workflow/envs/global.yml @@ -0,0 +1,6 @@ +name: global +channels: + - conda-forge + - bioconda +dependencies: + - pandas=2.2.2 diff --git a/workflow/notebooks/report.Rmd b/workflow/notebooks/report.Rmd index dea786f..f32828a 100644 --- a/workflow/notebooks/report.Rmd +++ b/workflow/notebooks/report.Rmd @@ -140,7 +140,6 @@ Imported results: # tables df_annotated <- read_csv(snakemake@input$orfs_annotated, show_col_types = FALSE) df_features <- read_csv(snakemake@input$orfs_features, show_col_types = FALSE) -df_features <- mutate(df_features, sample_name = str_remove(sample_name, "\\.bam$")) # import original and processed read data bam <- lapply(snakemake@input[["bam_filtered"]], ORFik::readBam) @@ -153,6 +152,11 @@ names(bam_shift) <- str_split_i(snakemake@input[["bam_shifted"]], "\\/", -1) %>% sample_names <- names(bam) +# condition table +df_conditions <- df_features %>% + dplyr::select(sample_name, condition, replicate) %>% + distinct() + # GRanges load(snakemake@input$granges) @@ -192,21 +196,32 @@ head(df_features) - the default behavior is to filter out reads that are too short to be reliable footprints (config option `shift_reads: read_length`) ```{r, echo = FALSE, warning = FALSE, fig.width = figwidth, fig.height = figheight} -plot_read_number <- c(bam, bam_shift) %>% +df_read_number <- c(bam, bam_shift) %>% lapply(length) %>% unlist() %>% - enframe() %>% + enframe(name = "sample_name", value = "count") %>% mutate(status = rep(c("original", "filtered"), each = length(sample_names))) %>% - ggplot(aes(y = fct_rev(name), x = value, fill = status)) + - geom_col(position = "dodge") + + left_join(df_conditions, by = "sample_name") %>% + mutate(replicate = fct_inseq(factor(replicate))) + +plot_read_number <- df_read_number %>% + ggplot(aes(y = fct_rev(condition), x = count, fill = replicate)) + + geom_col(data = . %>% filter(status == "original"), position = position_dodge(width = 0.9), alpha = 0.4) + + geom_col(data = . %>% filter(status == "filtered"), position = position_dodge(width = 0.9), alpha = 1.0) + + geom_text( + data = . %>% filter(status == "original"), + position = position_dodge(0.9), hjust = 0, size = 3, + aes(x = count + max(count / 100), label = paste0(round(count / 10^6, 1), "M"), color = replicate) + ) + geom_text( + data = . %>% filter(status == "filtered"), position = position_dodge(0.9), hjust = 0, size = 3, - aes(x = value + max(value / 100), label = paste0(round(value / 10^6, 1), "M"), color = status) + aes(x = count + max(count / 100), label = paste0(round(count / 10^6, 1), "M"), color = replicate) ) + labs( x = "n total reads", y = "", title = "Read number for all samples", - subtitle = "color: before and after filtering by size" + subtitle = "color: replicate, transparency: before and after filtering by size" ) + custom_theme(legend.position = "bottom") + scale_fill_manual(values = custom_colors) + @@ -225,7 +240,7 @@ print(plot_read_number) ### Read length distribution -- this figure shows read length distribution for all samples and data types +- this figure shows read length distribution for all samples - mapped read lengths from ribosome profiling data are usually variable - often one can see a bimodal distribution, with the larger reads being more reliable (representing genuine footprints) - shorter read fractions (e.g. shorter than a regular footprint) are ususally discarded @@ -239,22 +254,27 @@ df_read_lengths <- lapply(bam, function(file) { enframe(name = "read_length", value = "read_count") %>% mutate(read_count = as.numeric(read_count)) }) %>% - bind_rows(.id = "name") + bind_rows(.id = "sample_name") %>% + left_join(df_conditions, by = "sample_name") %>% + mutate(replicate = fct_inseq(factor(replicate))) plot_read_length <- df_read_lengths %>% - group_by(name) %>% + group_by(condition, replicate) %>% mutate(read_count = read_count / sum(read_count) * 100) %>% ggplot(aes( x = as.numeric(read_length), - y = read_count + y = read_count, + color = condition, + linetype = replicate )) + geom_step() + labs( x = "read length", y = "frequency (% total reads)", - title = "Read lengths distribution, per sample" + title = "Read lengths distribution, per condition", + subtitle = "color: condition, line type: replicate" ) + custom_theme() + - facet_wrap(~name) + + facet_wrap(~condition) + scale_color_manual(values = custom_colors) if (export_figures) { @@ -599,26 +619,27 @@ print(plot_orfs_anno) ```{r, echo = FALSE, warning = FALSE} df_feat_long <- df_features %>% dplyr::select( - group_name, sample_name, countRFP, cpmRFP, fpkmRFP, floss, + group_name, condition, replicate, countRFP, cpmRFP, fpkmRFP, floss, entropyRFP, disengagementScores, RRS, RSS, ORFScores, ioScore, startCodonCoverage, startRegionCoverage, startRegionRelative ) %>% - mutate(across(!c("group_name", "sample_name"), ~ log10(abs(.x)))) %>% - pivot_longer(cols = !c("group_name", "sample_name"), names_to = "feature", values_to = "value") + mutate(across(!c("group_name", "condition", "replicate"), ~ log10(abs(.x)))) %>% + pivot_longer(cols = !c("group_name", "condition", "replicate"), names_to = "feature", values_to = "value") ``` ```{r, echo = FALSE, warning = FALSE, fig.width = figwidth, fig.height = figwidth} plot_data_features <- df_feat_long %>% filter(!is.infinite(value), !is.na(value)) %>% - group_by(sample_name, feature) %>% + group_by(condition, feature, group_name) %>% + summarize(value = mean(value, na.rm = TRUE)) %>% mutate(bins = cut_interval(value, n = 40, labels = FALSE)) %>% group_by(bins, .add = TRUE) %>% summarize(freq = n(), .groups = "drop") %>% - ggplot(aes(x = bins, y = freq, color = sample_name)) + + ggplot(aes(x = bins, y = freq, color = condition)) + geom_step() + labs( title = "Distribution of log10 transformed features", - subtitle = "x-axis is rescaled for compatibility", + subtitle = "mean of replicates; x-axis is rescaled for compatibility", x = "log10 normalized score", y = "frequency" ) + custom_theme(aspect.ratio = 1, legend.position = "bottom") + @@ -655,8 +676,10 @@ df_fpkm <- df_features %>% pivot_wider(id_cols = group_name, names_from = sample_name, values_from = fpkmRFP) %>% dplyr::select(-group_name) -plot_sample_corr <- ggpairs(df_fpkm) + +plot_sample_corr <- ggpairs(df_fpkm, aes(color = "")) + labs(title = "Correlation of log10 transformed FPKM") + + scale_color_manual(values = alpha(custom_colors, 0.7)) + + scale_fill_manual(values = alpha(custom_colors, 0.5)) + custom_theme() if (export_figures) { @@ -670,6 +693,55 @@ if (export_figures) { print(plot_sample_corr) ``` +#### Variation of read counts between replicates {-} + +- this figure shows a histogram of the coeffcient of variation (CV, or relative standard deviation) +- the CV was calculated based on all replicates of a sample +- a distribution where the majority of genes have a CV below 25% can be considered good +- if no replicates were used, this step is omitted + +```{r, echo = FALSE, warning = FALSE, fig.width = figwidth, fig.height = figwidth} +plot_cv <- df_features %>% + dplyr::select(group_name, condition, replicate, fpkmRFP) %>% + filter(fpkmRFP != 0) %>% + group_by(group_name, condition) %>% + summarize( + mean_fpkm = mean(fpkmRFP, na.rm = TRUE), + sd_fpkm = sd(fpkmRFP), + cv_fpkm = sd_fpkm / mean_fpkm + ) %>% + ungroup() %>% + mutate( + bins = cut_width(cv_fpkm, width = 0.05), + bins = as.numeric(str_extract(bins, "[0-9]*\\.[0-9]*")) + ) %>% + group_by(condition, bins, .add = TRUE) %>% + summarize(freq = n(), .groups = "drop") %>% + ggplot(aes(x = bins, y = freq, color = condition)) + + geom_step(show.legend = FALSE) + + labs( + title = "Distribution of coefficient of variation (CV)", + x = "CV", y = "frequency" + ) + + custom_theme(aspect.ratio = 1, legend.position = "bottom") + + facet_wrap(~condition) + + theme(axis.text.x = element_blank()) + + scale_color_manual(values = rep_len( + custom_colors, + length.out = length(unique(df_features$condition)) + )) + +if (export_figures) { + save_plot( + plot_sample_corr, + path = export_plots, + width = figwidth, height = figwidth + ) +} + +print(plot_cv) +``` + #### Principal component analysis {-} - this figure shows the correlation between samples/replicates for selected scores @@ -691,16 +763,22 @@ if (!"PC3" %in% colnames(df_pca)) { df_pca$PC3 <- 1 } +df_pca <- left_join( + df_pca, + df_conditions, + by = "sample_name" +) + plot_pca <- df_pca %>% ggplot(aes( - x = PC1, y = PC2, size = PC3, color = sample_name, - label = sample_name + x = PC1, y = PC2, size = PC3, color = condition, + label = replicate )) + geom_point() + - geom_text_repel(size = 3, point.padding = 10) + + geom_text_repel(size = 3, point.padding = 10, show.legend = FALSE) + labs( title = "PCA of log10 transformed FPKM values", - subtitle = "point size encodes PC3, color encodes sample" + subtitle = "point size encodes PC3, color encodes condition, point label is replicate" ) + custom_theme(aspect.ratio = 1, legend.position = "bottom") + scale_color_manual(values = rep_len( @@ -742,7 +820,7 @@ For all other feedback, contact the author(s). - ORCID profile: https://orcid.org/0000-0002-0656-1795 - github page: https://github.com/rabioinf -Visit the MPUSP github page at https://github.com/MPUSP for more info on this workflow and other projects. +Visit the MPUSP github page at https://github.com/MPUSP for more info about this workflow and other projects. ## Data accessability diff --git a/workflow/rules/postprocessing.smk b/workflow/rules/postprocessing.smk index cd17f42..c6e6cf1 100644 --- a/workflow/rules/postprocessing.smk +++ b/workflow/rules/postprocessing.smk @@ -74,6 +74,7 @@ rule shift_reads: # ----------------------------------------------------- rule feature_stats: input: + samples=config["samplesheet"], fasta=rules.get_genome.output.fasta, gff_rna=rules.extract_features.output.gff, gff_cds=rules.extract_mRNA_features.output.gff, diff --git a/workflow/scripts/feature_stats.R b/workflow/scripts/feature_stats.R index e8026d6..caea5c9 100644 --- a/workflow/scripts/feature_stats.R +++ b/workflow/scripts/feature_stats.R @@ -20,6 +20,7 @@ suppressWarnings({ # import genome sequence messages <- c("importing genome sequence and annotation") +samplesheet <- snakemake@input[["samples"]] genome_fasta <- snakemake@input[["fasta"]] genome_gff_rna <- snakemake@input[["gff_rna"]] genome_gff_cds <- snakemake@input[["gff_cds"]] @@ -105,6 +106,18 @@ df_stats <- compute_features( ) %>% as_tibble() +# add sample and replicate information to feature table +df_stats <- mutate(df_stats, sample_name = str_remove(sample_name, "\\.bam$")) +df_samples <- read_tsv(samplesheet, show_col_types = FALSE) %>% + dplyr::rename(sample_name = sample) %>% + dplyr::select(sample_name, condition, replicate) +df_stats <- left_join( + df_stats, df_samples, + by = "sample_name" +) +df_stats <- df_stats %>% + relocate(seqnames, group_name, sample_name, condition, replicate) + # EXPORT RESULTS # ------------------------------