PLOT_EXPLORATORY module breaks with large number of samples. #214

kai-lawsonmcdowall · 2023-12-05T12:44:41Z

Description of the bug

When running a large number of samples (~700 in the samplesheet) through the differentialabundance pipeline, the PLOT_EXPLORATORY module will generate an OOM error, even when scaling up RAM usage to 128gb via AWS Batch. The R App generates fine, indicating it isn't necessarily an issue with the inputs.

Jonathan Manning has suggested it may lie in the density function of the boxplot.R specifically.

gplot_densityplot <- function(plotmatrices, experiment, colorby = NULL, palette = NULL, expressiontype = "expression", base_size = 16, palette_name = 'Set1', annotate_samples = FALSE, should_log = NULL) {
  plotdata <- ggplotify(plotmatrices, experiment, colorby, value_type = "density", annotate_samples = annotate_samples, should_log = should_log)
  if (is.null(palette)) {
    ncats <- length(unique(plotdata$colorby))
    palette <- makeColorScale(ncats, palette = palette_name)
  }

  p <- ggplot(data = plotdata) +
    geom_area(aes(x = value, y = density, fill = colorby, color = colorby), alpha = 0.4) +
    scale_fill_manual(name = prettifyVariablename(colorby), values = palette) +
    scale_color_manual(name = prettifyVariablename(colorby), values = palette) +
    ylab("Density") +
    xlab(splitStringToFixedwidthLines(paste0(
      "log2(",
      prettifyVariablename(expressiontype), ")"
    ), 15)) +
    guides(fill = guide_legend(title = prettifyVariablename(colorby))) +
    theme(legend.position = "bottom")

  if (is.list(plotmatrices) && length(plotmatrices) > 1) {
    p <- p + facet_wrap(~type, ncol = 1, scales = "free_y")
  }

  p + theme_bw(base_size = base_size) + theme(
    legend.position = "bottom"
  )
}

Command used and terminal output

I ran the following command from the terminal:


/home/ec2-user/nextflow run /home/ec2-user/differentialabundance/main.nf \
-profile docker \
-bucket-dir  path/to/s3/folder/differential_abundance/work/ \
--outdir  path/to/s3/folder/differential_abundance/results/ \
--input path/to/s3/folder/differential_abundance/inputs/IPSC_21_all_cpd_samples_samplesheet_mod_for_diff_abundance.csv \
--matrix  path/to/s3/folder/differential_abundance/inputs/salmon.merged.gene_counts_scaled.tsv \
--contrasts path/to/s3/folder/differential_abundance/inputs/IPSC_21_all_cpd_samples_contrasts.csv \
-c /home/ec2-user/awsbatch_5_retries.config \
-resume

The following error occurred:

ERROR ~ Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:PLOT_EXPLORATORY (Treatment)'

Caused by:
Essential container in task exited - OutOfMemoryError: Container killed due to memory usage

Command executed:

exploratory_plots.R
--sample_metadata "IPSC_21_all_cpd_samples_samplesheet_mod_for_diff_abundance.sample_metadata.tsv"
--feature_metadata "matrix_as_anno.feature_metadata.tsv"
--assay_files "salmon.merged.gene_counts_length_scaled.assay.tsv,all.normalised_counts.tsv,all.vst.tsv"
--contrast_variable "Treatment"
--outdir "Treatment"
--sample_id_col "sample" --feature_id_col "gene_id" --assay_names "raw,normalised,variance_stabilised" --final_assay "variance_stabilised" --outlier_mad_threshold -5 --palette_name "Set1"

cat <<-END_VERSIONS > versions.yml
"NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:PLOT_EXPLORATORY":
r-base: $(echo $(R --version 2>&1) | sed 's/^.R version //; s/ .$//')
r-shinyngs: $(Rscript -e "library(shinyngs); cat(as.character(packageVersion('shinyngs')))")
END_VERSIONS

Command exit status:
137

Command output:
[1] "Reading inputs..."
[1] "Creating output paths..."
[1] "Writing boxplots..."
[1] "... static"
null device
1
[1] "Writing density plots..."
[1] "... static"

Command error:
(more omitted..)
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Attaching package: 'shinyngs'
The following object is masked from 'package:MatrixGenerics':
colMedians
The following object is masked from 'package:matrixStats':
colMedians
[1] "Reading inputs..."
[1] "Creating output paths..."
[1] "Writing boxplots..."
[1] "... static"
null device
1
[1] "Writing density plots..."
[1] "... static"
.command.sh: line 8: 223 Killed



### Relevant files

I launched my pipeline from `home/ec2-user/differentalabundance`. Unfortunately, I cannot provide full files due to the proprietary nature of the data. However, the matrix file is a `salmon.merged.gene_counts_scaled.tsv file`, which is 200mb, 770 columns and 29,745 rows. of the format:

gene_id	gene_name	X1A_21_A1_GW_IPSC_CNS_4U_24h_2021	X1A_21_A10_GW_IPSC_CNS_4U_24h_2021
A1BG	A1BG	30.7871312864341	83.588759118784
A1BG-AS1	A1BG-AS1	0	2.57252569758649
A1CF	A1CF	0	0
A2M	A2M	1.56513798198313	3.11160444362184
A2M-AS1	A2M-AS1	0	0.554415060204014
A2ML1	A2ML1	0	0
A2MP1	A2MP1	0	0
A3GALT2	A3GALT2	0	0
A4GALT	A4GALT	0	0
A4GNT	A4GNT	0	0
AA06	AA06	0	0
AAAS	AAAS	9.82407546152402	33.6596479800078
AACS	AACS	3.6887964454676	7.56602734382506


My contrast file contains the following:

id	variable	reference	target
GWP42003_1_vs_GWP42003_2	Treatment	GWP42003_1	GWP42003_2
DMSO_vs_GWP42003_1	Treatment	DMSO	GWP42003_1
DMSO_vs_GWP42003_2	Treatment	DMSO	GWP42003_2

For the sake of running the pipeline, the sample sheet looks something like the below, except with a total of 769 rows (samples) 48 of which are GWP42003_2, 48 GWP42003_2, and 64 DMSO, leading to 160 total samples that the contrast file should pick up

sample	fastq_1	strandedness	Plate	Well	Treatment	Concentration	TimePoint	CellsWell	Study	Platform	Protocol	Paired	ConcentrationUnit	StudyType	Status	Excluded	Duration
1A_21_A1_GW_IPSC_CNS_4U_24h_2021	s3path/to/fastq/A1.fastq.gz	auto	1A_21	A1	BUFFER	0	24h	10k	GW_IPSC_CNS_4U_24h_2021	plateRNA-Seq	plateRNA-Seq	Y	uM	Invitro	Analyzed	N	21
1A_21_A10_GW_IPSC_CNS_4U_24h_2021	s3path/to/fastq/A10.fastq.gz	auto	1A_21	A10	GWP42003_1	0.1	24h	10k	GW_IPSC_CNS_4U_24h_2021	plateRNA-Seq	plateRNA-Seq	Y	uM	Invitro	Analyzed	N	21
1A_21_A11_GW_IPSC_CNS_4U_24h_2021	s3path/to/fastq/A11.fastq.gz	auto	1A_21	A11	GWP42003_2	0.1	24h	10k	GW_IPSC_CNS_4U_24h_2021	plateRNA-Seq	plateRNA-Seq	Y	uM	Invitro	Analyzed	N	21


### System information

Running via AWS EC2 instance using CentOS
Running via AWS Batch. 
using the Docker Profile 
NextFlow version 23.10.0, build 5889 
differential abundance v1.4.0

The text was updated successfully, but these errors were encountered:

pinin4fjords · 2024-01-08T09:23:04Z

As part of #221 I reduced the granularity of density calculation and made a couple of other tweaks that might make a small difference.

Fundamentally though, this pipeline is not optimised for large sample numbers, and we need to think what that might look like. We probably need to make some plots/ report components conditional on smaller sample numbers, or use different summary statistics across sample groups.

This won't be fixed imminently, but we'll keep it on the TODO list.

kai-lawsonmcdowall added the bug Something isn't working label Dec 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PLOT_EXPLORATORY module breaks with large number of samples. #214

PLOT_EXPLORATORY module breaks with large number of samples. #214

kai-lawsonmcdowall commented Dec 5, 2023 •

edited

pinin4fjords commented Jan 8, 2024

PLOT_EXPLORATORY module breaks with large number of samples. #214

PLOT_EXPLORATORY module breaks with large number of samples. #214

Comments

kai-lawsonmcdowall commented Dec 5, 2023 • edited

Description of the bug

Command used and terminal output

pinin4fjords commented Jan 8, 2024

kai-lawsonmcdowall commented Dec 5, 2023 •

edited