Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PLOT_EXPLORATORY module breaks with large number of samples. #214

Open
kai-lawsonmcdowall opened this issue Dec 5, 2023 · 1 comment
Open
Labels
bug Something isn't working

Comments

@kai-lawsonmcdowall
Copy link

kai-lawsonmcdowall commented Dec 5, 2023

Description of the bug

When running a large number of samples (~700 in the samplesheet) through the differentialabundance pipeline, the PLOT_EXPLORATORY module will generate an OOM error, even when scaling up RAM usage to 128gb via AWS Batch. The R App generates fine, indicating it isn't necessarily an issue with the inputs.

Jonathan Manning has suggested it may lie in the density function of the boxplot.R specifically.

gplot_densityplot <- function(plotmatrices, experiment, colorby = NULL, palette = NULL, expressiontype = "expression", base_size = 16, palette_name = 'Set1', annotate_samples = FALSE, should_log = NULL) {
  plotdata <- ggplotify(plotmatrices, experiment, colorby, value_type = "density", annotate_samples = annotate_samples, should_log = should_log)
  if (is.null(palette)) {
    ncats <- length(unique(plotdata$colorby))
    palette <- makeColorScale(ncats, palette = palette_name)
  }

  p <- ggplot(data = plotdata) +
    geom_area(aes(x = value, y = density, fill = colorby, color = colorby), alpha = 0.4) +
    scale_fill_manual(name = prettifyVariablename(colorby), values = palette) +
    scale_color_manual(name = prettifyVariablename(colorby), values = palette) +
    ylab("Density") +
    xlab(splitStringToFixedwidthLines(paste0(
      "log2(",
      prettifyVariablename(expressiontype), ")"
    ), 15)) +
    guides(fill = guide_legend(title = prettifyVariablename(colorby))) +
    theme(legend.position = "bottom")

  if (is.list(plotmatrices) && length(plotmatrices) > 1) {
    p <- p + facet_wrap(~type, ncol = 1, scales = "free_y")
  }

  p + theme_bw(base_size = base_size) + theme(
    legend.position = "bottom"
  )
}

image_720

Command used and terminal output

I ran the following command from the terminal:


/home/ec2-user/nextflow run /home/ec2-user/differentialabundance/main.nf \
-profile docker \
-bucket-dir  path/to/s3/folder/differential_abundance/work/ \
--outdir  path/to/s3/folder/differential_abundance/results/ \
--input path/to/s3/folder/differential_abundance/inputs/IPSC_21_all_cpd_samples_samplesheet_mod_for_diff_abundance.csv \
--matrix  path/to/s3/folder/differential_abundance/inputs/salmon.merged.gene_counts_scaled.tsv \
--contrasts path/to/s3/folder/differential_abundance/inputs/IPSC_21_all_cpd_samples_contrasts.csv \
-c /home/ec2-user/awsbatch_5_retries.config \
-resume 

The following error occurred:

ERROR ~ Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:PLOT_EXPLORATORY (Treatment)'

Caused by:
Essential container in task exited - OutOfMemoryError: Container killed due to memory usage

Command executed:

exploratory_plots.R
--sample_metadata "IPSC_21_all_cpd_samples_samplesheet_mod_for_diff_abundance.sample_metadata.tsv"
--feature_metadata "matrix_as_anno.feature_metadata.tsv"
--assay_files "salmon.merged.gene_counts_length_scaled.assay.tsv,all.normalised_counts.tsv,all.vst.tsv"
--contrast_variable "Treatment"
--outdir "Treatment"
--sample_id_col "sample" --feature_id_col "gene_id" --assay_names "raw,normalised,variance_stabilised" --final_assay "variance_stabilised" --outlier_mad_threshold -5 --palette_name "Set1"

cat <<-END_VERSIONS > versions.yml
"NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:PLOT_EXPLORATORY":
r-base: $(echo $(R --version 2>&1) | sed 's/^.R version //; s/ .$//')
r-shinyngs: $(Rscript -e "library(shinyngs); cat(as.character(packageVersion('shinyngs')))")
END_VERSIONS

Command exit status:
137

Command output:
[1] "Reading inputs..."
[1] "Creating output paths..."
[1] "Writing boxplots..."
[1] "... static"
null device
1
[1] "Writing density plots..."
[1] "... static"

Command error:
(more omitted..)
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Attaching package: 'shinyngs'
The following object is masked from 'package:MatrixGenerics':
colMedians
The following object is masked from 'package:matrixStats':
colMedians
[1] "Reading inputs..."
[1] "Creating output paths..."
[1] "Writing boxplots..."
[1] "... static"
null device
1
[1] "Writing density plots..."
[1] "... static"
.command.sh: line 8: 223 Killed



### Relevant files

I launched my pipeline from `home/ec2-user/differentalabundance`. Unfortunately, I cannot provide full files due to the proprietary nature of the data. However, the matrix file is a `salmon.merged.gene_counts_scaled.tsv file`, which is 200mb, 770 columns and 29,745 rows. of the format:

gene_id gene_name X1A_21_A1_GW_IPSC_CNS_4U_24h_2021 X1A_21_A10_GW_IPSC_CNS_4U_24h_2021
A1BG A1BG 30.7871312864341 83.588759118784
A1BG-AS1 A1BG-AS1 0 2.57252569758649
A1CF A1CF 0 0
A2M A2M 1.56513798198313 3.11160444362184
A2M-AS1 A2M-AS1 0 0.554415060204014
A2ML1 A2ML1 0 0
A2MP1 A2MP1 0 0
A3GALT2 A3GALT2 0 0
A4GALT A4GALT 0 0
A4GNT A4GNT 0 0
AA06 AA06 0 0
AAAS AAAS 9.82407546152402 33.6596479800078
AACS AACS 3.6887964454676 7.56602734382506

My contrast file contains the following:

id variable reference target blocking
GWP42003_1_vs_GWP42003_2 Treatment GWP42003_1 GWP42003_2  
DMSO_vs_GWP42003_1 Treatment DMSO GWP42003_1  
DMSO_vs_GWP42003_2 Treatment DMSO GWP42003_2  
For the sake of running the pipeline, the sample sheet looks something like the below, except with a total of 769 rows (samples) 48 of which are GWP42003_2, 48 GWP42003_2, and 64 DMSO, leading to 160 total samples that the contrast file should pick up 

sample fastq_1 fastq_2 strandedness Plate Well Treatment Concentration TimePoint CellsWell Study Platform Protocol Paired ConcentrationUnit StudyType Status Excluded Duration
1A_21_A1_GW_IPSC_CNS_4U_24h_2021 s3path/to/fastq/A1.fastq.gz   auto 1A_21 A1 BUFFER 0 24h 10k GW_IPSC_CNS_4U_24h_2021 plateRNA-Seq plateRNA-Seq Y uM Invitro Analyzed N 21
1A_21_A10_GW_IPSC_CNS_4U_24h_2021 s3path/to/fastq/A10.fastq.gz   auto 1A_21 A10 GWP42003_1 0.1 24h 10k GW_IPSC_CNS_4U_24h_2021 plateRNA-Seq plateRNA-Seq Y uM Invitro Analyzed N 21
1A_21_A11_GW_IPSC_CNS_4U_24h_2021 s3path/to/fastq/A11.fastq.gz   auto 1A_21 A11 GWP42003_2 0.1 24h 10k GW_IPSC_CNS_4U_24h_2021 plateRNA-Seq plateRNA-Seq Y uM Invitro Analyzed N 21

### System information

Running via AWS EC2 instance using CentOS
Running via AWS Batch. 
using the Docker Profile 
NextFlow version 23.10.0, build 5889 
differential abundance v1.4.0
@kai-lawsonmcdowall kai-lawsonmcdowall added the bug Something isn't working label Dec 5, 2023
@pinin4fjords
Copy link
Member

As part of #221 I reduced the granularity of density calculation and made a couple of other tweaks that might make a small difference.

Fundamentally though, this pipeline is not optimised for large sample numbers, and we need to think what that might look like. We probably need to make some plots/ report components conditional on smaller sample numbers, or use different summary statistics across sample groups.

This won't be fixed imminently, but we'll keep it on the TODO list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants