
Reduce memory usage of facets #162

Open

wants to merge 2 commits into master
Conversation


@natir natir commented Jun 21, 2021

Hi,

In my laboratory we run facets on many whole-genome human datasets, and on these data facets has a very large memory footprint, approximately 150 GiB.

The purpose of this PR is to try to reduce facets' memory usage. To do this I replace some classic R data.frame objects with the tidyverse tibble data structure, and I also use tidyverse pipe syntax to perform some of the operations on these tibbles (see the sketch below).

With all these changes, memory usage is roughly halved.
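A minimal sketch of the kind of change (not the exact PR diff; the column names follow the snp-pileup output and the filter is only illustrative):

library(readr)
library(dplyr)

# Before: base R reading; every later transformation copies the whole data.frame.
# pileup <- read.csv("pileup.csv.gz")

# After: read_csv() returns a tibble and the steps are piped, so the intermediate
# objects are never bound to a name and can be collected early.
pileup <- read_csv("pileup.csv.gz") %>%
    filter(File1R + File1A > 0)   # e.g. drop loci with no coverage in the normal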

On my test dataset the results are the same between my PR and version v0.6.1, but I may have missed something.

I'm not an experienced R developer and may have made some mistakes, so if you would rather just take the idea behind my changes and rewrite them yourself, please do.

Thanks

@veseshan
Collaborator

Can you give me some breakdown of where this memory explosion occurs? My back-of-the-envelope calculation says:

R:> x = rnorm(12e6) # one locus every 250 bases across 3000 Megabase
R:> format(object.size(x), units="Mb")
[1] "91.6 Mb"

The jointseg data frame has 16 columns, but even that wouldn't translate to 150 GiB of memory use.
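Even if all 16 jointseg columns were stored as doubles, that would still be well under 2 GB:

R:> x = matrix(rnorm(12e6 * 16), ncol=16) # 16 double columns at the same locus density
R:> format(object.size(x), units="Mb")
[1] "1464.8 Mb"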

Have you tried using the readSnpMatrixDT.R script in path/facets/extRfns/ to read in the data?
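Something along these lines (a sketch from memory; it assumes the script defines a readSnpMatrixDT() function that takes the pileup path and returns the read-count matrix via data.table::fread):

R:> library(facets); library(data.table)
R:> source("path/facets/extRfns/readSnpMatrixDT.R")
R:> rcmat = readSnpMatrixDT("pileup.csv.gz")  # assumed interface: pileup path in, count matrix out
R:> xx = preProcSample(rcmat)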

Thanks

@natir
Author

natir commented Jun 22, 2021

With v0.6.1 the memory peak occurs during file reading; using readSnpMatrixDT.R, as my change does, solves this issue.

But another peak occurs during preProcSample, more specifically in procSnps I assume (some duplication, column creation, the call into the Fortran code, and filtering that is not done in place).
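To illustrate the pattern I mean (a rough sketch with made-up data and column names, not the real procSnps code):

library(dplyr)

snps <- tibble(Position = 1:1e6, NOR.DP = rpois(1e6, 30))

# Copy-heavy pattern: each assignment materialises another full-size object.
tmp <- cbind(as.data.frame(snps), het = snps$NOR.DP > 25)   # duplication + new column
tmp <- tmp[tmp$NOR.DP >= 10, ]                               # filtering copies again

# Piped tibble version: intermediates are never bound to a name and can be
# garbage-collected as soon as the next step has consumed them.
snps <- snps %>%
    mutate(het = NOR.DP > 25) %>%
    filter(NOR.DP >= 10)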

With v0.6.1 plus readSnpMatrixDT.R memory usage is 85 GiB; my version uses 70 GiB.

@veseshan
Collaborator

Can you tell me how big the pileup matrix is, i.e. how many loci? And how many of them end up in jointseg? Thanks.

@natir
Copy link
Author

natir commented Jun 23, 2021

The pileup matrix contains 546,700,164 loci.

To count the jointseg loci I looked at $jointseg in the output produced by procSample; I get 5,583,831 loci.
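For scale, a single double-precision column at that size is already about 4 GiB (the same back-of-the-envelope calculation as above), so a dozen or more numeric columns puts the pileup alone in the 50-65 GiB range, which is roughly the ballpark of the numbers I measured:

R:> x = numeric(546700164)            # one double column per pileup locus
R:> format(object.size(x), units="Gb")
[1] "4.1 Gb"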

@veseshan
Collaborator

Given that the whole genome is around 3 Gigabase, the pileup seems to have a locus every 6 bases. That is a lot of redundant data as they will be highly serially correlated. You can DM me if you want to talk about this further.
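For example (just a sketch, not part of this PR; it assumes the Chromosome and Position columns of the read-count matrix), thinning the pileup to at most one locus per 250-base window would bring it back to roughly the 12 million loci used in the estimate above:

library(dplyr)

# Toy stand-in for the pileup read-count matrix (real data would come from readSnpMatrix).
rcmat <- tibble(Chromosome = "1", Position = sort(sample(3e6, 5e5)), NOR.DP = rpois(5e5, 30))

# Keep at most one locus per 250-base window per chromosome.
thinned <- rcmat %>%
    group_by(Chromosome, bin = Position %/% 250) %>%
    slice(1) %>%
    ungroup() %>%
    select(-bin)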

I will look into how your code can be used to reduce the memory use of procSnps.

Thanks
