Make the pipeline run with different file sizes #108

Open
szymonwieloch opened this issue Feb 4, 2021 · 1 comment
Labels
enhancement New feature or request
Milestone

Comments

@szymonwieloch

szymonwieloch commented Feb 4, 2021

Hi! I have a problem running this pipeline. It seems to choose memory requirements for input files incorrectly, which is especially problematic with very big files. My biggest file during tests was 16 GB, but in the future we may have much bigger ones. A file of that size requires 256 GB of memory for the run_optitype process.

My issue is that, by default, the hlatyping pipeline does not let you handle such big files. The only workaround I found was to create an additional configuration file, extra.config, and pass it to Nextflow with the -c parameter to override the default configuration. My expectation is that the pipeline should let you process your data using command line parameters alone. This didn't work because:

1. Problems with setting maxRetries

For some strange reason, when I tried to increase retries with -process.maxRetries 5, it didn't work and the default value of 1 was used. When I set maxRetries = 5 in the extra.config file, I again saw only 2 retries. All failing processes finished with exit code 137 and should have been retried 5 times with increasing memory. I am not sure whether this is a problem with this pipeline or with Nextflow; however, I haven't experienced such problems with other pipelines.
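For reference, the kind of extra.config override I mean looks roughly like this (a sketch only; the memory value and retry count are placeholders, not the pipeline's defaults):

```groovy
// extra.config -- passed to Nextflow with the -c parameter
process {
    withName: 'run_optitype' {
        // retry on out-of-memory failures (exit code 137), otherwise stop
        errorStrategy = { task.exitStatus == 137 ? 'retry' : 'finish' }
        maxRetries    = 5
        // start higher than the 7 GB default so fewer retries are needed
        memory        = { 64.GB * task.attempt }
    }
}
```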

2. Slow memory adaptation mechanism

The current memory adaptation mechanism is extremely slow:

memory = { check_max( 7.GB * task.attempt, 'memory' ) }

To reach the required 256 GB of RAM for my samples would take 37 retries; to process a 50 GB sample file, around 116. There are two good approaches to fix that:

A. Change the algorithm to exponential adaptation:

memory = { 8.GB * (2 ** (task.attempt - 1)) }

This would only require 6 retries for a 16 GB file and 8 retries for a 50 GB file, and wouldn't cause huge resource overhead.
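In the pipeline's own base config this could look roughly as follows (a sketch, assuming the existing check_max helper stays in place so the configured maximum memory is still respected):

```groovy
process {
    withName: 'run_optitype' {
        errorStrategy = 'retry'
        maxRetries    = 6
        // 8 GB, 16 GB, 32 GB, ... doubling on every attempt, capped by check_max
        memory        = { check_max( 8.GB * (2 ** (task.attempt - 1)), 'memory' ) }
    }
}
```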

B. Calculate memory requirement from the input file size.

The task object should give you access to the input files. This lets you check the sample size and calculate the amount of required memory. I suspect there is a linear relation between the input file size and the actual memory requirement, so a simple linear equation should give you a precise amount of memory for a given sample. This approach requires more work: obtaining real memory usage for several samples and checking the actual relationship, but eventually no retries would be needed.
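A rough sketch of what option B could look like inside the process definition (the 16x linear factor and the 8 GB floor are hypothetical placeholders that would have to be calibrated against measured memory usage):

```groovy
process run_optitype {
    // Dynamic directive: estimate memory from the size of the input file.
    // Sketch only -- the 16x coefficient and the 8 GB floor are assumptions,
    // not measured values.
    memory {
        def inputGb  = reads.size() / 1e9                            // input size in GB
        def neededGb = Math.max( 8L, Math.ceil( 16 * inputGb ) as long )
        1.GB * neededGb
    }

    input:
    path reads

    script:
    """
    echo "would run OptiType on $reads with ${task.memory}"
    """
}
```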

@christopher-mohr
Collaborator

Hi @szymonwieloch, thanks for reporting and providing detailed information on this. We will check this and get back to you.

@christopher-mohr christopher-mohr added the enhancement New feature or request label Feb 4, 2021
@christopher-mohr christopher-mohr added this to the 1.1.3 milestone Feb 4, 2021
@apeltzer apeltzer modified the milestones: 1.1.3, 2.0 Mar 17, 2021
@christopher-mohr christopher-mohr modified the milestones: 2.0, 2.1 Oct 17, 2022