
Parallelize analysis sections #400

Closed
adamcantor22 opened this issue May 18, 2022 · 3 comments · May be fixed by #472
Labels: AnalysisTools, Server (Issue relates to the server)

@adamcantor22
Member

Is your feature request related to a problem? Please describe.
The current solution for multiple demux/denoise runs per analysis is to run them in series. This is quite inefficient for larger studies, and such analyses may need to be submitted to `-q long` in order to run successfully.

Describe the solution you'd like
These should be able to run in parallel, probably using a qiime1-esque solution in which the expected output files are polled for. Once all the output files exist, the main job can be started to merge all of the sub-components. This solution should be general enough that it could potentially be reused for other parallelization (e.g. a new ANCOM implementation, #386).
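
A rough sketch of that polling idea (the file names and the merge step are hypothetical, not part of our code):

```python
# Hypothetical sketch: worker jobs are submitted elsewhere; the main job just
# polls for their expected outputs and only merges once every file exists.
import time
from pathlib import Path

def wait_for_outputs(expected_files, poll_seconds=60, timeout_seconds=86400):
    """Block until every expected output file exists, or raise on timeout."""
    deadline = time.monotonic() + timeout_seconds
    missing = {Path(f) for f in expected_files}
    while missing:
        missing = {f for f in missing if not f.exists()}
        if not missing:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"still missing: {sorted(str(f) for f in missing)}")
        time.sleep(poll_seconds)

# e.g. one denoised feature table per sequencing run, then a merge step:
# wait_for_outputs([f"run_{i}/table.qza" for i in range(n_runs)])
# merge_outputs(...)  # hypothetical merge of the per-run sub-components
```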

Describe alternatives you've considered
We briefly discussed multi-threading, but quickly dismissed it because it would add considerable complexity to our code.

@adamcantor22 adamcantor22 added the Server (Issue relates to the server) and AnalysisTools labels May 18, 2022
@adamcantor22 adamcantor22 added this to the 0.9.0 milestone May 18, 2022
@cleme
Member

cleme commented May 18, 2022

Q1 used to have a solution along these lines: a main job is submitted that spawns worker sub-jobs to do the computation, and the main job waits until all of the output files have been created. Details here:

https://github.com/biocore/qiime/tree/master/qiime/parallel

poller.py and util.py have most of the functionality that we would require. This solution is not ideal: when worker jobs do not complete, there is no way for the main job to "know" that the files will never be created, so it keeps waiting until it hits walltime. It might be worth reviewing how Q2 implements parallelization.
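
One way around the hang-until-walltime failure mode would be to have each worker drop a sentinel file on exit, success or failure, so the poller can abort early. A hypothetical sketch, not how Q1's poller.py actually works:

```python
# Hypothetical: each worker writes <name>.done on success or <name>.failed on
# failure as its last step, so the main job can stop waiting as soon as any
# worker reports failure instead of sitting there until walltime.
import time
from pathlib import Path

def wait_for_workers(worker_names, workdir=".", poll_seconds=60):
    pending = set(worker_names)
    while pending:
        for name in list(pending):
            if (Path(workdir) / f"{name}.failed").exists():
                raise RuntimeError(f"worker {name} failed; aborting merge step")
            if (Path(workdir) / f"{name}.done").exists():
                pending.discard(name)
        if pending:
            time.sleep(poll_seconds)
```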

@cleme cleme modified the milestones: 0.9.0, 1.0.0 May 18, 2022
@adamcantor22 adamcantor22 modified the milestones: 0.10.0, 0.12.0 Oct 12, 2022
@adamcantor22
Member Author

adamcantor22 commented Dec 1, 2023

While "full" parallelization is a challenging issue, there are a number of simple changes we could make to parallelize individual sections: differential abundance testing, taxa summarizing, and, most importantly, demux/denoising. When a study has many sequencing runs, this step is far more serialized than it needs to be: each run imports its fastqs into a qiime artifact, demuxes, and denoises sequentially before the next run starts. These individual steps can safely be run in parallel across all runs, i.e. all fastq imports run in parallel, then all demuxes, then all denoises. This would speed these analyses up significantly. It may be challenging to do this for runs of different types (e.g. single vs dual barcodes), but at minimum it can be applied to runs of the same type.
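
A minimal sketch of that staged fan-out, assuming each stage can be wrapped as a per-run Python function (the function names and the `runs` structure are placeholders, not the server's actual code):

```python
from concurrent.futures import ProcessPoolExecutor

def import_run(run): ...   # placeholder: import this run's fastqs into a qiime artifact
def demux_run(run): ...    # placeholder: demux this run
def denoise_run(run): ...  # placeholder: denoise this run

def run_stage(stage_fn, runs, max_workers=4):
    # Fan one stage out across every run; consuming the results re-raises any
    # worker exception, so a later stage never starts if this one had a failure.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(stage_fn, runs))

def process_all_runs(runs):
    # All imports finish before any demux starts, and all demuxes before any denoise.
    for stage in (import_run, demux_run, denoise_run):
        run_stage(stage, runs)
```

For mixed run types, the runs could be grouped by type first and the same helper applied per group.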

@adamcantor22
Member Author

Superseded by snakemake, which has this functionality: #457
