Profile ncov-ingest #240

Closed
joverlee521 opened this issue Nov 30, 2021 · 3 comments · Fixed by #451
Labels
enhancement New feature or request

Comments

@joverlee521
Contributor

Currently, GISAID ingest takes ~4 hours. We should profile the pipeline to figure out whether improvements can be made without a major overhaul (i.e., incremental ingest with caches or a database).

  • Once the pipeline has been converted to Snakemake via Snakemake pipeline #231, we can use Snakemake's `benchmark`, `--stats`, and/or `--report` to get an overview of the pipeline. We should upload the outputs to S3 to keep a record of changes over time.
  • @tsibley suggested using Python profilers such as py-spy to inspect specific scripts that are pain points in the pipeline.
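As a rough sketch of how the `benchmark` outputs could be turned into an overview: the snippet below assumes each rule writes its benchmark TSV to a hypothetical `benchmarks/<rule>.txt` path (that layout and the function name `summarize_benchmarks` are illustrative, not something ncov-ingest defines). It relies only on the `s` (wall-clock seconds) column that Snakemake's benchmark files contain.

```python
import csv
from pathlib import Path


def summarize_benchmarks(benchmark_dir):
    """Return (rule_name, total_seconds) pairs, slowest first.

    Assumes one benchmark TSV per rule named <rule>.txt (a hypothetical
    layout) with the "s" column written by Snakemake's `benchmark:` directive.
    """
    totals = {}
    for path in sorted(Path(benchmark_dir).glob("*.txt")):
        with path.open(newline="") as handle:
            rows = csv.DictReader(handle, delimiter="\t")
            # A benchmark file can hold one row per repeat; sum them all.
            totals[path.stem] = sum(float(row["s"]) for row in rows)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Example use (directory name is an assumption):
# for rule, seconds in summarize_benchmarks("benchmarks"):
#     print(f"{rule}\t{seconds / 60:.1f} min")
```

The sorted list points at the rules worth inspecting more closely with a profiler such as py-spy, and the same summary could be uploaded to S3 alongside the raw files.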
@joverlee521 joverlee521 added the enhancement New feature or request label Nov 30, 2021
@jameshadfield
Member

Another aspect of profiling is the removal / storage of (large) files. This applies both to storage space while running and to the behavior of `nextstrain build --aws-batch`, which will zip and upload the current working directory to one of our S3 buckets.

As of 45dcea6, we use Snakemake's `temp()` and the `./bin/clean` script (which is controlled by a config argument).

@jameshadfield
Member

PR #231 has been merged and I've just started a run with this GitHub action.

@joverlee521
Contributor Author

Using `--report` depends on nextstrain/docker-base#219.

joverlee521 added a commit that referenced this issue Jun 17, 2024
Adding as part of #240 to help collect more data for tackling #446.

One unexpected behavior that I ran into when testing the `--stats`
option is that Snakemake doesn't generate the stats file if the
workflow exits with an error at any step.

Note that the Snakemake `--stats` option is not available starting with
Snakemake v8, so this will need to be removed when we eventually
upgrade Snakemake in our runtimes.
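
Because the stats file may be absent after a failed run, any script that consumes it probably needs to tolerate that. A minimal sketch, assuming the pre-v8 `--stats` JSON layout with a top-level `rules` mapping whose entries carry a `mean-runtime` value (the function name and the file path in the usage comment are hypothetical):

```python
import json
from pathlib import Path


def slowest_rules(stats_path, limit=5):
    """Return up to `limit` (rule, mean_runtime_seconds) pairs, slowest first.

    Returns None when the stats file does not exist, which can happen
    if the workflow exited with an error before Snakemake wrote it.
    """
    path = Path(stats_path)
    if not path.exists():
        return None
    stats = json.loads(path.read_text())
    # Assumed layout: {"rules": {"<rule>": {"mean-runtime": <seconds>, ...}}}
    runtimes = {
        rule: values["mean-runtime"]
        for rule, values in stats.get("rules", {}).items()
    }
    ranked = sorted(runtimes.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Example use (path is an assumption):
# print(slowest_rules("snakemake_stats.json"))
```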
joverlee521 added a commit that referenced this issue Jun 17, 2024
Adding as part of #240 to help collect more data for tackling #446.
Status: Prioritized

2 participants