Draft: CLI, Docker, and More Download Types #56

Open
wants to merge 41 commits into
base: master


@ktmeaton ktmeaton commented Sep 4, 2024

Hi @Wytamma,

I'm adapting your package for a Nextflow pipeline, with a priority focus on mpox. I implemented several new features that I wanted to propose to you! Broadly speaking, this pull request adds:

  1. A Command-Line Interface
  2. A Docker Image.
  3. More Download Types, such as Sequencing Technology metadata.
  4. Additional Continuous Integration tests and builds.

Continuous Integration for the latest commit 49fe926: https://github.com/ktmeaton/GISAIDR/actions/runs/10688769327

Issues Resolved

I think this pull request will resolve, or at least address, the following issues:

Changes

  1. A command-line interface: bin/GISAIDR. See CLI Usage section.

    • Provides querying, filtering, and downloading with customized batch sizes.
    • Windows users can either directly call bin/GISAIDR.bat or Rscript bin/GISAIDR.
    • To the best of my knowledge, I have implemented all your API features as of v0.9.10.
  2. New R dependency optparse.

    • Required to parse command-line arguments in an elegant manner.
  3. New function download_files in R/download.R.

    • Generalizes the original download function to work on any of GISAID's provided data files including: Augur Input, Dates and Location, Patient Status, Sequencing Technology, and Sequences.
    • Left the original download function mostly intact for backwards compatibility (only a slight logging change was added).
  4. New Download Types.

    • Augur Input (EpiCoV Only)
    • Dates and Location
    • Patient Status
    • Sequencing Technology
    • Sequences
  5. New functions log.info, log.error, log.warn in R/core.R

    • Modelled after the log.debug function. log.info also allows for different verbosity levels.
    • Setting --verbosity 2 is helpful for users who want extra logging output, without invoking the full --debug logs of all the HTTP requests.
  6. New parameter subtype for R/query.R

    • Allows filtering on subtype ("A", "B") for EpiRSV.
  7. Docker image via Dockerfile

    • Builds a docker image that includes the GISAIDR R package and CLI.
    • The image temporarily built on my fork is ktmeaton/gisaidr:cli
  8. Tests

    • Added query tests to cover unique aspects of EpiPox and EpiRSV (such as how complete and high_quality work a bit differently).
    • Added download tests to cover the new download_files function.
  9. Continuous Integration

    • Added ubuntu-latest as an operating system for the Build workflow.
    • Renamed the build job to test to reflect that the steps primarily test for errors.
    • Added a new job docker. Builds the R package and CLI into a Docker image. If the branch is master or starts with a v (ex. v0.10.0), the CI job will also push the image to container registries. By default, this will just be the GitHub package registry (ex. https://github.com/ktmeaton/GISAIDR/pkgs/container/gisaidr). But if the repository has defined the secrets DOCKER_USERNAME and DOCKER_PASSWORD, it will also push to DockerHub (ex. https://hub.docker.com/r/ktmeaton/gisaidr/tags). The master branch will update the latest image tag.
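The push rules for the docker job can be sketched as a small shell script. This is only an illustration of the decision logic described above, not the workflow file itself: the function names should_push and target_registries are hypothetical, and the registry hosts ghcr.io and docker.io are assumed from the example URLs.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed when a branch or tag name should trigger a push.
# The rule from the description: "master", or any name starting with "v".
should_push() {
  case "$1" in
    master|v*) return 0 ;;
    *)         return 1 ;;
  esac
}

# Hypothetical helper: print the registries to push to.
# The GitHub registry (assumed to be ghcr.io) is always included;
# DockerHub (docker.io) is added only when both secrets are non-empty.
target_registries() {
  local registries="ghcr.io"
  if [ -n "${DOCKER_USERNAME:-}" ] && [ -n "${DOCKER_PASSWORD:-}" ]; then
    registries="$registries docker.io"
  fi
  echo "$registries"
}

# Walk through three example refs: the default branch, a version, and a feature branch.
for ref in master v0.10.0 cli; do
  if should_push "$ref"; then
    echo "$ref: push to $(target_registries)"
  else
    echo "$ref: build only"
  fi
done
```

Note that a plain "starts with v" pattern also matches any other branch whose name begins with v, so a stricter version pattern may be preferable in practice.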

CLI Usage

Setup

  1. Create a credentials.yml file.

    GISAIDR_USERNAME: "yourUsername"
    GISAIDR_PASSWORD: "yourPassword"

    Technically, these can still come from environment variables. But passing env variables to Docker and Singularity without exposing them is tricky. So I've added the option to store them in a file.

  2. Save some test accessions.

    echo -e "EPI_ISL_19361895\nEPI_ISL_19361894\nEPI_ISL_19361893" > epicov_accessions.txt
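For users who already keep their credentials in environment variables, the credentials file from step 1 could be generated rather than typed by hand. The sketch below is one way to do that. The helper name write_credentials is hypothetical, and the restrictive umask is an added precaution that is not part of this pull request.

```shell
#!/usr/bin/env bash

# Hypothetical helper: write GISAID credentials from the environment to a YAML file.
# Usage: write_credentials [output_path]   (defaults to credentials.yml)
write_credentials() {
  local out="${1:-credentials.yml}"
  if [ -z "${GISAIDR_USERNAME:-}" ] || [ -z "${GISAIDR_PASSWORD:-}" ]; then
    echo "GISAIDR_USERNAME and GISAIDR_PASSWORD must be set" >&2
    return 1
  fi
  # Run in a subshell so the restrictive umask does not leak into the caller.
  # With umask 077 a newly created file is readable by its owner only;
  # the permissions of an already existing file are left unchanged.
  (
    umask 077
    printf 'GISAIDR_USERNAME: "%s"\nGISAIDR_PASSWORD: "%s"\n' \
      "$GISAIDR_USERNAME" "$GISAIDR_PASSWORD" > "$out"
  )
}
```

Calling write_credentials credentials.yml with both variables set produces the two-line file shown in step 1. Values containing double quotes or backslashes are not escaped, so this sketch only suits simple usernames and passwords.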

Local

  1. Install the package to add the dependency optparse.

    Rscript -e "devtools::install('.')"
  2. Preview usage.

    # Linux, Mac
    bin/GISAIDR --help
    
    # Windows
    bin/GISAIDR.bat --help
  3. Download EpiCoV test data based on accessions.

    bin/GISAIDR \
      --credentials credentials.yml \
      --database EpiCoV \
      --prefix EpiCoV \
      --accessions epicov_accessions.txt \
      --sequences \
      --augur-input \
      --dates-and-location \
      --patient-status \
      --sequencing-technology
    • This example demonstrates all the new download options.
    • Outputs EpiCoV.sequences.fasta and EpiCoV.metadata.tsv. The latter is a join of all the different tables (ex. Dates and Location, Patient Status, etc.).
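After a download, a quick sanity check is to compare the number of requested accessions with the number of FASTA records and metadata rows. The sketch below assumes the output naming shown above (prefix.sequences.fasta and prefix.metadata.tsv). It also assumes that the metadata file has a single header line and one row per accession, neither of which is confirmed here. All function names are hypothetical.

```shell
#!/usr/bin/env bash

# Count FASTA records (header lines start with ">").
count_fasta_records() {
  grep -c '^>' "$1" || true
}

# Count metadata rows, skipping one header line (assumed layout).
count_metadata_rows() {
  tail -n +2 "$1" | grep -c . || true
}

# Count non-empty lines in the accession list.
count_accessions() {
  grep -c . "$1" || true
}

# Hypothetical helper: print the three counts and succeed only if they agree.
# Usage: check_outputs PREFIX ACCESSIONS_FILE
check_outputs() {
  local prefix="$1" accessions="$2"
  local expected seqs rows
  expected="$(count_accessions "$accessions")"
  seqs="$(count_fasta_records "$prefix.sequences.fasta")"
  rows="$(count_metadata_rows "$prefix.metadata.tsv")"
  echo "requested=$expected sequences=$seqs metadata_rows=$rows"
  [ "$seqs" -eq "$expected" ] && [ "$rows" -eq "$expected" ]
}
```

For the example above this would be called as check_outputs EpiCoV epicov_accessions.txt. A mismatch is not necessarily an error, since an accession may be unavailable, but it is a useful prompt to look at the log.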

Docker

  1. Preview usage.

    docker run -v "$(pwd)":/tmp ktmeaton/gisaidr:cli GISAIDR --help
  2. Download EpiPox test data based on a query.

    docker run -v "$(pwd)":/tmp ktmeaton/gisaidr:cli GISAIDR \
      --credentials credentials.yml \
      --database EpiPox \
      --prefix EpiPox \
      --dates-and-location \
      --location Canada \
      --from-subm 2024-04-01 --to-subm 2024-06-01 \
      --max-records 3
    • The docker container starts in its own /tmp directory.
    • Mount the current working directory to /tmp so that GISAIDR can access our credentials.yml and write output metadata and sequences to it.
    • This example ONLY downloads the Dates and Location metadata to EpiPox.metadata.tsv. No sequences will be downloaded because --sequences was not requested.
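Because only the current working directory is mounted, a forgotten credentials.yml leads to a failure inside the container. A small wrapper can catch that earlier. The sketch below is an assumption-laden convenience, not part of this pull request: run_gisaidr and the GISAIDR_DOCKER override are hypothetical names, the credentials file name is taken from the examples above, and --rm is added only to remove the container after it exits.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: run the GISAIDR CLI in Docker from the current directory.
# GISAIDR_DOCKER (hypothetical) lets the docker binary be swapped, e.g. for podman
# or for "echo" to preview the command without running it.
run_gisaidr() {
  if [ ! -f credentials.yml ]; then
    echo "credentials.yml not found in $PWD (only this directory is mounted)" >&2
    return 1
  fi
  # The mount source is quoted so that paths containing spaces survive.
  "${GISAIDR_DOCKER:-docker}" run --rm -v "$PWD":/tmp ktmeaton/gisaidr:cli GISAIDR "$@"
}
```

With this in place, run_gisaidr --help is equivalent to the preview command in step 1, and GISAIDR_DOCKER=echo run_gisaidr --help prints the docker arguments instead of executing them.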

Singularity

  1. Pull image.

    singularity pull docker://ktmeaton/gisaidr:cli
    • The pull generates a lot of "warn rootless" messages, which come from the micromamba base image.
  2. Preview usage.

    singularity run docker://ktmeaton/gisaidr:cli GISAIDR --help
  3. Download test data from EpiRSV.

    singularity run -B "$(pwd)":/tmp --pwd /tmp docker://ktmeaton/gisaidr:cli GISAIDR \
      --credentials credentials.yml \
      --database EpiRSV \
      --prefix EpiRSV \
      --patient-status \
      --location "South America / Chile / Region Metropolitana de Santiago" \
      --from 2024-07-01 --to 2024-08-01 \
      --max-records 5
    • Set the singularity image to start in the /tmp directory to match Docker.
    • Mount the current working directory to /tmp so that GISAIDR can access our credentials.yml and write output metadata and sequences to it.

Nextflow

If you want to use the CLI in Nextflow with a singularity runtime, the executable command needs special handling, because the micromamba base image integrates with singularity in an odd way.

process GISAIDR {
    tag "$db"
    label 'process_single'
    label 'GISAIDR'

    publishDir path: "${params.outdir}/gisaidr", mode: "copy", overwrite: true

    container "docker.io/ktmeaton/gisaidr:cli"

    input:
    val(db)
    path(credentials)

    output:
    path("*metadata.tsv")   , emit: metadata
    path("*sequences.fasta"), emit: sequences
    path("*.log")           , emit: log
    path("versions.yml")    , emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args   = task.ext.args ?: ''
    // Special handling of using executables based on a docker micromamba image
    // https://stackoverflow.com/a/78027234
    // https://micromamba-docker.readthedocs.io/en/latest/faq.html#how-can-i-use-a-mambaorg-micromamba-based-image-with-apptainer
    def run_cmd = workflow.containerEngine == 'singularity' || workflow.containerEngine == 'apptainer' ? '/usr/local/bin/_entrypoint.sh GISAIDR' : 'GISAIDR'
    """
    $run_cmd \\
      $args \\
      --database $db \\
      --credentials $credentials \\
      --sequences \\
      --patient-status \\
      2>&1 | tee GISAIDR.log

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        GISAIDR: \$( $run_cmd --version | cut -d " " -f 2 | sed 's/v//g')
    END_VERSIONS
    """
}
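The two less obvious pieces of the process, the engine-dependent command prefix and the version parsing, can be tried outside Nextflow. The sketch below mirrors the Groovy ternary and the cut/sed pipeline in plain shell. The function names are hypothetical, and the sample string "GISAIDR v0.9.10" is only an assumed shape for the --version output, which is not shown in this pull request.

```shell
#!/usr/bin/env bash

# Hypothetical helper: choose the command prefix for a given container engine,
# mirroring the run_cmd ternary in the Nextflow process.
gisaidr_run_cmd() {
  case "$1" in
    singularity|apptainer) echo "/usr/local/bin/_entrypoint.sh GISAIDR" ;;
    *)                     echo "GISAIDR" ;;
  esac
}

# Same pipeline as in versions.yml: take the second space-separated field
# and strip every "v" character from it.
parse_version() {
  cut -d " " -f 2 | sed 's/v//g'
}

echo "docker      -> $(gisaidr_run_cmd docker)"
echo "singularity -> $(gisaidr_run_cmd singularity)"

# Assumed example input; the real --version output may differ.
echo "GISAIDR v0.9.10" | parse_version   # prints 0.9.10
```

Since sed 's/v//g' removes every v, a version string with a suffix such as "dev" would lose its v as well; sed 's/^v//' would strip only a leading one.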

Temporary Changes

I have made the following temporary changes that should be reverted before merging:

  • Disabled the scheduled CI job that runs daily.
  • Enabled CI to run on all branches, not just master.
  • Enabled container registry push on the cli branch. Should be reverted to just master and version tagged releases v*.
  • Removed line wrapping in function calls in R/internal_query.R to make it easier to read. This might be a linting violation now though, so I can restore the original formatting.

@ktmeaton
Author

ktmeaton commented Sep 4, 2024

The continuous integration run for the latest commit (49fe926) is at: https://github.com/ktmeaton/GISAIDR/actions/runs/10688769327

The rcmdcheck step takes almost an hour on each operating system 😓 I'm not sure if that's a typical runtime, or if there's throttling based on credentials.

@Wytamma
Owner

Wytamma commented Sep 5, 2024

Wow! What an amazing contribution @ktmeaton! I'm at a conference today but will review this asap. You're a legend 🎉

@ktmeaton
Author

ktmeaton commented Sep 5, 2024

It's a lot of content, my battle with Nextflow got a little out of hand and I kept adding more 😪 but I hope there's a couple pieces here that you might also find useful!
