Workflow for analyzing human PacBio whole genome sequencing (WGS) data using Workflow Description Language (WDL).
- Docker images used by this workflow are defined in the wdl-dockerfiles repo. Images are hosted in PacBio's quay.io repo.
- Common tasks that may be reused within or between workflows are defined in the wdl-common repo.
Starting in v2, this repo contains two related workflows. The singleton workflow is designed to analyze a single sample, while the family workflow is designed to analyze a family of related samples. With the exception of the joint calling tasks in the family workflow, both workflows make use of the same tasks, although their input and output structures differ.
The family workflow will be best for most use cases. The singleton workflow's input and output structures are relatively flat, which should improve compatibility with platforms like Terra.
Both workflows are designed to analyze human PacBio whole genome sequencing (WGS) data. The workflows are designed to be run on Azure, AWS HealthOmics, GCP, or HPC backends.
Workflow entrypoint:
This is an actively developed workflow with multiple versioned releases, and we make use of git submodules for common tasks that are shared by multiple workflows. There are two ways to ensure you are using a supported release of the workflow and ensure that the submodules are correctly initialized:
- Download the release zips directly from a supported release:
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.1/hifi-human-wgs-singleton.zip
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.1/hifi-human-wgs-family.zip
- Clone the repository and initialize the submodules:
git clone \
--depth 1 --branch v2.0.1 \
--recursive \
https://github.com/PacificBiosciences/HiFi-human-WGS-WDL.git
The most resource-heavy step in the workflow requires 64 cores and 256 GB of RAM. Ensure that the backend environment you're using has enough quota to run the workflow.
On some backends, you may be able to make use of a GPU to accelerate the DeepVariant step. The GPU is not required, but it can significantly speed up the workflow. If you have access to a GPU, you can set the gpu parameter to true in the inputs JSON file.
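As a minimal sketch of that last step, the Python snippet below sets the gpu flag on an inputs dictionary. It assumes the usual WDL convention that input keys are prefixed with the workflow name; the workflow name `humanwgs_singleton` and the `sample_id` key are hypothetical placeholders, so check the template inputs file for your backend for the real key names.

```python
import json


def enable_gpu(inputs: dict, workflow_name: str) -> dict:
    """Return a copy of the inputs with `<workflow_name>.gpu` set to True."""
    updated = dict(inputs)
    # Assumption: inputs are namespaced as "<workflow_name>.<input_name>".
    updated[f"{workflow_name}.gpu"] = True
    return updated


# Hypothetical workflow name and sample ID, for illustration only.
example_inputs = {"humanwgs_singleton.sample_id": "HG002"}
print(json.dumps(enable_gpu(example_inputs, "humanwgs_singleton"), indent=2))
```

The function copies the dictionary rather than mutating it, so the original template contents stay available for a CPU-only run.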
Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and template inputs files for each backend with paths to publicly hosted reference files filled out.
- Select a backend environment
- Configure a workflow execution engine in the chosen environment
- Fill out the inputs JSON file for your cohort
- Run the workflow
The workflow can be run on Azure, AWS, GCP, or HPC. Your choice of backend will largely be determined by the location of your data.
For backend-specific configuration, see the relevant documentation:
An execution engine is required to run workflows. Two popular engines for running WDL-based workflows are miniwdl and Cromwell.
Because workflow dependencies are containerized, a container runtime is required. This workflow has been tested with Docker and Singularity container runtimes.
See the backend-specific documentation for details on setting up an engine.
Engine | Azure | AWS | GCP | HPC |
---|---|---|---|---|
miniwdl | Unsupported | Supported via AWS HealthOmics | Unsupported | (SLURM only) Supported via the miniwdl-slurm plugin |
Cromwell | Supported via Cromwell on Azure | Unsupported | Supported via Google's Pipelines API | Supported - Configuration varies depending on HPC infrastructure |
The input to a workflow run is defined in JSON format. Template input files with reference dataset information filled out are available for each backend:
- HPC singleton entrypoint
- HPC family entrypoint
- AWS singleton entrypoint
- AWS family entrypoint
- Azure singleton entrypoint
- Azure family entrypoint
- GCP singleton entrypoint
- GCP family entrypoint
Using the appropriate inputs template file, fill in the cohort and sample information (see Workflow Inputs for more information on the input structure).
If using an HPC backend, you will need to download the reference bundle and replace the <local_path_prefix> in the input template file with the local path to the reference datasets on your HPC. If using AWS HealthOmics, you will need to download the reference bundle, upload it to your S3 bucket, and adjust paths accordingly.
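The placeholder substitution for HPC can be scripted. The sketch below is one possible approach, not part of the workflow itself: it replaces every occurrence of `<local_path_prefix>` in the text of a template file. The template line and the local path shown are hypothetical examples.

```python
PLACEHOLDER = "<local_path_prefix>"


def fill_local_path_prefix(template_text: str, local_path: str) -> str:
    """Replace every placeholder occurrence with the local reference path."""
    # Drop a trailing slash so "<local_path_prefix>/x" does not become "//x".
    return template_text.replace(PLACEHOLDER, local_path.rstrip("/"))


# Hypothetical template line and local path, for illustration only.
template = '{"ref_map_file": "<local_path_prefix>/GRCh38.ref_map.tsv"}'
print(fill_local_path_prefix(template, "/data/references/"))
```

To apply it to a real file, read the template with `pathlib.Path(...).read_text()`, pass the text through the function, and write the result to a new file so the original template is preserved.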
Run the workflow using the engine and backend that you have configured (miniwdl, Cromwell).
Note that the calls to miniwdl and Cromwell assume you are accessing the engine directly on the machine on which it has been deployed. Depending on the backend you have configured, you may be able to submit workflows using different methods (e.g. using trigger files in Azure, or using the Amazon Genomics CLI in AWS).
miniwdl run workflows/singleton.wdl -i <input_file_path.json>
java -jar <cromwell_jar_path> run workflows/singleton.wdl -i <input_file_path.json>
If Cromwell is running in server mode, the workflow can be submitted using cURL. Fill in the values of CROMWELL_URL and INPUTS_JSON below, then from the root of the repository, run:
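As a rough sketch of what such a submission could look like, the Python snippet below assembles a curl invocation against Cromwell's REST endpoint for workflow submission (`POST /api/workflows/v1`) with the `workflowSource`, `workflowInputs`, and optional `workflowDependencies` form fields. The function name, the default port in the example URL, and the idea of passing a dependencies zip are assumptions for illustration; this is not necessarily the command the repository documents.

```python
import shlex


def build_submit_command(cromwell_url, inputs_json,
                         workflow="workflows/singleton.wdl",
                         dependencies_zip=None):
    """Build a curl argument list that submits a workflow to a Cromwell server."""
    cmd = [
        "curl", "-X", "POST",
        f"{cromwell_url.rstrip('/')}/api/workflows/v1",
        "-H", "accept: application/json",
        "-F", f"workflowSource=@{workflow}",
        "-F", f"workflowInputs=@{inputs_json};type=application/json",
    ]
    if dependencies_zip:
        # Assumption: imported WDL files are bundled in a zip for the server.
        cmd += ["-F", f"workflowDependencies=@{dependencies_zip}"]
    return cmd


# Hypothetical values for CROMWELL_URL and INPUTS_JSON.
print(shlex.join(build_submit_command("http://localhost:8000", "inputs.json")))
```

The snippet only prints the command. Run the printed line from the root of the repository, or pass the list to `subprocess.run`, once the values match your deployment.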
This section describes the inputs required for a run of the workflow. Typically, only the sample-specific sections will be filled out by the user for each run of the workflow. Input templates with reference file locations filled out are provided for each backend.
Workflow inputs for each entrypoint are described in singleton and family documentation.
At a high level, we have two types of input files:
- maps are TSV files describing inputs that will be used for every execution of the workflow, like reference genome FASTA files and genome interval BED files.
- inputs.json files are JSON files that describe the samples to be analyzed and the paths to the input files for each sample.
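To make the distinction concrete, here is a small sketch that reads a map file into a dictionary. It assumes a two-column layout (key, then value, separated by a tab), which is an assumption about the file format rather than something stated here; the keys and paths in the example are hypothetical.

```python
import csv
import io


def read_map(tsv_text: str) -> dict:
    """Parse two-column TSV text into a {key: value} dictionary."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # Assumption: each row is "<key>\t<value>"; shorter rows are skipped.
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# Hypothetical map contents, for illustration only.
example = "name\tGRCh38\nfasta\t/data/references/GRCh38.fasta\n"
print(read_map(example))
```

If the real map files use a different layout, only the dictionary comprehension needs to change.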
The resource bundle containing the GRCh38 reference and other files used in this workflow can be downloaded from Zenodo:
Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio's quay.io repo. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.
The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task. Images can be referenced in the following table by looking for the name after the final / character and before the @sha256:... . For example, the image referred to here is "pb_wdl_base":
~{runtime_attributes.container_registry}/pb_wdl_base@sha256:4b889a1f ... b70a8e87
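The lookup rule (name after the final `/` and before `@sha256:`) can be expressed as a short helper. This is an illustrative sketch; the function name `image_name` is hypothetical, and the digest in the example is left elided exactly as in the reference above.

```python
def image_name(docker_ref: str) -> str:
    """Return the image name between the final '/' and '@sha256:'."""
    # Keep only the part after the last slash (drops the registry prefix).
    after_slash = docker_ref.rsplit("/", 1)[-1]
    # Drop the pinned digest, if one is present.
    return after_slash.split("@sha256:", 1)[0]


ref = "~{runtime_attributes.container_registry}/pb_wdl_base@sha256:4b889a1f ... b70a8e87"
print(image_name(ref))  # prints "pb_wdl_base"
```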
Tool versions and Docker images used in these workflows can be found in the tools and containers documentation.
TO THE GREATEST EXTENT PERMITTED BY APPLICABLE LAW, THIS WEBSITE AND ITS CONTENT, INCLUDING ALL SOFTWARE, SOFTWARE CODE, SITE-RELATED SERVICES, AND DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. ALL WARRANTIES ARE REJECTED AND DISCLAIMED. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THE FOREGOING. PACBIO IS NOT OBLIGATED TO PROVIDE ANY SUPPORT FOR ANY OF THE FOREGOING, AND ANY SUPPORT PACBIO DOES PROVIDE IS SIMILARLY PROVIDED WITHOUT REPRESENTATION OR WARRANTY OF ANY KIND. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A REPRESENTATION OR WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACBIO.