HPC-GVCW

Principal investigator (PI)

Prof. Rod A. Wing,
Director, Center for Desert Agriculture,
Professor, Biological and Environmental Science and Engineering,
4700 King Abdullah University of Science and Technology,
Thuwal 23955-6900,
Kingdom of Saudi Arabia

For Pipeline support

Contact us: [email protected]

Authors:

Nagarajan Kathiresan {[email protected]}
Yong Zhou {[email protected]}
Zhichao Yu {[email protected]}
Luis F. Rivera Serna {[email protected]}
Manjula Thimma {[email protected]}
Keerthana Manickam {[email protected]}
Rod A. Wing {[email protected], [email protected]}

Publication:

DOI: https://doi.org/10.1186/s12915-024-01820-5
PDF available here: https://link.springer.com/content/pdf/10.1186/s12915-024-01820-5.pdf

Computational systems

About Shaheen

The system has 6,174 dual-socket compute nodes based on 16-core Intel Haswell processors running at 2.3 GHz. Each node has 128 GB of DDR4 memory running at 2300 MHz. Overall, the system has a total of 197,568 processor cores and 790 TB of aggregate memory. More information is available at https://www.hpc.kaust.edu.sa/content/shaheen-ii
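
As a quick consistency check, the totals quoted above follow directly from the per-node figures. The short sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Per-node figures quoted in the Shaheen description above.
nodes = 6174
sockets_per_node = 2
cores_per_socket = 16
memory_per_node_gb = 128

# Totals derived from the per-node figures.
total_cores = nodes * sockets_per_node * cores_per_socket
total_memory_gb = nodes * memory_per_node_gb

print(f"total cores: {total_cores:,}")
print(f"aggregate memory: {total_memory_gb:,} GB (about {total_memory_gb // 1000} TB)")
```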

About Ibex cluster

Ibex is a heterogeneous group of nodes, a mix of AMD, Intel and Nvidia GPUs with different architectures, that gives users a variety of options to work with. Overall, Ibex is made up of 488+ nodes that together form a heterogeneous cluster, and the workload is managed by the SLURM scheduler. More information is available at https://www.hpc.kaust.edu.sa/ibex

Workflow for Rice Variant Calling

Required Software

The following software versions were used and tested with HPC-GVCW.

bwa 0.7.17
samtools 1.8
gatk 4.1.6.0
tabix 0.2.6

Phase #1 - Data pre-processing
The objective of this phase is to obtain clean data from the collected rice genome samples. This includes (a) genome alignment using the BWA-MEM algorithm and (b) fixing mate information, marking duplicates, and assigning read groups for the same set of genomes using the Genome Analysis Toolkit (GATK).
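
The steps of this phase can be sketched as a sequence of command lines for one sample. This is a minimal illustration, not the pipeline's actual scripts: the function name `phase1_commands`, the file names, the read-group values, and the thread count are all hypothetical, and the order of the GATK steps simply follows the description above. The commands are only built and printed, not executed.

```python
import shlex


def phase1_commands(sample, ref, fq1, fq2, threads=4):
    """Build (but do not run) pre-processing commands for one sample.

    All file names and read-group values are hypothetical placeholders.
    """
    sam = f"{sample}.sam"
    fixed = f"{sample}.fixmate.bam"
    dedup = f"{sample}.dedup.bam"
    final = f"{sample}.rg.bam"
    return [
        # (a) Align paired-end reads with BWA-MEM.
        ["bwa", "mem", "-t", str(threads), "-o", sam, ref, fq1, fq2],
        # (b) Fix mate information and coordinate-sort the output.
        ["gatk", "FixMateInformation", "-I", sam, "-O", fixed,
         "--SORT_ORDER", "coordinate"],
        # Mark duplicate reads and write a metrics file.
        ["gatk", "MarkDuplicates", "-I", fixed, "-O", dedup,
         "-M", f"{sample}.dup_metrics.txt"],
        # Assign read-group information (placeholder values).
        ["gatk", "AddOrReplaceReadGroups", "-I", dedup, "-O", final,
         "--RGLB", "lib1", "--RGPL", "ILLUMINA", "--RGPU", "unit1",
         "--RGID", sample, "--RGSM", sample],
    ]


if __name__ == "__main__":
    for cmd in phase1_commands("sampleA", "ref.fa",
                               "sampleA_1.fq.gz", "sampleA_2.fq.gz"):
        print(shlex.join(cmd))
```

Because each sample is processed independently, a list like this can be generated once per sample and submitted as separate jobs.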

Phase #2 - Variant discovery
The objective of this phase is to call the variants per sample and generate gVCF files. Two major steps are required in this variant discovery phase. First, the multiple sorted input files are merged into a single BAM file, and the merged BAM is (re)sorted using SAMtools. Second, SNPs and INDELs are called simultaneously via local de novo assembly of haplotypes in an active region using the GATK tool HaplotypeCaller. At the end of this phase, we have a gVCF output of SNPs and INDELs.
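
The two steps above can be sketched as command lines for one sample. This is an illustration under stated assumptions: `phase2_commands`, the file names, and the thread count are hypothetical, and the indexing step is added here only because HaplotypeCaller expects an indexed BAM. The commands are built and printed, not executed.

```python
import shlex


def phase2_commands(sample, ref, sorted_bams, threads=4):
    """Build (but do not run) variant-discovery commands for one sample.

    File names are hypothetical placeholders.
    """
    merged = f"{sample}.merged.bam"
    resorted = f"{sample}.merged.sorted.bam"
    gvcf = f"{sample}.g.vcf.gz"
    return [
        # Step 1a: merge the per-sample sorted BAM files into one BAM.
        ["samtools", "merge", "-@", str(threads), merged, *sorted_bams],
        # Step 1b: (re)sort the merged BAM by coordinate.
        ["samtools", "sort", "-@", str(threads), "-o", resorted, merged],
        # Index the sorted BAM so it can be read by region.
        ["samtools", "index", resorted],
        # Step 2: call SNPs and INDELs and emit a gVCF.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", resorted,
         "-O", gvcf, "-ERC", "GVCF"],
    ]


if __name__ == "__main__":
    for cmd in phase2_commands("sampleA", "ref.fa",
                               ["sampleA.lane1.bam", "sampleA.lane2.bam"]):
        print(shlex.join(cmd))
```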

Phase #3 - Callset refinement
In this phase, we combine all gVCF files from HaplotypeCaller and perform joint genotyping across all the samples. This phase is extremely complex because (i) the samples processed across the cluster nodes in phase #1 and phase #2 are combined into a single file (using GATK CombineGVCFs) before multi-sample joint genotyping (using GATK GenotypeGVCFs), and (ii) the CombineGVCFs and GenotypeGVCFs steps are each executed on a single core by GATK.
Because these GATK steps run sequentially, assembling genotypes across multiple samples into a single file takes an extremely long time and requires a large amount of memory when data parallelization is absent. To address these limitations, recent versions of GATK accept genomic intervals in the CombineGVCFs and GenotypeGVCFs calls, which allows the data to be split into independent chunks and processed in parallel.
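
The interval-based data parallelization described above can be sketched as follows. This is a minimal illustration, not the pipeline's own chunking logic: the function names, chromosome names and lengths, chunk size, and file names are all hypothetical. Each interval produces one CombineGVCFs and one GenotypeGVCFs command that can run as an independent job.

```python
import shlex


def split_intervals(chrom_lengths, chunk_size):
    """Split each chromosome into 1-based, inclusive intervals."""
    intervals = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            intervals.append(f"{chrom}:{start}-{end}")
    return intervals


def phase3_commands(ref, gvcfs, interval, tag):
    """Build (but do not run) the two commands for one interval chunk."""
    combined = f"combined.{tag}.g.vcf.gz"
    genotyped = f"genotyped.{tag}.vcf.gz"
    # Combine all per-sample gVCFs, restricted to this interval.
    combine = ["gatk", "CombineGVCFs", "-R", ref, "-L", interval,
               "-O", combined]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    # Joint genotyping on the combined chunk, same interval.
    genotype = ["gatk", "GenotypeGVCFs", "-R", ref, "-V", combined,
                "-L", interval, "-O", genotyped]
    return combine, genotype


if __name__ == "__main__":
    # Hypothetical chromosome lengths and chunk size, for illustration only.
    chroms = {"Chr1": 2_500_000, "Chr2": 1_000_000}
    gvcfs = ["sampleA.g.vcf.gz", "sampleB.g.vcf.gz"]
    for i, interval in enumerate(split_intervals(chroms, 1_000_000)):
        for cmd in phase3_commands("ref.fa", gvcfs, interval, f"chunk{i}"):
            print(shlex.join(cmd))
```

Smaller chunks give more parallel jobs with lower memory per job, at the cost of more output files to gather later; the right chunk size depends on the cluster and the number of samples.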

Phase #4 - Variant tables
In this phase, the quality of the genotypes is improved by applying variant filters, and the variants in each independent chunk of GenotypeGVCFs output are separated into SNPs and INDELs. Once all the chunks of filtered SNPs and INDELs are generated, these partial chunks can be combined into a single file using GatherVcfs; it is recommended to assemble them per chromosome. The chromosome-based SNPs and INDELs are then converted into variant tables.
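
The steps of this phase can be sketched as command lines. This is an illustration under stated assumptions: the function names and file names are hypothetical, and the filter expression `QD < 2.0` is only a placeholder, since the filters actually used by the pipeline are not stated here. The commands are built and printed, not executed.

```python
import shlex
from collections import defaultdict


def select_and_filter(ref, chunk_vcf, tag, variant_type):
    """Build commands that split one chunk by type and filter it.

    variant_type is "SNP" or "INDEL"; the filter is a placeholder.
    """
    selected = f"{tag}.{variant_type.lower()}.vcf"
    filtered = f"{tag}.{variant_type.lower()}.filtered.vcf"
    select = ["gatk", "SelectVariants", "-R", ref, "-V", chunk_vcf,
              "--select-type-to-include", variant_type, "-O", selected]
    filt = ["gatk", "VariantFiltration", "-R", ref, "-V", selected,
            "--filter-name", "lowQD", "--filter-expression", "QD < 2.0",
            "-O", filtered]
    return select, filt


def gather_commands(chunk_vcfs, variant_type):
    """Build one GatherVcfs command per chromosome.

    chunk_vcfs is a list of (chromosome, path) pairs, already in
    genomic order, as GatherVcfs expects ordered inputs.
    """
    by_chrom = defaultdict(list)
    for chrom, path in chunk_vcfs:
        by_chrom[chrom].append(path)
    cmds = {}
    for chrom, paths in by_chrom.items():
        cmd = ["gatk", "GatherVcfs"]
        for path in paths:
            cmd += ["-I", path]
        cmd += ["-O", f"{chrom}.{variant_type}.vcf"]
        cmds[chrom] = cmd
    return cmds


def table_command(vcf, table):
    """Build a VariantsToTable command with a few standard fields."""
    return ["gatk", "VariantsToTable", "-V", vcf,
            "-F", "CHROM", "-F", "POS", "-F", "REF", "-F", "ALT",
            "-O", table]


if __name__ == "__main__":
    for cmd in select_and_filter("ref.fa", "genotyped.chunk0.vcf.gz",
                                 "chunk0", "SNP"):
        print(shlex.join(cmd))
    chunks = [("Chr1", "chunk0.snp.filtered.vcf"),
              ("Chr1", "chunk1.snp.filtered.vcf"),
              ("Chr2", "chunk2.snp.filtered.vcf")]
    for chrom, cmd in gather_commands(chunks, "snp").items():
        print(shlex.join(cmd))
        print(shlex.join(table_command(f"{chrom}.snp.vcf",
                                       f"{chrom}.snp.table")))
```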

Summary of workflow steps across multiple phases

The table below summarizes the bioinformatics tools used at each stage of the workflow, together with the optimal number of CPUs, the data parallelization method, and the input/output file formats for each step.
