rknx/prok-snptree

Prok-SNPTree

Prok-SNPTree is a pipeline that generates a phylogenetic tree from core substitutions between a reference genome and whole-genome Illumina sequencing data.
Prok-SNPTree was originally designed and optimized for small prokaryotic genomes, but can perform reasonably well with larger genomes up to 100 Mb.

Prok-SNPTree comes with a helper sbatch file for providing the required input information. The sbatch file has preset information for running through the SLURM scheduler, but should also be able to run without SLURM.

Installation

Place the helper batch file and the script in the working directory, then make the script executable.

chmod +x prok-snptree.sh

Dependencies

The following tools must be installed, and their executables must be available in the environment (on the PATH).

| Function | Tools/scripts |
|---|---|
| Parallelized sample processing | GNU Parallel |
| Quality check for raw reads | FastQC, MultiQC |
| Adapter identification and trimming | cutadapt, trim_galore |
| Genome indexing and read alignment | bwa |
| Binary conversion and sorting | Samtools |
| Variant calling and selection | GATK |
| Phylogenetic tree | RAxML |
| Pairwise SNP count | FastaToSNPCount.sh |

Input files

  • All paired gzipped fastq files can be placed in the working directory or in a subdirectory named fastq. The pipeline does not currently support singletons.
  • The reference genome should be in the refs subdirectory and named genome.fna. Symbolic links are accepted. Alternatively, the script can download it automatically (with wget) if a direct link is provided (see arguments below).
  • Reference annotation is not currently used. For futureproofing, it may be provided inside the refs subdirectory as genes.gtf. See alternative methods in arguments.
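
The layout described above can be verified before submitting a job. The following is a minimal sketch of such a pre-flight check, not part of prok-snptree.sh: the function name check_inputs and the file patterns *.fastq.gz and *.fq.gz are assumptions, and it only covers the case where the reference genome is supplied manually rather than downloaded.

```shell
# Hypothetical pre-flight check; not part of prok-snptree.sh.
# Usage: check_inputs [WORKDIR]
check_inputs() {
    local workdir="${1:-.}"
    local fqdir="$workdir"
    # Prefer the fastq subdirectory when it exists.
    if [ -d "$workdir/fastq" ]; then
        fqdir="$workdir/fastq"
    fi

    # Count gzipped fastq files (the name patterns are an assumption).
    local n
    n=$(find "$fqdir" -maxdepth 1 \( -name '*.fastq.gz' -o -name '*.fq.gz' \) | wc -l | tr -d ' ')
    if [ "$n" -eq 0 ]; then
        echo "no gzipped fastq files found in $fqdir" >&2
        return 1
    fi

    # -s follows symbolic links, so a linked reference is accepted.
    if [ ! -s "$workdir/refs/genome.fna" ]; then
        echo "refs/genome.fna is missing or empty" >&2
        return 1
    fi

    echo "ok: $n fastq files, reference found"
}
```

Calling check_inputs with no argument checks the current directory. The sketch does not verify that every sample actually has both mates of a pair.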

Arguments

The script accepts the following arguments, which are supplied from the helper sbatch file.

  1. refgenome (Reference genome)
    Direct link (URL) to the reference genome (gzipped). This is for convenience; the intended goal is to be able to download the reference genome from NCBI etc. Keep its value empty if the genome will be supplied manually (see input files above).

  2. refannotation (Annotation file for reference genome)
    Direct link (URL) to the reference annotation (gzipped), supplied in the same way as refgenome. This option is here for future functionality and may be left empty for now.

  3. minDP, minQD, maxRDP, and minADP (VCF filtration parameters)

    • minDP: Minimum sequencing depth (Positions that fail are considered absent)
    • minQD: Minimum depth-normalized quality (SNPs that fail are ignored)
    • maxRDP: Maximum reference allele depth for SNPs to be accepted as real.
    • minADP: Minimum alternate allele depth for SNPs to be accepted as real.
  4. ncpu and nmem (Parallelization parameters)
    Number of CPUs and total memory to use. If the number of CPUs is less than 8, only one sample is processed at a time, using all available CPUs. Otherwise, 8 CPUs are used per sample, and samples are processed in parallel according to the number of CPUs.

  5. nboot (Bootstrap)
    Number of bootstraps to be used while preparing the phylogenetic tree with RAxML.
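
To make the four VCF filtration thresholds concrete, here is a minimal sketch of how a single site might be classified from its depths. It is an illustration only and not the pipeline's actual filtering code: the function name classify_site, the default threshold values, the output labels, and the order in which the checks are applied are all assumptions.

```shell
# Hypothetical helper; illustrates the meaning of the thresholds only.
# Usage: classify_site DP QD RDP ADP
#   DP  = total depth, QD = depth-normalized quality,
#   RDP = reference allele depth, ADP = alternate allele depth
classify_site() {
    # Default thresholds are example values (assumptions), overridable
    # through shell variables of the same names.
    awk -v dp="$1" -v qd="$2" -v rdp="$3" -v adp="$4" \
        -v minDP="${minDP:-10}" -v minQD="${minQD:-10}" \
        -v maxRDP="${maxRDP:-2}" -v minADP="${minADP:-8}" 'BEGIN {
        if (dp < minDP)        print "absent";    # position treated as missing
        else if (qd < minQD)   print "ignored";   # SNP call is skipped
        else if (rdp > maxRDP) print "rejected";  # too many reference reads
        else if (adp < minADP) print "rejected";  # too few alternate reads
        else                   print "accepted";
    }'
}
```

For example, `minDP=20 classify_site 15 30 0 15` applies a stricter depth threshold to one call without changing the other defaults.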

Running the program

If SLURM is available, edit the resource parameters in the sbatch file and run it as sbatch slurm.batch.
If running without SLURM, run it as bash slurm.batch. This has not been thoroughly tested and is not officially supported in the current version.
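
When choosing resource parameters, it helps to see how ncpu translates into parallel work under the rule given for the ncpu argument. The sketch below is an assumption-laden illustration: the function name plan_jobs is hypothetical, and rounding down with integer division is a guess at how leftover CPUs are handled.

```shell
# Hypothetical helper; not taken from prok-snptree.sh.
# Usage: plan_jobs NCPU
# Prints "<samples in parallel> <threads per sample>".
plan_jobs() {
    local ncpu=$1
    if [ "$ncpu" -lt 8 ]; then
        # Fewer than 8 CPUs: one sample at a time, using every CPU.
        echo "1 $ncpu"
    else
        # 8 CPUs per sample; integer division is an assumption.
        echo "$(( ncpu / 8 )) 8"
    fi
}
```

Under these assumptions, requesting 20 CPUs would run two samples at a time with 8 threads each, leaving 4 CPUs idle, so multiples of 8 are the most efficient choice.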

Optimizations

The pipeline is written to allow resuming or rerunning. The main output files are named systematically and are used as checkpoints.
Some examples:

  • Trimming: non-empty <sample>.og files in fastq subdirectory.
  • Alignment: non-empty .bam file in align/<sample> subdirectory.
  • Variant calling: non-empty .vcf file in variants/<sample> subdirectory.
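
A checkpoint test of this kind can be sketched as follows. This is a minimal illustration of the idea, not the pipeline's own code: the function name step_done and the sample name sampleA are hypothetical.

```shell
# Hypothetical helper: a step counts as done when at least one non-empty
# file with the given extension exists in the directory.
# Usage: step_done DIR EXT
step_done() {
    local dir=$1 ext=$2 f
    for f in "$dir"/*."$ext"; do
        # If nothing matches, the unexpanded pattern fails the -s test.
        if [ -s "$f" ]; then
            return 0
        fi
    done
    return 1
}

sample=sampleA   # hypothetical sample name
if step_done "align/$sample" bam; then
    echo "skip alignment for $sample"
else
    echo "run alignment for $sample"
fi
```

Testing for a non-empty file rather than mere existence means that an output left at zero bytes by an interrupted run is treated as incomplete.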

If some samples fail, just rerun the pipeline after making changes to the inputs. Completed samples with valid outputs are not processed again.

If the pipeline exits in the middle of an operation, just rerun it, and the pipeline will pick up from the last completed operation.

If you add new samples, just rerun the pipeline to process them.

Citation

A peer-reviewed paper is pending publication. For now, please cite the Zenodo record as follows:

Sharma A. 2022. rknx/Prok-SNPTree: Phylogenetic Tree from Next-gen Sequencing of Prokaryotes (v0.1b). Zenodo. https://doi.org/10.5281/zenodo.7445133
