Skip to content

A pipeline to process NGS data on a SLURM cluster according to GATK best practice

Notifications You must be signed in to change notification settings

mattheatley/ngs_pipe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 

Repository files navigation

READ ME

This software helps facilitate the processing of ngs data on a SLURM cluster according to parameters commonly used by Levi Yant's lab.



Requirements
  
  - python 3.7.3; os, sys, operator, time, argparse, subprocess, re, collections, statistics, itertools
  - trimmomatic 0.39
  - picard 2.21.1
  - bwa 0.7.17
  - samtools 1.9
  - gatk4 4.1.4.0
  - vcftools 0.1.16

N.B.
 If software is installed within a conda environment the environment name must be supplied at each stage (see below).



SETUP

Create the initial default/user-specified working & sub-directories.

  python pipe_ngs.py -setup [-p </path/to>] [-d <directory_name>]




PREPARE FILES

Move reference genome & single/paired-end fastqs to the relevant sub-directories.

i.e.

  /ref.gen
    genome.fasta
    
  /00.fastqs
    /Sample_1
      Reads_1.fastq.gz
      Reads_2.fastq.gz
    /Sample_2
      Reads.fastq.gz
      ...

N.B. 
 Read type is automatically assigned depending on whether or not ALL fastqs can be assigned a pair member.
 At present only novogene reads have been tested but similar naming conventions should also work.



INDEX

Index reference genome.

  python pipe_ngs.py -index



PROCESS DATA

Sequentially processes ngs data.

  python pipe_ngs.py -stage <number> [-x <ploidy>]


  Stage   Description

  1       Merge fastqs
  2       Trim reads
  3       Map reads
  4       Mark duplicates
  5       Call haplotypes
  6       Combine sample
  7       Genotype samples
  8       Re-combine parts
  9       Filter sites
  10      Filter depth

N.B. 
 Single/paired-end reads will be processed according to inferred read type between stages 1-3.
 Ploidy only needs to be specified at stage 5.



ADDITIONAL SETTINGS

  Flag                    Default             Description

  -pa <partition>         stage-specific      specify SLURM partition
  -no <nodes>             stage-specific      specify SLURM nodes
  -nt <ntasks>            stage-specific      specify SLURM ntasks
  -me <memory[units]>     stage-specific      specify SLURM memory
  -wt <HH:MM:SS>          stage-specific      specify SLURM walltime
 
  -d <diretory_name>      ngs_pipe            specify working directory name
  -p </path/to>           home path           specify working directory path
  -u <user_name>          current user        specify SLURM user name
  -m <email_address>      None                specify email address to receive SLURM notifications
  -e <environment_name>   ngs_env             specify conda environment with software installed 
  -l <limit>              100                 specify concurrent task submission limit
  
  -review                                     review stage SLURM accounting data i.e. -stage <number> -review
  -pipe                                       submit pipeline itself; submits new tasks up to the limit from the hpcc
  -sb                                         activate sub buddy; automatically re-submits any stalled tasks
  

About

A pipeline to process NGS data on a SLURM cluster according to GATK best practice

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages