GitHub - mattheatley/ngs_pipe: A pipeline to process NGS data on a SLURM cluster according to GATK best practice

mattheatley / ngs_pipe Public

Notifications You must be signed in to change notification settings
Fork 0
Star 1

A pipeline to process NGS data on a SLURM cluster according to GATK best practice

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
README		README
core.py		core.py
pipe_ngs.py		pipe_ngs.py
sub_buddy.py		sub_buddy.py

Repository files navigation

READ ME

This software helps facilitate the processing of ngs data on a SLURM cluster according to parameters commonly used by Levi Yant's lab.



Requirements
  
  - python 3.7.3; os, sys, operator, time, argparse, subprocess, re, collections, statistics, itertools
  - trimmomatic 0.39
  - picard 2.21.1
  - bwa 0.7.17
  - samtools 1.9
  - gatk4 4.1.4.0
  - vcftools 0.1.16

N.B.
 If software is installed within a conda environment the environment name must be supplied at each stage (see below).



SETUP

Create the initial default/user-specified working & sub-directories.

  python pipe_ngs.py -setup [-p </path/to>] [-d <directory_name>]




PREPARE FILES

Move reference genome & single/paired-end fastqs to the relevant sub-directories.

i.e.

  /ref.gen
    genome.fasta
    
  /00.fastqs
    /Sample_1
      Reads_1.fastq.gz
      Reads_2.fastq.gz
    /Sample_2
      Reads.fastq.gz
      ...

N.B. 
 Read type is automatically assigned depending on whether or not ALL fastqs can be assigned a pair member.
 At present only novogene reads have been tested but similar naming conventions should also work.



INDEX

Index reference genome.

  python pipe_ngs.py -index



PROCESS DATA

Sequentially processes ngs data.

  python pipe_ngs.py -stage <number> [-x <ploidy>]


  Stage   Description

  1       Merge fastqs
  2       Trim reads
  3       Map reads
  4       Mark duplicates
  5       Call haplotypes
  6       Combine sample
  7       Genotype samples
  8       Re-combine parts
  9       Filter sites
  10      Filter depth

N.B. 
 Single/paired-end reads will be processed according to inferred read type between stages 1-3.
 Ploidy only needs to be specified at stage 5.



ADDITIONAL SETTINGS

  Flag                    Default             Description

  -pa <partition>         stage-specific      specify SLURM partition
  -no <nodes>             stage-specific      specify SLURM nodes
  -nt <ntasks>            stage-specific      specify SLURM ntasks
  -me <memory[units]>     stage-specific      specify SLURM memory
  -wt <HH:MM:SS>          stage-specific      specify SLURM walltime
 
  -d <diretory_name>      ngs_pipe            specify working directory name
  -p </path/to>           home path           specify working directory path
  -u <user_name>          current user        specify SLURM user name
  -m <email_address>      None                specify email address to receive SLURM notifications
  -e <environment_name>   ngs_env             specify conda environment with software installed 
  -l <limit>              100                 specify concurrent task submission limit
  
  -review                                     review stage SLURM accounting data i.e. -stage <number> -review
  -pipe                                       submit pipeline itself; submits new tasks up to the limit from the hpcc
  -sb                                         activate sub buddy; automatically re-submits any stalled tasks