An automated workflow for processing ddRADseq data using Stacks v2.4. Starts with sequencer files, and ends with various input files for phylogenetic/phylogeography programs.

Stacks Pipeline

Overview

The goal of this workflow is to automate all major steps involved with processing common ddRADseq datasets, using the newest version of Stacks (v2.4). In particular, this workflow is designed to process single-end (SE) read data generated from ddRADseq libraries prepared with the SbfI and MspI restriction enzymes.

Full Pipeline: The gzipped fastq files from the sequencer are optionally trimmed for UMI sites, demultiplexed using process_radtags, and the RAD cutsites are trimmed from all resulting fastq files. The partial or full Stacks pipeline can then be run (involving the ustacks, cstacks, sstacks, tsv2bam, gstacks, and populations modules), with several key parameters available for the user to set. A range of missing data values is automatically used for the populations module, producing several populations.haplotypes.tsv files. These haplotypes.tsv files are then subjected to additional filtering: samples exceeding a user-specified threshold of missing data are removed, singletons are optionally removed, and one SNP site is selected per locus (either the first site or a random site). This filtering step can be run multiple times using different per-sample missing data thresholds, and a distinct filtered tsv file is created for each run. Finally, all the filtered tsv files are converted into corresponding phylip, fasta, nexus, structure, ped, and map files. Summaries of all datasets created from the filtered tsv files are provided, allowing the user to choose which settings produced the highest quality dataset (in terms of number of samples, loci, and missing data).
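
The "one SNP site per locus" step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the pipeline's actual code; it assumes each locus's haplotypes are equal-length strings and treats any column with more than one non-missing allele as a variable site.

```python
import random

def select_snp_site(haplotypes, mode="first"):
    """Return the index of one variable site across equal-length haplotypes,
    or None if the locus has no variable sites.

    mode: "first" keeps the first variable column; "random" picks one at random.
    """
    length = len(haplotypes[0])
    variable = [i for i in range(length)
                if len({h[i] for h in haplotypes if h[i] != "N"}) > 1]
    if not variable:
        return None
    return variable[0] if mode == "first" else random.choice(variable)

# Toy example: column 1 (C vs G) is the only variable site.
haps = ["ACGT", "AGGT", "ACGT"]
print(select_snp_site(haps, mode="first"))  # -> 1
```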

Partial Pipeline: After processing your data and running Stacks independently, you can filter a resulting populations.haplotypes.tsv file and convert it into a variety of useful output formats. This allows you to process your data independent of the pipeline, but still take advantage of filtering steps and format conversions.

Dependencies

The Stacks Pipeline relies on seqtk (NEW as of v2.1) and Stacks v2.4. Both programs must be installed and available on your PATH; they can be downloaded from their respective project pages.

The Stacks Pipeline scripts can be run using Mac OSX (10.10+) and Linux, and can also work with Windows using a program like Cygwin.

Instructions

Documentation and usage instructions are available on the repository wiki.

If you have already run Stacks yourself and have produced a populations.haplotypes.tsv file, you can filter and convert that file using the following scripts:

  1. Filter_Single_tsv.py: Applies filtering to loci contained in the specified populations.haplotypes.tsv file resulting from the populations module. Selects one SNP per locus (first or random site) and optionally removes singletons. Calculates per-sample missing data and removes samples above user-selected thresholds. Writes a filtered tsv file.

  2. Convert_All_tsv.py: Convert the filtered tsv file to phylip, fasta, nexus, structure, ped, and map formats. Summarizes dataset metrics for convenient comparisons.

  3. Convert_tsv_to_dadi.py: Optional. Create a SNPs input file for use with the demographic modeling program dadi.
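
The per-sample missing-data filter performed by the filtering scripts can be illustrated with a short sketch. The column layout here is an assumption for illustration (two leading locus columns, then one column per sample, with "-" marking a missing haplotype); the actual parsing in Filter_Single_tsv.py may differ.

```python
def filter_samples_by_missing(rows, header, threshold=0.5):
    """Drop sample columns whose fraction of missing loci ('-') exceeds threshold.

    rows: list of tsv rows (lists of strings). The first two columns are assumed
    to be the locus ID and count; the rest are one column per sample.
    Returns (kept_header, filtered_rows).
    """
    n_loci = len(rows)
    keep = [0, 1]  # always keep the locus ID and count columns
    for col in range(2, len(header)):
        missing = sum(1 for row in rows if row[col] == "-")
        if missing / n_loci <= threshold:
            keep.append(col)
    new_header = [header[c] for c in keep]
    new_rows = [[row[c] for c in keep] for row in rows]
    return new_header, new_rows

header = ["Catalog ID", "Cnt", "sampleA", "sampleB"]
rows = [["1", "2", "ACG", "-"],
        ["2", "2", "TTG", "-"],
        ["3", "2", "-", "GGA"]]
# sampleB is missing at 2 of 3 loci (67%), so a 50% threshold removes it.
new_header, new_rows = filter_samples_by_missing(rows, header, threshold=0.5)
print(new_header)  # -> ['Catalog ID', 'Cnt', 'sampleA']
```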

The full usage of the pipeline starts with gzipped fastq sequencer files and ends with filtered datasets in multiple output formats. The general order of the full workflow is as follows:

  1. Demultiplex_Trim.py: Demultiplexes fastq.gz files using process_radtags and trims RAD cutsites with seqtk. Offers an option to remove UMI sites of any length prior to demultiplexing.

  2. Run_Stacks.py: Automates the full Stacks pipeline (ustacks, cstacks, sstacks, tsv2bam, gstacks, populations) or a partial Stacks run (post-ustacks or individual modules), based on a variety of user-selected options and parameter settings.

  3. Filter_All_tsv.py: Applies filtering to loci contained in populations.haplotypes.tsv files resulting from independent runs of the populations module. Selects one SNP per locus (first or random site) and optionally removes singletons. Calculates per-sample missing data and removes samples above user-selected thresholds. Writes filtered tsv files.

  4. Convert_All_tsv.py: Converts filtered tsv files to phylip, fasta, nexus, structure, ped, and map formats. Summarizes dataset metrics for convenient comparisons.

  5. Convert_Stacks_Fasta_to_Loci.py: Optional. Parses the phased sequences of all samples and loci in a given populations.samples.fa fasta file and writes them to locus-specific fasta files. Offers the option to write both alleles per sample, the first allele, a random allele, or a consensus sequence. An optional filter is included that writes only loci with at least one variable site.

  6. Convert_tsv_to_dadi.py: Optional. Create a SNPs input file for use with the demographic modeling program dadi.
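
As a rough picture of the format-conversion step, a relaxed-phylip writer for a sample-to-sequence mapping can be sketched as below. This is a simplified stand-in for what Convert_All_tsv.py produces; the real script handles several formats and many edge cases.

```python
def write_phylip(seqs, path):
    """Write a relaxed-phylip alignment from a {sample_name: sequence} dict.

    All sequences are assumed to be the same length (one concatenated SNP per
    retained locus, with missing data already coded, e.g. as 'N').
    """
    names = sorted(seqs)
    length = len(seqs[names[0]])
    with open(path, "w") as out:
        # Header line: number of taxa and alignment length.
        out.write(f"{len(names)} {length}\n")
        for name in names:
            out.write(f"{name}  {seqs[name]}\n")

seqs = {"sampleA": "ACGTN", "sampleB": "ACGTA"}
write_phylip(seqs, "example.phy")
```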

Version

The current release of the Stacks Pipeline is v2.1.

Major changes in v2.1:

  • seqtk is now used in place of fastx_trimmer. It is much faster and easier to install.

Changes in v2.0:

  • Now uses Stacks v2.41 (vs. 1.35).
  • All modules are now compatible with Python 2.7 and Python 3.7.
  • Offers new custom filtering and output file options.
  • Allows specification of key parameters for individual Stacks modules (including -M, -m, and -n).

License

GNU Lesser General Public License v3.0
