additions to the SWIF(r) framework using an AODE HMM
See the yamls/ folder for the two main environments:
- swifr_tools: contains plink, vcftools, selscan, scikit-allel, etc.
- swifr_example: built with help from CCV, since this environment was tricky to set up; uses Python 3.7
For all of the following scripts, I used metadata files to record the parameter values unique to each simulation. Each file has the following structure:
simulation number | population | generations | allele frequency | selection coefficient | AF min threshold |
---|---|---|---|---|---|
1 | pop1 | 200 | 0.2 | 0.0376 | 0.2 |
2 | pop1 | 200 | 0.2 | 0.0376 | 0.2 |
... | ... | ... | ... | ... | ... |
100 | pop1 | 200 | 0.2 | 0.0376 | 0.2 |
1 | pop1 | 200 | 0.4 | 0.0415 | 0.3 |
... | ... | ... | ... | ... | ... |
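
As a rough illustration of how a metadata file like this can drive the scripts, the sketch below parses the rows and picks the one that belongs to a given job-array task. It rests on assumptions that are not confirmed by this repository: that the file on disk is tab-delimited with no header row, that the columns appear in the order shown in the table, and that array tasks are numbered from 1. The field names and both helper functions are hypothetical.

```python
import csv
import io
from collections import namedtuple

# Column order follows the table above; the field names are hypothetical.
SimRecord = namedtuple(
    "SimRecord",
    ["sim_number", "population", "generations",
     "allele_frequency", "selection_coefficient", "af_min_threshold"],
)


def read_metadata(handle, delimiter="\t"):
    """Parse one simulation record per non-empty line (assumed tab-delimited)."""
    records = []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        fields = [field.strip() for field in row]
        records.append(SimRecord(
            sim_number=int(fields[0]),
            population=fields[1],
            generations=int(fields[2]),
            allele_frequency=float(fields[3]),
            selection_coefficient=float(fields[4]),
            af_min_threshold=float(fields[5]),
        ))
    return records


def record_for_task(records, task_id):
    """Return the record for a 1-based job-array task index."""
    if not 1 <= task_id <= len(records):
        raise IndexError(
            "task_id {} is outside 1..{}".format(task_id, len(records))
        )
    return records[task_id - 1]


# In-memory stand-in for a metadata file, using values from the table above.
example = io.StringIO(
    "1\tpop1\t200\t0.2\t0.0376\t0.2\n"
    "2\tpop1\t200\t0.2\t0.0376\t0.2\n"
    "1\tpop1\t200\t0.4\t0.0415\t0.3\n"
)
records = read_metadata(example)
third = record_for_task(records, 3)
print(third.sim_number, third.allele_frequency, third.selection_coefficient)
```

With a real file, the `io.StringIO` stand-in would be replaced by an opened file handle, and `task_id` would come from whatever index the scheduler passes to each array task.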
- SLiM simulations: scripts to generate the corresponding sweep and neutral simulations; these are hard-coded for different sweep mutation introduction times.
- run_slim.sh: contains the actual SLiM command with the corresponding simulation script
- slim_array.sh: job array setup for sweep simulations, based on the metadata file of all sweep scenarios
- STS calculations: five scripts to generate input for the SWIF(r) tool
For the sweep files, these scripts have been set up more efficiently:
- swifr_pipeline/sts/sts_sweep_array.sh: job array to generate statistics from sweep files
- swifr_pipeline/sts/sts_normalize_sweep_array.sh: job array to normalize statistics
- swifr_pipeline/sts/sts_train_test.sh: job array to create training and testing sets depending on sweep mutation location
For the neutral files, the original file structure is used:
- swifr_pipeline/sts/sts_prepare_neutral.sh: generate single-population VCF files and MAP files
- swifr_pipeline/sts/sts_execute_neutral.sh: calculate five point statistics for population 1
- swifr_pipeline/sts/sts_global_neutral.sh: use neutral files to create global statistics for normalization
- swifr_pipeline/sts/sts_normalize_neutral.sh: normalize all statistics against global files
- swifr_pipeline/sts/sts_concatenate_neutral.sh: concatenate training and testing statistic files for input to SWIF(r)
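
The idea of normalizing statistics against global neutral values can be sketched as a plain z-score: estimate a mean and standard deviation from the neutral simulations, then rescale every value of that statistic with them. This is only an assumed form of the normalization; the actual scripts may do something different (for example, normalizing within allele-frequency bins), and the function names and numbers below are hypothetical.

```python
import math
import statistics


def global_mean_sd(neutral_values):
    """Mean and sample standard deviation of one statistic across neutral sites."""
    if len(neutral_values) < 2:
        raise ValueError("need at least two neutral values")
    return statistics.mean(neutral_values), statistics.stdev(neutral_values)


def normalize(values, mean, sd):
    """Z-score values against the global neutral mean and standard deviation."""
    if sd == 0:
        raise ValueError("neutral standard deviation is zero")
    return [(value - mean) / sd for value in values]


# Made-up values for a single statistic, for illustration only.
neutral = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, sd = global_mean_sd(neutral)

# A value equal to the neutral mean maps to 0; one sd above the mean maps to 1.
normalized = normalize([3.0, 3.0 + sd], mean, sd)
print(normalized)
```

Using the same neutral-derived mean and standard deviation for both sweep and neutral files keeps the two sets on a common scale, which is the point of computing the global values once.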
- SWIF(r): a script to train and test SWIF(r), plus Python scripts that generate ROC and precision-recall curves to measure tool performance
- swifr_pipeline/run_swifr/run_swifr.sh: script that runs SWIF(r) on the training and testing data
- swifr_pipeline/run_swifr/swifr_alltests.sh: reference script for all SWIF(r) training and testing commands
- swifr_pipeline/run_swifr/ROC_curves_SWIFr.py: generate ROC curves from classified output
- swifr_pipeline/run_swifr/precision_recall_curves_SWIFr.py: generate precision-recall curves from classified output
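
For readers unfamiliar with how these curves are built, the sketch below computes ROC and precision-recall points from a list of true labels and classifier scores using only the standard library. It is not the code in the two scripts above, which may rely on a plotting or machine-learning library. It assumes each site carries a label (1 for sweep, 0 for neutral) and a score such as a posterior sweep probability, and it does not parse the actual SWIF(r) output format. All names and values are hypothetical.

```python
def roc_pr_points(labels, scores):
    """Compute FPR, TPR (recall), and precision at each distinct score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one sweep and one neutral site")
    points = []
    # A site is called a sweep when its score is >= the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append({
            "threshold": threshold,
            "fpr": fp / negatives,
            "tpr": tp / positives,
            "precision": tp / (tp + fp),
        })
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule, starting from (0, 0)."""
    xs = [0.0] + [p["fpr"] for p in points]
    ys = [0.0] + [p["tpr"] for p in points]
    return sum(
        (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2
        for i in range(1, len(xs))
    )


# Made-up labels (1 = sweep, 0 = neutral) and scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
points = roc_pr_points(labels, scores)
auc = auc_trapezoid(points)
print(round(auc, 4))
```

Plotting `fpr` against `tpr` gives the ROC curve, and plotting `tpr` against `precision` gives the precision-recall curve. The precision-recall view is usually the more informative one when sweep sites are rare relative to neutral sites.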
- original SWIF(r) publication: https://www.nature.com/articles/s41467-018-03100-7#Sec11
- SWIF(r) repository: https://github.com/ramachandran-lab/SWIFr
- thank you to Dr. Carlos Sarabia for help in setting up much of this initial code structure!