The posseleff_simulations repository is designed to assess the impact of
positive selection on Identity by Descent (IBD)-based inferences, leveraging
population genetic simulation and true IBD methodologies. The pipeline begins by running
simulations to generate tree sequences, allowing for both single and multiple
population models for various analyses. The selection simulation can be tailored
with parameters such as the selection coefficient, number of origins of the
favored mutation, selection starting time, and high-relatedness simulation
options. From there, the pipeline involves calling true IBD segments from the
tree sequence, followed by IBD processing for generating input files, selection
correction, and calling IBDNe and Infomap for Ne and population structure
inference. The pipeline is highly configurable, accommodating different
scenarios and requirements, and produces detailed output for further analysis.
Installation and execution instructions are provided, as well as options for
customization, making it a versatile tool for many IBD-related analyses.
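To give a feel for how the selection parameters listed above interact, the following is a minimal, deterministic sketch and not the simulator used by the pipeline. It assumes a simple haploid selection model with forward-counted generations, and the function name and arguments (`favored_allele_trajectory`, `pop_size`, `start_gen`, `total_gens`) are hypothetical names chosen for illustration.

```python
def favored_allele_trajectory(s, n_origins, pop_size, start_gen, total_gens):
    """Return the expected frequency of a favored allele in each generation.

    Assumptions (illustrative only, not taken from the repository):
    - generations are counted forward from 0, and selection starts at `start_gen`
    - the allele is introduced on `n_origins` copies out of `pop_size`
    - haploid selection without drift: p' = p * (1 + s) / (1 + s * p)
    """
    freqs = []
    p = 0.0
    for gen in range(total_gens):
        if gen == start_gen:
            # The favored mutation appears on n_origins copies.
            p = n_origins / pop_size
        elif gen > start_gen:
            # One generation of deterministic selection.
            p = p * (1 + s) / (1 + s * p)
        freqs.append(p)
    return freqs


if __name__ == "__main__":
    trajectory = favored_allele_trajectory(
        s=0.3, n_origins=1, pop_size=1000, start_gen=5, total_gens=100
    )
    print(f"final frequency: {trajectory[-1]:.4f}")
```

A larger selection coefficient or more origins shortens the sweep, which is why these parameters are exposed for tuning. Real simulations also include drift and recombination, which this sketch ignores.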
- Run simulation and generate tree sequence
  - Two demographic models
    - Single population model for IBD distribution and Ne analysis, and
    - Multiple population model for population structure analysis
  - Selection simulation:
    - Each of the above models allows for positive selection simulation
    - Tunable Simulation parameters include
      - selection coefficient
      - number of origins of the favored mutation
      - selection starting time (generations)
  - High-relatedness/inbreeding simulations
    - support inbreeding modeling via three strategies:
      - shrinking population size
      - positive assortative mating
      - selfing
    - see `simulations/Readme.md` for examples.
- Call true IBD segment from tree sequence
- IBD processing and selection correction
  - IBD processing for generating input files for IBDNe (Ne estimation)
  - IBD processing for generating IBD for calling Infomap (population structure)
  - Identify and validate IBD peaks (due to selection)
  - Remove IBD within validated IBD peak region and generate a selection-corrected version of IBD for calling IBDNe and Infomap
- Call IBDNe and Infomap for Ne and population structure inference
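The selection-correction step can be illustrated with a short sketch. It is one possible reading of "remove IBD within validated IBD peak region", not the repository's implementation. It assumes half-open `[start, end)` coordinates and drops every segment that overlaps a peak on the same chromosome, whereas the actual pipeline may trim or filter segments differently. The `Segment` type and `remove_ibd_in_peaks` function are hypothetical names.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """A pairwise IBD segment (hypothetical layout for illustration)."""
    id1: str
    id2: str
    chrom: int
    start: int  # inclusive
    end: int    # exclusive


def remove_ibd_in_peaks(segments, peaks):
    """Keep only segments that do not overlap any validated peak region.

    `peaks` is an iterable of (chrom, start, end) tuples using the same
    half-open coordinates as the segments.
    """
    # Group peak intervals by chromosome for quick lookup.
    peaks_by_chrom = {}
    for chrom, start, end in peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for seg in segments:
        overlaps_peak = any(
            seg.start < peak_end and peak_start < seg.end
            for peak_start, peak_end in peaks_by_chrom.get(seg.chrom, [])
        )
        if not overlaps_peak:
            kept.append(seg)
    return kept


if __name__ == "__main__":
    ibd = [
        Segment("a", "b", 1, 100, 500),   # overlaps the peak below
        Segment("a", "c", 1, 900, 1200),  # outside the peak
        Segment("b", "c", 2, 100, 500),   # different chromosome
    ]
    corrected = remove_ibd_in_peaks(ibd, peaks=[(1, 400, 800)])
    print(len(corrected))
```

The filtered list would then play the role of the selection-corrected IBD that is passed on to IBDNe and Infomap.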
The pipeline has been tested on Linux and can be adapted to macOS with minor changes. Software dependencies and their version numbers are specified in the './env.yaml' Conda recipe. Additional dependencies that are not available from Conda are specified in the installation instructions below. The overall installation time is about 5-15 minutes.
To create the software environment:
- Install nextflow. See the nextflow documentation.
- Install conda if you have not already done so.
- Install software: `python3 ./init.py`
- Activate the `simulation` environment: `conda activate simulation`
- Run the pipeline: `nextflow ./main.nf -profile sge --num_reps 30 -resume`.
- For large datasets, using a cluster such as SGE is recommended. An example `sge` profile is provided in the `nextflow.config` file and should be adjusted to fit your cluster system. If running on a local computer, remove the `-profile sge` option from the above command.
- The pipeline can be reconfigured in the following files:
  - Pipeline parameters are defined in the top lines of `main.nf`
  - More simulation parameters can be found in the definitions of the `sp_defaults`, `mp_defaults`, `sp_sets` and `mp_sets` dictionaries within the `main.nf` file
  - For more complicated or large-scale simulations, the `--sp_sets_json` and `--mp_sets_json` arguments and the `-entry` option are recommended. See `simulations/Readme.md` for examples.
- No input or test data is needed. If desired, the pipeline can be reconfigured as mentioned above. The pipeline can be tested by commenting out all but one entry in `sp_sets` or `mp_sets`.
- Output files/folders:
  - each subfolder of `resdir` represents simulations for a set of chromosomes
  - within each subfolder:
    - `ifm_output` contains infomap results
    - `ne_output` contains IBDNe estimates
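Based on the output layout described above, a small helper such as the following could be used to take inventory of a finished run. This is a sketch only: the folder names `ifm_output` and `ne_output` come from the description above, while the function name `collect_outputs` is hypothetical and the names and formats of the files inside those folders are not assumed.

```python
from pathlib import Path


def collect_outputs(resdir):
    """Map each simulation subfolder of `resdir` to its output file names.

    Returns a dict such as
    {"<subfolder>": {"ifm_output": [...], "ne_output": [...]}}.
    A missing output folder is reported as an empty list.
    """
    results = {}
    for sub in sorted(Path(resdir).iterdir()):
        if not sub.is_dir():
            continue  # skip stray files at the top level
        entry = {}
        for name in ("ifm_output", "ne_output"):
            out_dir = sub / name
            if out_dir.is_dir():
                entry[name] = sorted(p.name for p in out_dir.iterdir())
            else:
                entry[name] = []
        results[sub.name] = entry
    return results
```

For example, `collect_outputs("resdir")` could be called after the pipeline finishes to check which simulation sets already have Infomap results and IBDNe estimates.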
If you find this repository useful, please cite our preprint:
Guo, B., Borda, V., Laboulaye, R., Spring, M. D., Wojnarski, M., Vesely, B. A., Silva, J. C., Waters, N. C., O'Connor, T. D., & Takala-Harrison, S. (2023). Strong Positive Selection Biases Identity-By-Descent-Based Inferences of Recent Demography and Population Structure in Plasmodium falciparum. bioRxiv : the preprint server for biology, 2023.07.14.549114. https://doi.org/10.1101/2023.07.14.549114
Other citations:
iHS statistics and calculation:
iHS calculation via scikit-allel:
Miles, A. et al. cggh/scikit-allel: v1.3.7. (2023) doi:10.5281/ZENODO.8326460.
iHS statistics:
Voight, B. F., Kudaravalli, S., Wen, X. & Pritchard, J. K. A map of recent positive selection in the human genome. PLoS Biol. 4, e72 (2006).
IBDNe
Browning, S. R., & Browning, B. L. (2015). Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. American journal of human genetics, 97(3), 404–418. https://doi.org/10.1016/j.ajhg.2015.07.012
Infomap algorithm
Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences of the United States of America, 105(4), 1118–1123. https://doi.org/10.1073/pnas.0706851105