Code and data to reproduce main figures and simulations of Chevy, Huerta-Sanchez, and Ramachandran, 2023.
figX_plot.py
contains all the code necessary to create Fig X.figX_data/
contains all data necessary to plot Fig X. Datafiles from simulations are provided, and were generated using the scripts provided ingenerating_simulation_data/
. Sources for previously published data are provided inREADME.md
files.extract_pI_data.py
andextract_pI_data_sexbias.py
contain helper functions to organize simulation output files into dataframes. They are imported as modules, so should be in the same directory or added to the path usingsys.path.append('/path/to/files')
.
- Install software as described here: https://github.com/LauritsSkov/Introgression-detection#installation
- Download all files as described in the original software here: https://github.com/LauritsSkov/Introgression-detection#getting-data
- TODO: diplofy males
- Place
.bcf
files intobcf_local/
, ancestral.fa
files intoanc_files/
, and archaic variant files into `archaicvar/. - Rename chromosome X
.bcf
to match format of others:mv /path/to/diplofied_chrX_file.bcf bcf_local/ALL.chrX.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.bcf
ingroup_haps.txt
lists the 1835 1kG sample IDs presented in the paper.individuals.json
defines the 1835 ingroup samples, and the 292 outgroup samples from the YRI, MSL, and ESN sample groups.1_submit_outgroup.sh
is the first step, and creates theoutgroup.txt
file.2_submit_mutation_rate.sh
is the second step, and creates themutationrate.bed
file.345_submit_ingroup_train_decode_decodeX.sh
are the rest of the steps, which create and fill theobs_files/
,trained_files/
,decoded_chrXs/
, anddecoded_autosomes/
folders.
make_df_from_decodings.py
filters coverage tract calls, calculates per-base-pair coverage rates, and saves output as a.csv.gz
file that is loaded byplotting/fig2_plot.py
.- N.B. The second section of
fig2_plot.py
,# %% create file for table 1
, will overwriteplotting/fig1_data/table1_hmmix.txt
with results from this output file.
- N.B. The second section of
decode_chrX.py
is called by345_submit_ingroup_train_decode_decodeX.sh
, and implements a correction to the emissions parameters to decode chromsome X.skov_selection_targets.csv
contains locations of putative targets of selection on chromosome X. Tracts overlapping these were excluded from analysis.- File is renamed from
media-3.csv
, downloaded from the preprint. The results are now published as Skov et al., 2023.
- File is renamed from
all_hg19_GRCh37_chrX_gaps
contains locations of all gaps (telomeres, centromeres, etc.) in chrX. These were not excluded from analysis, but can be using theremove_gaps
variable inmake_df_from_decodings.py
(ln. 16).
-
sim_seq_info.txt
contains the exon coordinates and recombination rate map used in SLiM simulations. -
simulation.slim
is an example SLiM simulation script used to generate simulated archaic coverage data. In order to run a simulation, confirm that the:- paths to
sim_seq_info.txt
(ln. 17) and the output treesequence.trees.decap
(ln. 106) are set correctly, - the inheritance method
"A"
for autosomal or"X"
for chromosome X (ln. 4) is correct, - the male fraction of the archaic introgressors (ln. 5) is correct,
- the dominance coefficient
$h$ (second argument toinitializeMutationType
, ln. 12) is correct, and - the "timecourse" section (ln.s 96-99) is uncommented if you want to calculate archaic coverage in samples during the generations immediately following introgression.
- paths to
-
tree_tools.py
contains helper functions used inibd_to_pI.py
andibd_to_spectra.py
. It is imported as a module, so must be in the same directory or added to the path usingsys.path.append('/path/to/tree_tools.py')
. -
ibd_to_pI.py
calculates the archaic coverage on haplotypes sampled at the end of a simulation.
It takes a treesequence file (.trees.decap
) generated bysimulation.slim
and generates two further files:- a tab-delimited text file (
.pI
) where the first column is the sample ID of the haplotype and the second column is the archaic coverage on that sample at the end of the simulation, and, - optionally, a pickled copy (
.ibd
) of a dictionary containing the start and end coordinates of each archaic coverage tract within each sample. This is only generated if thesave_ibd_dicts
variable inmain()
is set toTrue
. Generating the.ibd
file is necessary to calculate the coverage tract length spectra usingibd_to_spectra.py
.
- a tab-delimited text file (
-
ibd_to_spectra.py
calculates the lengths of archaic coverage tracts found on sampled haplotypes. It takes an.ibd
file generated byibd_to_pI.py
and generates a compressed version (.tracts.npz
) of the tract length spectra across all samples, specific to each human sample, and specific to each archaic introgressor.
Chevy, E. T., Huerta-Sánchez, E., & Ramachandran, S. (2023). Integrating sex-bias into studies of archaic introgression on chromosome X. PLOS Genetics, 19(8), e1010399. https://doi.org/10.1371/journal.pgen.1010399