Skip to content

Latest commit

 

History

History
52 lines (48 loc) · 5.55 KB

README.md

File metadata and controls

52 lines (48 loc) · 5.55 KB

archaic-chrX-sexbias

Code and data to reproduce main figures and simulations of Chevy, Huerta-Sanchez, and Ramachandran, 2023.

Plotting main figures plotting/

  • figX_plot.py contains all the code necessary to create Fig X.
  • figX_data/ contains all data necessary to plot Fig X. Datafiles from simulations are provided, and were generated using the scripts provided in generating_simulation_data/. Sources for previously published data are provided in README.md files.
  • extract_pI_data.py and extract_pI_data_sexbias.py contain helper functions to organize simulation output files into dataframes. They are imported as modules, so should be in the same directory or added to the path using sys.path.append('/path/to/files').

Calling 1kG introgression tracts hmmix/

prerequisites for running hmmix from this folder

running hmmix

  • ingroup_haps.txt lists the 1835 1kG sample IDs presented in the paper.
  • individuals.json defines the 1835 ingroup samples, and the 292 outgroup samples from the YRI, MSL, and ESN sample groups.
  • 1_submit_outgroup.sh is the first step, and creates the outgroup.txt file.
  • 2_submit_mutation_rate.sh is the second step, and creates the mutationrate.bed file.
  • 345_submit_ingroup_train_decode_decodeX.sh are the rest of the steps, which create and fill the obs_files/, trained_files/, decoded_chrXs/, and decoded_autosomes/ folders.

processing hmmix output

  • make_df_from_decodings.py filters coverage tract calls, calculates per-base-pair coverage rates, and saves output as a .csv.gz file that is loaded by plotting/fig2_plot.py.
    • N.B. The second section of fig2_plot.py, # %% create file for table 1, will overwrite plotting/fig1_data/table1_hmmix.txt with results from this output file.
  • decode_chrX.py is called by 345_submit_ingroup_train_decode_decodeX.sh, and implements a correction to the emissions parameters to decode chromsome X.
  • skov_selection_targets.csv contains locations of putative targets of selection on chromosome X. Tracts overlapping these were excluded from analysis.
  • all_hg19_GRCh37_chrX_gaps contains locations of all gaps (telomeres, centromeres, etc.) in chrX. These were not excluded from analysis, but can be using the remove_gaps variable in make_df_from_decodings.py (ln. 16).

Generating simulation data generating_simulation_data/

  • sim_seq_info.txt contains the exon coordinates and recombination rate map used in SLiM simulations.
  • simulation.slim is an example SLiM simulation script used to generate simulated archaic coverage data. In order to run a simulation, confirm that the:
    • paths to sim_seq_info.txt (ln. 17) and the output treesequence .trees.decap (ln. 106) are set correctly,
    • the inheritance method "A" for autosomal or "X" for chromosome X (ln. 4) is correct,
    • the male fraction of the archaic introgressors (ln. 5) is correct,
    • the dominance coefficient $h$ (second argument to initializeMutationType, ln. 12) is correct, and
    • the "timecourse" section (ln.s 96-99) is uncommented if you want to calculate archaic coverage in samples during the generations immediately following introgression.
  • tree_tools.py contains helper functions used in ibd_to_pI.py and ibd_to_spectra.py. It is imported as a module, so must be in the same directory or added to the path using sys.path.append('/path/to/tree_tools.py').
  • ibd_to_pI.py calculates the archaic coverage on haplotypes sampled at the end of a simulation.
    It takes a treesequence file (.trees.decap) generated by simulation.slim and generates two further files:
    • a tab-delimited text file (.pI) where the first column is the sample ID of the haplotype and the second column is the archaic coverage on that sample at the end of the simulation, and,
    • optionally, a pickled copy (.ibd) of a dictionary containing the start and end coordinates of each archaic coverage tract within each sample. This is only generated if the save_ibd_dicts variable in main() is set to True. Generating the .ibd file is necessary to calculate the coverage tract length spectra using ibd_to_spectra.py.
  • ibd_to_spectra.py calculates the lengths of archaic coverage tracts found on sampled haplotypes. It takes an .ibd file generated by ibd_to_pI.py and generates a compressed version (.tracts.npz) of the tract length spectra across all samples, specific to each human sample, and specific to each archaic introgressor.

Citation

Chevy, E. T., Huerta-Sánchez, E., & Ramachandran, S. (2023). Integrating sex-bias into studies of archaic introgression on chromosome X. PLOS Genetics, 19(8), e1010399. https://doi.org/10.1371/journal.pgen.1010399