This repo contains code for determining how well wild genetic diversity is captured ex situ for two rare and threatened species: Quercus acerifolia (abbreviated to QUAC in this repo), a North American oak native to Arkansas, and Zamia integrifolia (abbreviated to ZAIN in this repo), the only cycad native to the continental US. Both species are threatened by limited population numbers and a history of habitat fragmentation due to intensive human land use, and both are very well-preserved in botanic gardens. QUAC and ZAIN both have >300 individuals ex situ, while only ~150 individuals ex situ are predicted to capture large portions of diversity in these species (Rosenberger et al., 2021), so we hypothesize:
- That both species' wild genetic diversity should be well-represented in botanic gardens. We can test this hypothesis by determining if:
  - Genetic diversity in wild and garden populations does not differ significantly.
  - At least 95% of all wild alleles, across all frequency categories, are represented ex situ.
  - Duplicates of wild alleles are represented ex situ.
- Geographic diversity of both QUAC and ZAIN is well-represented in botanic gardens, which we will determine by:
  - Examining if there are samples sourced from all wild populations in ex situ collections.
  - Running STRUCTURE and STRUCTURE HARVESTER on wild and garden populations to determine if all wild genetic structure is represented in botanic garden collections.
  - Performing PCA on both species to determine if gardens encompass all wild genetic structure.
  - Assigning botanic garden individuals to wild source populations using Geneclass 2.
- We also attempt to provide recommendations for improving ex situ collections by assessing if:
  - Relatedness is higher in ex situ populations than in wild populations.
- We also performed resampling analyses to determine how efficient our collections were at representing wild diversity (see Hoban et al., 2020 and Griffith et al., 2020).
The code in this repo details the analyses performed on nuclear microsatellite data for QUAC and ZAIN to test our above hypotheses and determine how we can improve guidelines for creating ex situ collections.
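As a rough illustration of the allele-capture test in the hypotheses above, the sketch below computes the percentage of wild alleles that also occur in garden samples, optionally restricted to a wild-frequency range and to a minimum number of garden copies. It is written in Python for illustration only; the analyses in this repo are R scripts, and the function names, genotype layout, and toy data here are assumptions rather than the repo's code.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {(locus, allele): frequency} for a list of genotypes.

    Each genotype is a dict mapping locus name -> (allele1, allele2);
    None marks a missing allele (an assumed convention for this sketch).
    """
    counts = Counter()
    totals = Counter()
    for genotype in genotypes:
        for locus, alleles in genotype.items():
            for allele in alleles:
                if allele is not None:
                    counts[(locus, allele)] += 1
                    totals[locus] += 1
    return {key: n / totals[key[0]] for key, n in counts.items()}


def percent_captured(wild, garden, min_freq=0.0, max_freq=1.0, min_copies=1):
    """Percent of wild alleles (within a wild-frequency range) that occur
    at least `min_copies` times among garden individuals."""
    wild_freqs = allele_frequencies(wild)
    garden_counts = Counter(
        (locus, allele)
        for genotype in garden
        for locus, alleles in genotype.items()
        for allele in alleles
        if allele is not None
    )
    targets = [k for k, f in wild_freqs.items() if min_freq <= f <= max_freq]
    if not targets:
        return float("nan")
    captured = sum(1 for k in targets if garden_counts[k] >= min_copies)
    return 100.0 * captured / len(targets)


# Hypothetical toy data: two wild individuals, one garden individual.
wild = [
    {"L1": (150, 152), "L2": (200, 200)},
    {"L1": (150, 154), "L2": (200, 204)},
]
garden = [{"L1": (150, 150), "L2": (200, 204)}]

print(percent_captured(wild, garden))                # 60.0
print(percent_captured(wild, garden, min_copies=2))  # 20.0
```

The frequency thresholds for categories such as rare or common alleles are left as parameters because this README does not define them.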
Quercus acerifolia is a rare oak native to Arkansas with only four (possibly five) wild populations and around 600-1000 wild individuals. Leaf tissue for this project was collected from 174 wild individuals during sampling trips by colleagues in 2019. Botanic garden individuals were sampled in collaboration with 15 different botanic gardens in 2019, resulting in 316 tissue samples for genetic data collection. Genotyping was performed with 15 nuclear microsatellite loci: 10 expressed sequence tag-associated microsatellites and 5 genomic microsatellites.
Zamia integrifolia is the only cycad species native to the continental US and has been extirpated from parts of its range in the last century due to habitat fragmentation and overharvesting by humans. For this project, 382 individuals were sampled from 10 different botanic gardens and 751 individuals were collected from 25 different wild populations. Genotyping was performed with 11 microsatellite loci for both wild and garden individuals.
This repo is divided into 2 main folders - Analyses and Data_Files. There is also an Archive folder, but this is a temporary folder for storing code that is being refined.
Data_Files: This folder contains the input data files for each analysis performed in the repo. There are 2 main types of data files - data frames and adegenet files.
- Adegenet_Files are either genepop (.gen) or arlequin (.arp) files. The arlequin files are only created to be converted to genepop files, which is done using the "arp2gen" function from diveRsity. The genepop files are imported into the R package 'adegenet' for the majority of the genetic analyses performed in this study. The genepop files are similar in format to the CSV files, but use a comma after each individual name and a separate line with "POP" to indicate a population break, so that individuals listed between breaks belong to the same population. All individuals are named with their population marker. At the top of the file, all the loci used to genotype the individuals are listed, one per line. There is also a comment line stating which populations and individuals each adegenet file contains.
- CSV_Files contain the genotypes of each individual for each species in a format where individual IDs are in the first column, population names are in the second column (either wild population name or botanic garden name) and an ID for what population type the individual is (either a wild or botanic garden individual) is the third column. The fourth column onward is where the allele data are stored, with one column per allele and each consecutive column pair being the 2 alleles for one locus. Alleles are indicated with their microsatellite length.
- Geneclass_Files are the files used to run the Geneclass 2 software (Piry et al., 2004) for assignment into source populations. The input files for Geneclass are labeled "input" and are in the Genepop format described above, but with numeric coding for alleles in a two-digit format. The Genalex files used to generate each input Genepop file are also included in this folder; Genalex (Peakall & Smouse, 2006) was used to convert these files into the input Genepop files for Geneclass. The "wild" genepop files are used as the reference populations and the "garden" genepop files are used as the samples to be assigned.
- Structure_Files are the files used to run the program STRUCTURE (Pritchard et al., 2000). All structure files are text files, but the files labeled "_str" are the files initially generated by the convert-to-STRUCTURE function in Genalex (Peakall & Smouse, 2006). The files titled "_str_READY.txt" are the files STRUCTURE was actually run on. Each "_str_READY.txt" file is a text file with a layout similar to the CSV files described above, but with individual names in each row and population names coded as numerals. There are no column headers in this text file.
- Spagedi_Files are the files used to run the SPAGeDi program (Hardy & Vekemans, 2002). We ran relatedness analyses in SPAGeDi using the Loiselle et al. (1995) statistic for both QUAC (with and without the Kessler population) and ZAIN (with all populations, rebinned). These data are stored in text files that are similar to the genepop file format, except that the top two lines indicate the ploidy, the number of individuals, the structure of populations, and which spatial distances need to be analyzed.
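To make the genepop layout described above concrete, here is a minimal Python sketch that reads genepop-style text: a comment line, locus names before the first "POP" line, and then one individual per line with a comma after the name. It is an illustration under assumptions, not the repo's import path (the R scripts read these files with adegenet); the `parse_genepop` name and the sample text are hypothetical, and each genotype code is assumed to hold two alleles of equal width, with 0 marking missing data.

```python
def parse_genepop(text):
    """Parse genepop-style text into (title, loci, populations).

    populations is a list (one entry per POP block) of
    (individual_name, [(allele1, allele2), ...]) tuples.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0]          # first line is a free-text comment
    loci = []
    populations = []
    for line in lines[1:]:
        if line.upper() == "POP":
            populations.append([])   # a POP line starts a new population
        elif not populations:
            # Before the first POP line: locus names, one or more per line.
            loci.extend(n.strip() for n in line.split(",") if n.strip())
        else:
            name, _, genotype_field = line.partition(",")
            genotypes = []
            for code in genotype_field.split():
                half = len(code) // 2    # e.g. "150152" -> 150 and 152
                genotypes.append((int(code[:half]), int(code[half:])))
            populations[-1].append((name.strip(), genotypes))
    return title, loci, populations


# Hypothetical file content, not taken from the repo's data.
SAMPLE = """Hypothetical example: 2 populations, 3 individuals
LocusA
LocusB
POP
popA_1 , 150152 200200
popA_2 , 150154 000000
POP
popB_1 , 152152 200204
"""

title, loci, populations = parse_genepop(SAMPLE)
print(loci)                  # ['LocusA', 'LocusB']
print(len(populations))      # 2
print(populations[0][1])     # ('popA_2', [(150, 154), (0, 0)])
```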
Naming conventions: There are several file naming conventions within this folder that refer to different scenarios for data files.
- There are two types of QUAC data files: "QUAC_wK" (with Kessler Mountain individuals) and "QUAC_woK" (without Kessler Mountain individuals). Population genetic analyses can be biased by small population sizes, and the Kessler Mountain population is only represented by 8 individuals, so its inclusion could bias these analyses. We therefore performed all population genetic analyses with and without these individuals to determine whether their inclusion influenced the results. Similarly, we performed most analyses on ZAIN without small wild populations; these results files are indicated with "wo_smallpops" in the name. These populations are removed in the analysis R scripts.
- There are also two types of ZAIN data files - "ZAIN_og" (original scores) and "ZAIN_rebinned" (rebinned scores). Microsatellite scoring was performed at different times for garden and wild individuals, and because creating bins for microsatellite scoring involves some subjectivity that depends on the person doing the scoring, we performed a rebinning analysis to make microsatellite scores consistent between scorers. We performed all analyses on both the original and rebinned scores to determine how rebinning affected conclusions about genetic diversity within this species. We determined that the results generated from the ZAIN_rebinned data files are likely more consistent with the true genetic diversity of the species, rather than artifacts of human-introduced differentiation.
- A file marked with "df" in the title is in the data frame (CSV) format described above, whereas "genalex" data files are in the Genalex file format (Peakall & Smouse, 2006). ".gen" files are genepop files used with the adegenet package in R.
Organization: The overall file structure of the "Data_Files" folder
- Adegenet_Files
- CSV_Files
- Geneclass_Files
- Spagedi_Files
- Structure_Files
  - QUAC
    - QUAC_wK_nopopinfo_garden_wild
    - QUAC_wK_nopopinfo_wild
    - QUAC_woK_nopopinfo_garden_wild
    - QUAC_woK_nopopinfo_wild
  - ZAIN
    - ZAIN_rebinned_nopopinfo_garden_wild
    - ZAIN_rebinned_nopopinfo_wild
Analyses: The three main folders in the Analyses folder are Analysis_RScripts, Functions, and Results. The scripts in Analysis_RScripts run analyses on the files in the Data_Files folder, the Functions folder contains functions written to run certain analyses, and the Results folder contains the results of those analyses.
Analysis steps
For each genepop file we performed several analysis steps for both QUAC and ZAIN; below we list all analysis RScripts and what results they are used to generate.
- 01_ZAIN_Scoring_Comparison_Barplot.R: Both QUAC and ZAIN were genotyped using microsatellite markers, and scoring them involves some subjectivity, both in assigning a decimal fragment size to an integer-based bin and in determining the location of peaks. When microsatellites are scored by different people, scoring differences sometimes result. Usually this is not a problem, because all individuals for a project are scored at the same time by the same researcher, but for ZAIN, microsatellite data were scored by different researchers over the course of ~10 years. Therefore, we compared scores at all microsatellite loci by year scored (garden vs. wild) and determined that four loci differed based on year scored. We then "rebinned" scores to be consistent across years (described in more detail in the supplement of the manuscript) and performed all genetic analyses on both the original scores - the ZAIN_og data files - and the rebinned scores - the ZAIN_rebinned data files - to determine the effect rebinning had on results.
- 02_clonecheck_md.R: We cleaned each data file for clones and missing data. Any individual with greater than 25% missing data was removed from the analysis. Clones (individuals with identical genotypes) were reduced so that only one individual was retained. This script generates the "clean" genepop and data frame files, which are used in all other genetic analyses.
- 03_gendiv_summary_stats.R: For all loci for each species (and scenario - with and without the Kessler population in QUAC, and with og and rebinned scores for ZAIN) we tested for deviations from Hardy-Weinberg equilibrium expectations, estimated null allele frequencies, and tested for linkage disequilibrium. We also assessed each wild population and botanic garden collection for its sample number, number of multi-locus genotypes (MLG), average number of alleles, allelic richness, and expected heterozygosity. For ZAIN individuals, we also ran the analyses with and without small populations.
- 04_garden_wild_comparison.R: This script compares genetic diversity levels - in the form of allelic richness and expected heterozygosity - between population types, either wild or botanic garden. We then also examined the number of wild alleles represented in botanic gardens in different frequency categories - rare alleles, common alleles, etc. - as well as in multiple copies.
- 05_allelic_resampling.R: We also resampled wild individuals to determine at what individual number allelic representation ex situ reaches 95%.
- 07_PCA: R Script for generating PCA for all data files - comparing wild and botanic garden genetic structure.
- 08_Structure: R Script for generating STRUCTURE Q matrix data files.
- 09_maternal_accessions: This script cleans the botanic garden names for both species and identifies how many maternal lineages there are within each botanic garden (i.e., how many half-siblings or individuals with the same mother).
- 10_assignment_test_performance: We compared the assignment test results generated by Geneclass 2 with the source information for botanic garden individuals of QUAC and ZAIN.
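The resampling step listed above (05_allelic_resampling.R) can be sketched as follows: draw random subsets of wild individuals of increasing size and report the smallest size at which the mean proportion of wild alleles captured reaches the target. This Python version is a minimal sketch under assumptions, not the repo's R implementation; the function names, the number of replicates, the seed, and the toy genotypes are all hypothetical.

```python
import random


def alleles_in(genotypes):
    """Set of (locus, allele) pairs present in a list of genotypes.

    Each genotype maps locus name -> (allele1, allele2); None is missing.
    """
    return {
        (locus, allele)
        for genotype in genotypes
        for locus, pair in genotype.items()
        for allele in pair
        if allele is not None
    }


def min_sample_size(wild, target=0.95, reps=100, seed=0):
    """Smallest number of sampled individuals whose mean proportion of
    wild alleles captured, over `reps` random draws, is >= target."""
    rng = random.Random(seed)          # fixed seed for repeatable draws
    all_alleles = alleles_in(wild)
    if not all_alleles:
        return None                    # nothing to capture
    for n in range(1, len(wild) + 1):
        mean_prop = sum(
            len(alleles_in(rng.sample(wild, n))) / len(all_alleles)
            for _ in range(reps)
        ) / reps
        if mean_prop >= target:
            return n
    return None


# Hypothetical toy data: four individuals, each with a private allele,
# so every individual must be sampled to capture all alleles.
private = [{"L1": (100 + i, 100 + i)} for i in range(4)]
print(min_sample_size(private))        # 4

# Four identical individuals: a single sample captures everything.
identical = [{"L1": (150, 152)} for _ in range(4)]
print(min_sample_size(identical))      # 1
```

Sampling without replacement means the full set always captures 100% of alleles, so the loop returns a size whenever at least one allele is present.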
Organization of Analyses folder:
- Analysis_RScripts
  - 1_ZAIN_Scoring_Comparison_Barplot.R
  - 2_clonecheck_md.R
  - 3_gendiv_summary_stats.R
  - 4_garden_wild_comparison.R
  - 5_allelic_resampling.R
  - 7_PCA.R
  - 8_Structure.R
  - 9_maternal_accessions.R
  - 10_assignment_test_performance.R
  - 11_relatedness.R
  - final_manuscript_figures.R
  - final_map_figs.R
  - QUAC_clonal_propagation.R
- Functions
  - accession_count.R
  - dms_degree_conversion.R
  - Fa_sample_funcs.R
  - maternal_accession.R
  - relatedness_analyses.R
  - resampling.R
  - structure_cluster_match.R
- Results
  - Clustering
    - Geneclass
    - PCA
    - Structure
      - QUAC
      - ZAIN
  - Garden_Wild_Comparisons
  - Relatedness
  - Scoring_Comparison
  - Sum_Stats
Earl, D. A., & vonHoldt, B. M. (2012). STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genetics Resources, 4(2), 359-361.
Evanno, G., Regnaut, S., & Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Molecular Ecology, 14(8), 2611-2620.
Griffith, M. P., Clase, T., Toribio, P., Piñeyro, Y. E., Jimenez, F., Gratacos, X., ... & Hoban, S. (2020). Can a botanic garden metacollection better conserve wild plant diversity? A case study comparing pooled collections with an ideal sampling model. International Journal of Plant Sciences, 181(5), 485-496.
Hoban, S., Callicrate, T., Clark, J., Deans, S., Dosmann, M., Fant, J., ... & Griffith, M. P. (2020). Taxonomic similarity does not predict necessary sample size for ex situ conservation: a comparison among five genera. Proceedings of the Royal Society B, 287(1926), 20200102.
Peakall, R., & Smouse, P. E. (2006). GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular Ecology Notes, 6(1), 288-295.
Piry, S., Alapetite, A., Cornuet, J. M., Paetkau, D., Baudouin, L., & Estoup, A. (2004). GENECLASS2: a software for genetic assignment and first-generation migrant detection. Journal of Heredity, 95(6), 536-539.
Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945-959.
Rosenberger, K., Schumacher, E., Brown, A., & Hoban, S. (2021). Proportional sampling strategy often captures more genetic diversity when population sizes vary. Biological Conservation, 261, 109261.