diff --git a/paper/build.sh b/paper/typst/build.sh
similarity index 100%
rename from paper/build.sh
rename to paper/typst/build.sh
diff --git a/paper/diagram.d2 b/paper/typst/diagram.d2
similarity index 100%
rename from paper/diagram.d2
rename to paper/typst/diagram.d2
diff --git a/paper/diagram.png b/paper/typst/diagram.png
similarity index 100%
rename from paper/diagram.png
rename to paper/typst/diagram.png
diff --git a/paper/diagram.svg b/paper/typst/diagram.svg
similarity index 100%
rename from paper/diagram.svg
rename to paper/typst/diagram.svg
diff --git a/paper/figure-examples.png b/paper/typst/figure-examples.png
similarity index 100%
rename from paper/figure-examples.png
rename to paper/typst/figure-examples.png
diff --git a/paper/imgt.png b/paper/typst/imgt.png
similarity index 100%
rename from paper/imgt.png
rename to paper/typst/imgt.png
diff --git a/paper/imgt2.png b/paper/typst/imgt2.png
similarity index 100%
rename from paper/imgt2.png
rename to paper/typst/imgt2.png
diff --git a/paper/imgt3.png b/paper/typst/imgt3.png
similarity index 100%
rename from paper/imgt3.png
rename to paper/typst/imgt3.png
diff --git a/paper/imgt4.png b/paper/typst/imgt4.png
similarity index 100%
rename from paper/imgt4.png
rename to paper/typst/imgt4.png
diff --git a/paper/justfile b/paper/typst/justfile
similarity index 100%
rename from paper/justfile
rename to paper/typst/justfile
diff --git a/paper/lapreprint.typ b/paper/typst/lapreprint.typ
similarity index 100%
rename from paper/lapreprint.typ
rename to paper/typst/lapreprint.typ
diff --git a/paper/logo.png b/paper/typst/logo.png
similarity index 100%
rename from paper/logo.png
rename to paper/typst/logo.png
diff --git a/paper/main.pdf b/paper/typst/main.pdf
similarity index 95%
rename from paper/main.pdf
rename to paper/typst/main.pdf
index c84a07f..95303d2 100644
Binary files a/paper/main.pdf and b/paper/typst/main.pdf differ
diff --git a/paper/main.typ b/paper/typst/main.typ
similarity index 75%
rename from paper/main.typ
rename to paper/typst/main.typ
index 2a5b39b..9fc2e93 100644
--- a/paper/main.typ
+++ b/paper/typst/main.typ
@@ -44,7 +44,11 @@
 abstract: (
 (
 title: "Summary",
- content: [The human leukocyte antigen (HLA) genes have thousands of different alleles in the human population, and have more associations with human diseases than any other genes. Data for all known HLA genotypes are curated in the international ImMunoGeneTics (IMGT) database, and the Allele Frequency Net Database (AFND) provides allele frequencies for each HLA allele across human populations. Our open-source R package _hlabud_ facilitates access to HLA data from IMGT/HLA and AFND, and provides functions for HLA divergence calculations, fine-mapping analysis of amino acid (or nucleotide) positions, and low-dimensional embedding.]
+ content: [
+The human leukocyte antigen (HLA) genes have more associations with human diseases than any other genes, and there are thousands of different HLA alleles in the human population.
+Data for all known HLA genotypes are curated in the international ImMunoGeneTics (IMGT) database, and allele frequencies for each HLA allele across human populations are available in the Allele Frequency Net Database (AFND).
+Our open-source R package _hlabud_ accesses HLA data from IMGT and AFND, and supports further analysis such as HLA divergence calculation, fine-mapping analysis of amino acid (or nucleotide) positions, and low-dimensional embedding.
+ ]
 ),
 (title: "Availability", content: [Source code and documentation are available at *#link("https://github.com/slowkow/hlabud")[github.com/slowkow/hlabud]*]),
 (title: "Contact", content: [#link("mailto:kslowikowski@mgh.harvard.edu")[kslowikowski\@mgh.harvard.edu]])
@@ -55,27 +59,27 @@

 = Introduction

-Human leukocyte antigen (HLA) genes encode the proteins that enable cells to display antigens to other cells, so the immune system can recognize pathogens such as bacteria and viruses.
-Geneticists have identified thousands of variants (e.g. single nucleotide polymorphisms) in the human genome that are associated with hundreds of different diseases and phenotypes @Kennedy2017.
+Human leukocyte antigen (HLA) genes encode the proteins that enable cells to display antigens to other cells, which is one mechanism for immune recognition of pathogens such as bacteria and viruses.
+Geneticists have identified thousands of variants (e.g. single nucleotide polymorphisms) in the human genome that are associated with hundreds of different diseases and phenotypes @Kennedy2017.
 HLA genes have a greater number of disease associations than any other genes.

-HLA nomenclature consists of allele names like _HLA*01:01_ to indicate the genotype of each individual in a study.
-Each allele name corresponds to multiple mutations at different positions throughout the gene's sequence, so it is difficult to estimate the similarity of two alleles solely from the allele names.
-This ambiguity about specific amino acid positions means that allele names are not ideal for statistical analysis.
+HLA nomenclature consists of allele names like _HLA*01:01_ and _HLA*02:01_ to indicate the genotype of an individual in a study @Marsh2010.
+Each allele name corresponds to a haplotype that contains multiple mutations at different positions throughout the entire length of the gene sequence.
+It is difficult to estimate the similarity of two alleles solely from the allele names: any two alleles might differ by one or more nucleotide or amino acid residues.
+Any encoding of genotype data that is ambiguous regarding nucleotide or amino acid positions is not ideal for statistical analysis, because some positions might contain more information than others.

-Researchers have developed software tools for calling HLA genotypes (@diagram) with high accuracy from DNA-seq or RNA-seq next-generation sequencing reads @Claeys2023, so there may be opportunities to use this type of data for HLA association studies.
-Most software tools report allele names, not genotypes at specific nucleotide positions.
-Providers of HLA typing services often report genotypes with the traditional HLA allele names (i.e. _HLA*01:01_) instead of reporting alleles at specific nucleotide positions (@diagram).
+Researchers have developed many software tools for calling HLA genotypes (@diagram) with high accuracy from DNA-seq or RNA-seq next-generation sequencing reads @Claeys2023, so there are opportunities to use this type of data for HLA association studies.
+Providers of HLA typing services often report genotypes with the traditional HLA allele names (i.e. _HLA*01:01_) instead of reporting alleles at specific nucleotide positions (@diagram), and most software tools produce outputs that follow this convention of reporting allele names.

 #figure(
 move(dx:-5%, dy:0pt, image("diagram.png", width: 130%)),
 caption: [_hlabud_ converts HLA genotypes to amino acid position matrices.]
 )

-In contrast to allele-level analysis, fine-mapping analysis associates a phenotype with each amino acid at each position.
+In contrast to allele-level analysis, fine-mapping analysis associates a phenotype with each amino acid (or nucleotide) at each position.
 Many amino acid residues at specific loci have been associated with human diseases and blood protein levels @Krishna2023.
 Published amino acid associations represent opportunities for experimental validation that could advance understanding of the disease-associated mechanisms related to HLA proteins.
-Fine-mapping results can be interpreted in the context of the protein structures that are affected by the associated amino acid positions.
+Results from fine-mapping analysis can be interpreted in the context of the protein structures that are affected by the associated amino acid positions.
 We might have different hypotheses about the function of a mutation in the peptide binding groove than a mutation in the interior region of the protein.

 To facilitate HLA fine-mapping, we developed _hlabud_, a free and open-source R package that downloads data from the IMGT/HLA database @Robinson2020 and automatically creates amino acid (or nucleotide) position matrices that are ready for analysis (@diagram).
@@ -159,14 +163,17 @@ The complete manual is available at #link("https://slowkow.github.io/hlabud"). _
 = Discussion

 Our open-source R package _hlabud_ gives users access to HLA data from two public databases, and implements HLA divergence calculation @Pierini2018.
-_hlabud_ downloads HLA genotype data from the IMGT-HLA GitHub repository @imgthla, caches it in a user-configurable folder, and prepares the data for downstream analysis in R.
+_hlabud_ downloads and caches HLA genotype data from the IMGT-HLA GitHub repository @imgthla and prepares the data for downstream analysis in R.

-We provide #link("https://slowkow.github.io/hlabud", "tutorials") for HLA divergence, fine-mapping association analysis with logistic regression, and embedding with UMAP.
-_hlabud_ provides allele frequencies for all HLA genes, obtained from the Allele Frequency Net Database (AFND) @Gonzalez-Galarza2020.
+We provide #link("https://slowkow.github.io/hlabud", "tutorials") for HLA divergence, fine-mapping association analysis with logistic regression, embedding with UMAP, and visualizing allele frequencies from the Allele Frequency Net Database (AFND) @Gonzalez-Galarza2020.
+
+= Related Work
+
+BIGDAWG is an R package that provides functions for chi-squared Hardy-Weinberg and case-control association tests of highly polymorphic genetic data like HLA genotypes @Pappas2016. HATK is set of Python scripts for processing and analyzing IMGT-HLA data @Choi2020.

 = Acknowledgments

-This work was supported by a NIAID grant T32AR007258 (to K.S.) and the National Institute of Health Director’s New Innovator Award (DP2CA247831; to A.C.V.) Thanks to Sreekar Mantena for reporting issues with the code. Thanks to Jean Fan for the logo and helpful discussions.
+This work was supported by a NIAID grant T32AR007258 (to K.S.) and the National Institute of Health Director’s New Innovator Award (DP2CA247831; to A.C.V.) Thanks to Sreekar Mantena for reporting issues with the code. Thanks to Jean Fan for creating the logo and discussing the paper.

 = Competing Interests

@@ -176,9 +183,4 @@ No competing interest is declared.

 K.S. wrote the software and the manuscript. A.C.V. reviewed the manuscript.

-= Related Work
-
-BIGDAWG is an R package that provides functions for chi-squared Hardy-Weinberg and case-control association tests of highly polymorphic genetic data like HLA genotypes @Pappas2016. HATK is set of Python scripts for processing and analyzing IMGT-HLA data @Choi2020.
-
-
 #bibliography("references.bib")
diff --git a/paper/references.bib b/paper/typst/references.bib
similarity index 90%
rename from paper/references.bib
rename to paper/typst/references.bib
index f36de68..b50df6b 100644
--- a/paper/references.bib
+++ b/paper/typst/references.bib
@@ -202,4 +202,19 @@ @article{Wakeland1990
 year = {1990},
 month = jun,
 pages = {115–122}
-}
\ No newline at end of file
+}
+
+@article{Marsh2010,
+ title = {Nomenclature for factors of the HLA system, 2010},
+ volume = {75},
+ ISSN = {1399-0039},
+ url = {http://dx.doi.org/10.1111/j.1399-0039.2010.01466.x},
+ DOI = {10.1111/j.1399-0039.2010.01466.x},
+ number = {4},
+ journal = {Tissue Antigens},
+ publisher = {Wiley},
+ author = {Marsh, S. G. E. and Albert, E. D. and Bodmer, W. F. and Bontrop, R. E. and Dupont, B. and Erlich, H. A. and Fernández‐Viña, M. and Geraghty, D. E. and Holdsworth, R. and Hurley, C. K. and Lau, M. and Lee, K. W. and Mach, B. and Maiers, M. and Mayr, W. R. and M\"{u}ller, C. R. and Parham, P. and Petersdorf, E. W. and Sasazuki, T. and Strominger, J. L. and Svejgaard, A. and Terasaki, P. I. and Tiercy, J. M. and Trowsdale, J.},
+ year = {2010},
+ month = mar,
+ pages = {291–455}
+}