Tools to find and manipulate the output of pathogensurveillance

grunwaldlab/PathoSurveilR

PathoSurveilR: an R package for analysis of the pathogensurveillance pipeline

PathoSurveilR is an R package with functions that can read, summarize, plot, and manipulate data produced by the pipeline pathogensurveillance.

Installation

Although PathoSurveilR is not yet on CRAN, you can install the development version from the source code on GitHub:

install.packages("devtools")
devtools::install_github("grunwaldlab/PathoSurveilR")
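
If you would rather install only when the package is missing, the commands above can be wrapped in a guard. This is a minimal sketch: the helper name `ensure_pathosurveilr` is hypothetical and not part of PathoSurveilR, and the call is left commented out because it needs internet access.

```r
# Hypothetical helper (not part of PathoSurveilR): install only when missing.
ensure_pathosurveilr <- function() {
  if (requireNamespace("PathoSurveilR", quietly = TRUE)) {
    return(invisible(TRUE))  # already installed, nothing to do
  }
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github("grunwaldlab/PathoSurveilR")
  # Report whether the package is available after the install attempt
  invisible(requireNamespace("PathoSurveilR", quietly = TRUE))
}

# ensure_pathosurveilr()  # uncomment to run; requires internet access
```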

Introduction

Most functions in PathoSurveilR accept input in the same way. Given one or more directory paths, a function searches those directories for pathogensurveillance output directories and finds the files it needs there. For example, the package includes an example pathogensurveillance output directory, whose path on your computer can be found like so:

library(PathoSurveilR)
path <- system.file('extdata/ps_output', package = 'PathoSurveilR')
print(path)
## [1] "/home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/PathoSurveilR/extdata/ps_output"
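
Before passing the path on, it can help to confirm that the example data was actually found. The sketch below relies only on base R: `system.file()` returns an empty string when the package or file cannot be located, so the check works whether or not PathoSurveilR is installed.

```r
# system.file() returns "" when the package or file cannot be found,
# so an empty string means the example output directory is unavailable.
path <- system.file('extdata/ps_output', package = 'PathoSurveilR')

if (nzchar(path)) {
  # Show the first few top-level entries of the example output directory
  print(head(list.files(path)))
} else {
  message("Example data not found; is PathoSurveilR installed?")
}
```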

This path can then be used as the only required input for nearly every function in PathoSurveilR. For example, here is how to get the multigene phylogeny plot, which contains core gene phylogenies for prokaryotes and BUSCO phylogenies for eukaryotes:

multigene_tree_plot(path)[[2]]  # This function returns a list of plots, so [[2]] selects the second plot

And here is how to get the best matches for each sample using an estimate of ANI:

estimated_ani_match_table(path)
| Sample | Closest reference | Reference ANI (%) | Closest sample | Sample ANI (%) |
|--------|-------------------|-------------------|----------------|----------------|
| LM1 | Leuconostoc mesenteroides subsp. mesenteroides ATCC 8293 | 0.9962 | LF5 | 0.8109 |
| OT1 | GCF_000214015.3 | 0.9816 | OT3 | 0.9834 |
| VC2 | Vibrio cholerae ATCC 14035 | 0.9939 | VC1 | 0.9844 |
| LR1 | Limosilactobacillus reuteri subsp. reuteri JCM 1112 | 1.000 | LR2 | 0.9696 |
| PF3 | GCF_000002765.6 | 0.9948 | PF1 | 0.9899 |
| LW1 | Listeria welshimeri | 0.9920 | LR2 | 0 |
| PF2 | GCF_000002765.6 | 0.9856 | PF3 | 0.9834 |
| OT2 | GCF_000214015.3 | 0.9802 | OT3 | 0.9995 |
| LF3 | Limosilactobacillus fermentum | 0.9805 | LF2 | 0.9789 |
| FF1 | Streptococcus pneumoniae | 0.9898 | LF3 | 0.8200 |
| VC3 | Vibrio cholerae ATCC 14035 | 0.9843 | VC1 | 0.9844 |
| LF2 | Limosilactobacillus fermentum | 0.9923 | LF1 | 0.9998 |
| LF5 | Limosilactobacillus fermentum | 0.9825 | LF4 | 0.9894 |
| LF4 | Limosilactobacillus fermentum | 0.9816 | LF5 | 0.9894 |
| VC1 | Vibrio cholerae ATCC 14035 | 0.9845 | VC3 | 0.9844 |
| OT3 | GCF_000214015.3 | 0.9801 | OT2 | 0.9995 |
| PF1 | GCF_000002765.6 | 0.9925 | PF3 | 0.9899 |
| LF1 | Limosilactobacillus fermentum | 0.9923 | LF2 | 0.9998 |
| LR2 | Limosilactobacillus reuteri subsp. reuteri JCM 1112 | 0.9697 | LR1 | 0.9696 |
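
A table like this is easy to screen for samples that deserve a closer look. The sketch below uses four rows copied from the table above in a plain data frame. The column names and the 0.98 cutoff are assumptions made for illustration; they are not the names returned by `estimated_ani_match_table()` or a threshold recommended by the package.

```r
# Rows copied from the table above; column names are assumptions for this sketch.
matches <- data.frame(
  sample         = c("LM1", "OT1", "LW1", "LR2"),
  reference_ani  = c(0.9962, 0.9816, 0.9920, 0.9697),
  closest_sample = c("LF5", "OT3", "LR2", "LR1"),
  sample_ani     = c(0.8109, 0.9834, 0, 0.9696),
  stringsAsFactors = FALSE
)

threshold <- 0.98  # illustrative cutoff, not a package recommendation

# Samples whose closest reference falls below the cutoff
low_ref <- matches[matches$reference_ani < threshold, ]
print(low_ref)

# Samples whose ANI to every other sample is reported as 0
no_sample_match <- matches$sample[matches$sample_ani == 0]
print(no_sample_match)
```

With real data, the same filtering would be applied to the object returned by `estimated_ani_match_table(path)` after checking its actual column names with `colnames()`.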

Many functions also have an option for interactive output for use in HTML documents. Since this is a markdown document (README.md), interactive plots will not work, but here is a screenshot of an interactive plot showing the taxonomic distribution of sendsketch hits:

sendsketch_taxonomy_plot(path, interactive = TRUE)

You can also get lower-level information from the pipeline results to do custom analyses. For example, functions ending with _path or _path_data give you the paths of various types of pathogensurveillance outputs, returning vectors or tibbles of paths, respectively:

estimated_ani_matrix_path(path)
## [1] "/home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/PathoSurveilR/extdata/ps_output/sourmash_ani_matrix.csv"
core_tree_path_data(path)
## # A tibble: 2 × 3
##   report_group_id path                                                cluster_id
##   <chr>           <chr>                                               <chr>     
## 1 all             /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/ps… 1         
## 2 all             /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/ps… 2
sendsketch_path_data(path)
## # A tibble: 19 × 3
##   report_group_id path                                                 sample_id
##   <chr>           <chr>                                                <chr>    
## 1 all             /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/psm… FF1      
## 2 all             /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/psm… LF1      
## 3 all             /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/psm… LF2      
## 4 all             /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/psm… LF3      
## 5 all             /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/psm… LF4      
## # ℹ 14 more rows
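
The tibbles returned by the _path_data functions can be subset like any data frame to pick out files for particular samples. The following sketch uses a stand-in data frame so that it runs without the package. The column names come from the output above, but the file paths are hypothetical placeholders.

```r
# Stand-in for the tibble returned by sendsketch_path_data(path).
# Column names match the output above; the paths are hypothetical placeholders.
path_data <- data.frame(
  report_group_id = "all",
  path = file.path(tempdir(), c("FF1.txt", "LF1.txt", "LF2.txt")),
  sample_id = c("FF1", "LF1", "LF2"),
  stringsAsFactors = FALSE
)

# Keep only the paths for the samples of interest
wanted <- c("LF1", "LF2")
subset_paths <- path_data$path[path_data$sample_id %in% wanted]

# Check that each file exists before trying to read it
exists_flag <- file.exists(subset_paths)
print(data.frame(file = basename(subset_paths), exists = exists_flag))
```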

You can also get parsed versions of all of these pathogensurveillance outputs using functions ending in _parsed:

sendsketch_parsed(path)
## # A tibble: 380 × 39
##   sample_id report_group_id  WKID KID     ANI SSU   SSULen Complt Contam Contam2
##   <chr>     <chr>           <dbl> <chr> <dbl> <chr>  <dbl>  <dbl> <chr>  <chr>  
## 1 FF1       all              96.3 65.6…  99.9 .          0  100   3.69%  0.04%  
## 2 FF1       all              31.4 21.4…  95.9 .          0   76.0 47.96% 0.04%  
## 3 FF1       all              20.3 14.0…  94.4 .          0   67.1 53.39% 0.04%  
## 4 FF1       all              20.7 13.7…  94.5 .          0   61.8 58.96% 0.12%  
## 5 FF1       all              15   10.4…  93.4 .          0   61.2 57.56% 0.04%  
## # ℹ 375 more rows
## # ℹ 29 more variables: uContam <chr>, Score <dbl>, `E-Val` <dbl>, Depth <dbl>,
## #   Depth2 <dbl>, Volume <dbl>, RefHits <dbl>, Matches <dbl>, Unique <dbl>,
## #   Unique2 <dbl>, Unique3 <dbl>, noHit <dbl>, Length <dbl>, TaxID <dbl>,
## #   ImgID <dbl>, gBases <chr>, gKmers <chr>, gSize <chr>, gSeqs <dbl>,
## #   GC <dbl>, rDiv <dbl>, qDiv <dbl>, rSize <dbl>, qSize <dbl>, cHits <dbl>,
## #   taxName <chr>, file <chr>, seqName <chr>, taxonomy <chr>
core_tree_parsed(path)
## $`/home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/PathoSurveilR/extdata/ps_output/core_gene_trees/all_cluster_1.treefile`
## 
## Phylogenetic tree with 5 tips and 4 internal nodes.
## 
## Tip labels:
##   GCF_019703835_1, GCF_000621645_1, VC2, VC3, VC1
## Node labels:
##   Root, 97, 100, 
## 
## Rooted; includes branch length(s).
## 
## $`/home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/PathoSurveilR/extdata/ps_output/core_gene_trees/all_cluster_2.treefile`
## 
## Phylogenetic tree with 21 tips and 20 internal nodes.
## 
## Tip labels:
##   GCF_900187225_1, GCF_900187315_1, LW1, GCF_001832905_1, FF1, GCF_001457635_1, ...
## Node labels:
##   Root, 100, 100, 100, 100, 100, ...
## 
## Rooted; includes branch length(s).
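
Parsed tables can be summarized with ordinary R tools. As a rough sketch, the code below keeps the highest-ANI sendsketch hit for each sample. It uses a stand-in data frame holding the five rows and four of the 39 columns printed above; with the package installed, `hits` would instead come from `sendsketch_parsed(path)`.

```r
# Stand-in for sendsketch_parsed(path): values copied from the output above.
hits <- data.frame(
  sample_id = rep("FF1", 5),
  WKID      = c(96.3, 31.4, 20.3, 20.7, 15),
  ANI       = c(99.9, 95.9, 94.4, 94.5, 93.4),
  Complt    = c(100, 76.0, 67.1, 61.8, 61.2),
  stringsAsFactors = FALSE
)

# Split the table by sample and keep the row with the highest ANI in each group
best_hits <- do.call(rbind, lapply(split(hits, hits$sample_id), function(x) {
  x[which.max(x$ANI), ]
}))
print(best_hits)
```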

Functions that use the same data always start with the same words. If you know what data you want to look at, you can see all the ways that PathoSurveilR can interact with it by typing PathoSurveilR:: followed by the data type name in an IDE like RStudio and pressing <TAB> to see autocomplete suggestions. For example, PathoSurveilR::estimated_ani_ + <TAB> will show all of these functions:

  • estimated_ani_heatmap
  • estimated_ani_match_table
  • estimated_ani_matrix_path
  • estimated_ani_matrix_path_data
  • estimated_ani_matrix_parsed
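
The same naming convention can be used outside an IDE. The sketch below groups the function names listed above by their suffix using base R pattern matching. The names are typed in by hand so the code runs without the package; the commented line shows how the vector could be obtained from an installed copy.

```r
# Function names taken from the list above
fns <- c("estimated_ani_heatmap", "estimated_ani_match_table",
         "estimated_ani_matrix_path", "estimated_ani_matrix_path_data",
         "estimated_ani_matrix_parsed")

# With the package installed, the same vector could be obtained with:
#   grep("^estimated_ani_", getNamespaceExports("PathoSurveilR"), value = TRUE)

# Group the functions by the suffix that signals what they return
path_fns      <- grep("_path$", fns, value = TRUE)       # vectors of paths
path_data_fns <- grep("_path_data$", fns, value = TRUE)  # tibbles of paths
parsed_fns    <- grep("_parsed$", fns, value = TRUE)     # parsed data

print(list(path = path_fns, path_data = path_data_fns, parsed = parsed_fns))
```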

License

This work is subject to the MIT License.

Credits

The following people contributed to PathoSurveilR: Zachary S.L. Foster, Martha Sudermann, Camilo Parada-Rojas, Logan K. Blair, Fernanda I. Bocardo, Ricardo Alcalá-Briseño, Jeff H. Chang, and Niklaus J. Grünwald.

Funding

This work was supported by grants from USDA ARS (2072-22000-045-000-D) to Niklaus J. Grünwald, USDA NIFA (2021-67021-34433; 2023-67013-39918) to Jeff H. Chang and Niklaus J. Grünwald, as well as USDA ARS NPDRS and FNRI and USDA APHIS to Niklaus J. Grünwald.

Contributions and Support

We welcome suggestions, bug reports, and contributions! Open an issue on this repository to get in contact with us.
