PathoSurveilR is an R package with functions that can read, summarize, plot,
and manipulate data produced by the pathogensurveillance pipeline.
Although PathoSurveilR is not on CRAN yet, you can install the development
version from the source code on GitHub:
install.packages("devtools")
devtools::install_github("grunwaldlab/PathoSurveilR")
Most functions in the PathoSurveilR package accept input in the same way.
Given one or more directory paths, functions will find the input they need
in any pathogensurveillance output directories inside the given directories.
For example, a pathogensurveillance output directory is included in the
package, and its path on your computer can be found like so:
library(PathoSurveilR)
path <- system.file('extdata/ps_output', package = 'PathoSurveilR')
print(path)
## [1] "/home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/PathoSurveilR/extdata/ps_output"
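If the package or its example data cannot be found, system.file() returns an empty string instead of raising an error, so it can be worth checking the result before passing it on. A minimal sketch using only base R (the message text is illustrative):

```r
# Locate the example output directory bundled with PathoSurveilR.
# system.file() returns "" when the package or the file cannot be found.
path <- system.file("extdata/ps_output", package = "PathoSurveilR")

if (nzchar(path)) {
  message("Example data found at: ", path)
} else {
  message("Example data not found; is PathoSurveilR installed?")
}
```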
This path can then be used as the only required input for nearly every
function in PathoSurveilR. For example, here is how to get the multigene
phylogeny plot, which contains core gene phylogenies for prokaryotes and
BUSCO phylogenies for eukaryotes:
multigene_tree_plot(path)[[2]] # This function returns a list of plots, so [[2]] selects the second plot
And here is how to get the best matches for each sample using an estimate of ANI:
estimated_ani_match_table(path)
Sample | Closest reference | Reference ANI (%) | Closest sample | Sample ANI (%) |
---|---|---|---|---|
LM1 | Leuconostoc mesenteroides subsp. mesenteroides ATCC 8293 | 0.9962 | LF5 | 0.8109 |
OT1 | GCF_000214015.3 | 0.9816 | OT3 | 0.9834 |
VC2 | Vibrio cholerae ATCC 14035 | 0.9939 | VC1 | 0.9844 |
LR1 | Limosilactobacillus reuteri subsp. reuteri JCM 1112 | 1.000 | LR2 | 0.9696 |
PF3 | GCF_000002765.6 | 0.9948 | PF1 | 0.9899 |
LW1 | Listeria welshimeri | 0.9920 | LR2 | 0 |
PF2 | GCF_000002765.6 | 0.9856 | PF3 | 0.9834 |
OT2 | GCF_000214015.3 | 0.9802 | OT3 | 0.9995 |
LF3 | Limosilactobacillus fermentum | 0.9805 | LF2 | 0.9789 |
FF1 | Streptococcus pneumoniae | 0.9898 | LF3 | 0.8200 |
VC3 | Vibrio cholerae ATCC 14035 | 0.9843 | VC1 | 0.9844 |
LF2 | Limosilactobacillus fermentum | 0.9923 | LF1 | 0.9998 |
LF5 | Limosilactobacillus fermentum | 0.9825 | LF4 | 0.9894 |
LF4 | Limosilactobacillus fermentum | 0.9816 | LF5 | 0.9894 |
VC1 | Vibrio cholerae ATCC 14035 | 0.9845 | VC3 | 0.9844 |
OT3 | GCF_000214015.3 | 0.9801 | OT2 | 0.9995 |
PF1 | GCF_000002765.6 | 0.9925 | PF3 | 0.9899 |
LF1 | Limosilactobacillus fermentum | 0.9923 | LF2 | 0.9998 |
LR2 | Limosilactobacillus reuteri subsp. reuteri JCM 1112 | 0.9697 | LR1 | 0.9696 |
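A table like this can be used to spot samples that have no close relative among the other samples. The sketch below is illustrative only: it rebuilds four rows of the table above by hand as a plain data frame, with made-up column names (sample, closest_sample, sample_ani) and an arbitrary 0.95 cutoff, rather than assuming anything about the object that estimated_ani_match_table() actually returns.

```r
# Four rows copied by hand from the table above; column names are hypothetical.
matches <- data.frame(
  sample         = c("LM1", "LW1", "FF1", "LR2"),
  closest_sample = c("LF5", "LR2", "LF3", "LR1"),
  sample_ani     = c(0.8109, 0, 0.8200, 0.9696),
  stringsAsFactors = FALSE
)

# Arbitrary illustrative cutoff, not a value recommended by the package.
ani_cutoff <- 0.95

# Samples whose closest other sample falls below the cutoff.
isolated <- matches[matches$sample_ani < ani_cutoff, ]
print(isolated$sample)
```

Here LM1, LW1, and FF1 are selected, while LR2 is not, because its closest sample (LR1) is above the cutoff.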
Many functions also have an option for interactive output for use in
HTML documents. Since this is a markdown document (README.md),
interactive plots will not work, but here is a screenshot of an
interactive plot showing the taxonomic distribution of sendsketch hits:
sendsketch_taxonomy_plot(path, interactive = TRUE)
You can also get lower-level information from the pipeline results to
do custom analyses. For example, functions ending with _path or
_path_data give you the paths of various types of pathogensurveillance
outputs, returning vectors or tibbles of paths respectively:
estimated_ani_matrix_path(path)
## [1] "/home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/PathoSurveilR/extdata/ps_output/sourmash_ani_matrix.csv"
core_tree_path_data(path)
## # A tibble: 2 × 3
## report_group_id path cluster_id
## <chr> <chr> <chr>
## 1 all /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/ps… 1
## 2 all /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/ps… 2
sendsketch_path_data(path)
## # A tibble: 19 × 3
## report_group_id path sample_id
## <chr> <chr> <chr>
## 1 all /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/psm… FF1
## 2 all /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/psm… LF1
## 3 all /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/psm… LF2
## 4 all /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/psm… LF3
## 5 all /home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/psm… LF4
## # ℹ 14 more rows
You can also get parsed versions of all of these pathogensurveillance
outputs using functions ending in _parsed:
sendsketch_parsed(path)
## # A tibble: 380 × 39
## sample_id report_group_id WKID KID ANI SSU SSULen Complt Contam Contam2
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 FF1 all 96.3 65.6… 99.9 . 0 100 3.69% 0.04%
## 2 FF1 all 31.4 21.4… 95.9 . 0 76.0 47.96% 0.04%
## 3 FF1 all 20.3 14.0… 94.4 . 0 67.1 53.39% 0.04%
## 4 FF1 all 20.7 13.7… 94.5 . 0 61.8 58.96% 0.12%
## 5 FF1 all 15 10.4… 93.4 . 0 61.2 57.56% 0.04%
## # ℹ 375 more rows
## # ℹ 29 more variables: uContam <chr>, Score <dbl>, `E-Val` <dbl>, Depth <dbl>,
## # Depth2 <dbl>, Volume <dbl>, RefHits <dbl>, Matches <dbl>, Unique <dbl>,
## # Unique2 <dbl>, Unique3 <dbl>, noHit <dbl>, Length <dbl>, TaxID <dbl>,
## # ImgID <dbl>, gBases <chr>, gKmers <chr>, gSize <chr>, gSeqs <dbl>,
## # GC <dbl>, rDiv <dbl>, qDiv <dbl>, rSize <dbl>, qSize <dbl>, cHits <dbl>,
## # taxName <chr>, file <chr>, seqName <chr>, taxonomy <chr>
core_tree_parsed(path)
## $`/home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/PathoSurveilR/extdata/ps_output/core_gene_trees/all_cluster_1.treefile`
##
## Phylogenetic tree with 5 tips and 4 internal nodes.
##
## Tip labels:
## GCF_019703835_1, GCF_000621645_1, VC2, VC3, VC1
## Node labels:
## Root, 97, 100,
##
## Rooted; includes branch length(s).
##
## $`/home/fosterz/R/x86_64-pc-linux-gnu-library/4.4/PathoSurveilR/extdata/ps_output/core_gene_trees/all_cluster_2.treefile`
##
## Phylogenetic tree with 21 tips and 20 internal nodes.
##
## Tip labels:
## GCF_900187225_1, GCF_900187315_1, LW1, GCF_001832905_1, FF1, GCF_001457635_1, ...
## Node labels:
## Root, 100, 100, 100, 100, 100, ...
##
## Rooted; includes branch length(s).
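Because the parsed sendsketch output is an ordinary table, it can be summarized with standard R tools. As a rough sketch, the code below keeps the highest-ANI hit for each sample using base R. To stay self-contained it uses a stand-in data frame holding only the five FF1 rows and three of the columns printed above (with values as rounded in that printout); with the package installed, the result of sendsketch_parsed(path) could presumably be used in its place.

```r
# Stand-in for sendsketch_parsed(path): the five rows printed above,
# restricted to three columns, with values as rounded in that printout.
hits <- data.frame(
  sample_id = rep("FF1", 5),
  WKID      = c(96.3, 31.4, 20.3, 20.7, 15),
  ANI       = c(99.9, 95.9, 94.4, 94.5, 93.4),
  stringsAsFactors = FALSE
)

# Sort by sample, then by descending ANI, and keep the first row per sample.
ordered_hits <- hits[order(hits$sample_id, -hits$ANI), ]
best_hits <- ordered_hits[!duplicated(ordered_hits$sample_id), ]
print(best_hits)
```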
Functions that use the same data always start with the same words, so if
you know what data you want to look at, you can see all the ways that
PathoSurveilR can interact with it by typing PathoSurveilR:: followed by
the data type name in an IDE like RStudio and pressing <TAB> to see
autocomplete suggestions. For example, PathoSurveilR::estimated_ani_ + <TAB>
will show all of these functions:
estimated_ani_heatmap
estimated_ani_match_table
estimated_ani_matrix_path
estimated_ani_matrix_path_data
estimated_ani_matrix_parsed
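Outside an IDE, a similar list can be produced programmatically. The helper below, exports_with_prefix(), is a hypothetical convenience function written for this example and is not part of PathoSurveilR; it relies only on base R's getNamespaceExports() and startsWith(). It is demonstrated on the utils package so that the code runs without PathoSurveilR installed.

```r
# Hypothetical helper: list a package's exported names that start with a prefix.
exports_with_prefix <- function(pkg, prefix) {
  exported <- getNamespaceExports(pkg)
  sort(exported[startsWith(exported, prefix)])
}

# Demonstration on a package that ships with R.
read_functions <- exports_with_prefix("utils", "read.")
print(read_functions)

# With PathoSurveilR installed, the same call should list the functions above:
# exports_with_prefix("PathoSurveilR", "estimated_ani_")
```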
This work is subject to the MIT License.
The following people contributed to PathoSurveilR: Zachary S.L. Foster,
Martha Sudermann, Camilo Parada-Rojas, Logan K. Blair, Fernanda I.
Bocardo, Ricardo Alcalá-Briseño, Jeff H. Chang, and Niklaus J. Grünwald.
This work was supported by grants from USDA ARS (2072-22000-045-000-D) to Niklaus J. Grünwald, USDA NIFA (2021-67021-34433; 2023-67013-39918) to Jeff H. Chang and Niklaus J. Grünwald, as well as USDA ARS NPDRS and FNRI and USDA APHIS to Niklaus J. Grünwald.
We welcome suggestions, bug reports, and contributions! Open an issue on this repository to get in contact with us.