Commit 24d6ded

Harihara Subrahmaniam Muralidharan committed
Merge branch 'main' of https://github.com/hsmurali/SCRAPT
2 parents 2cdc67f + a23ee86 commit 24d6ded

1 file changed: +19 -0 lines changed

Readme.md

Lines changed: 19 additions & 0 deletions
@@ -75,5 +75,24 @@ optional named arguments:
 [DEFAULT = 1000]
 ```
 
+[![DOI](https://zenodo.org/badge/424442689.svg)](https://zenodo.org/badge/latestdoi/424442689)
+
 ## References
 Ghodsi, M., Liu, B. & Pop, M. DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC Bioinformatics 12, 271 (2011). https://doi.org/10.1186/1471-2105-12-271
+
+If you use ```SCRAPT``` for your research, please cite the article published in Nucleic Acids Research.
+```
+@article{10.1093/nar/gkad158,
+author = {Luan, Tu and Muralidharan, Harihara Subrahmaniam and Alshehri, Marwan and Mittra, Ipsa and Pop, Mihai},
+title = "{SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets}",
+journal = {Nucleic Acids Research},
+year = {2023},
+month = {03},
+abstract = "{16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than popular tools: UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at https://github.com/hsmurali/SCRAPT.}",
+issn = {0305-1048},
+doi = {10.1093/nar/gkad158},
+url = {https://doi.org/10.1093/nar/gkad158},
+note = {gkad158},
+eprint = {https://academic.oup.com/nar/advance-article-pdf/doi/10.1093/nar/gkad158/49515274/gkad158.pdf},
+}
+```
