README.md

This repository contains all supplementary information for analyses reported in Kiepas et al. (2024) describing inconsistencies between taxonomies inferred using 16S and whole-genome identities in Streptomyces.

This repository is provided to enable both reproduction and independent exploration of the analysis reported in this manuscript.

Reporting Problems

Please report any issues or problems with this repository at the Issues page.

Contributors

This manuscript has the following contributors:

Angelika B. Kiepas - PhD Candidate, University of Strathclyde
Dr Leighton Pritchard - Strathclyde Chancellor's Fellow, University of Strathclyde
Prof Paul A. Hoskisson - Professor, University of Strathclyde

Contact Us

How to reach us:

Angelika B Kiepas:
- Email: [email protected]
- X: @kiepczi
Leighton Pritchard:
- Email: [email protected]
- X: @widdowquinn
Paul A Hoskisson:
- Email: [email protected]
- X: @PaulHoskisson

Downloading Repository

If you wish to independently explore, reproduce and/or validate the analyses reported in the manuscript, you can use git to clone this repository to your machine.

git clone https://github.com/sipbs-compbiol/Kiepas_et_al_2024_16S.git

Alternatively, click here to download the current state of this repository as a .zip file, then expand it in the usual way for your operating system, then change directory to the repository root.

cd Kiepas_et_al_2024_16S

Set Up

We strongly recommend creating a conda environment specifically for this activity. For example, if you have cloned or downloaded the repository and navigated to its root directory, the commands below should set up an appropriate environment:

conda create --name streptomyces python=3.8 -y
conda activate streptomyces
conda install --file requirements.txt -y

You will also need to install the following software within the environment, following the installation instructions appropriate for each program:

Due to repository size limits at GitHub, we are unable to provide the complete set of 16S sequences and genomes used in this manuscript in this repository. These FASTA and GenBank files are available on Zenodo at , and should be placed in the appropriate directories. The NCBI reference taxonomy is also available from Zenodo.

The 16S sequence data used in this study are also available from Greengenes v13.5, SILVA v138.1, RDP v11.5 and the NCBI under BioProject PRJNA33175. However, more recent downloads may not contain exactly the same sequences as used in the manuscript. Once the sequences are downloaded from Zenodo, please unzip all files, and place the content of supplementary_file_2_raw_16S_databases into supplementary_file_2.
The genomes used in this study can be downloaded using the download_genomes.sh bash script provided in supplementary file 17. To ensure reproducibility, we strongly recommend verifying the accession IDs provided in supplementary file 2 and supplementary file 17 to confirm that the datasets match, and particularly that no extra sequences have been included in the download, before conducting analyses. Once the genome sequences are downloaded from Zenodo, please unzip all files, and place the content of supplementary_file_17_NCBI_streptomyces_genomes into supplementary_file_17/data.
NCBI reference taxonomy was downloaded from https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip. As this is a live document, for the sake of reproducibility we encourage you to download the file from Zenodo, and place the content of supplementary_file_6_NCBI_taxonomy into supplementary_file_6.
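
The accession check recommended above can be sketched in a few lines of Python. This is a minimal illustration rather than a script from this repository: the function names and the accession IDs are hypothetical, and in practice the two lists would be read from the accession .txt files and from the names of the downloaded files.

```python
def read_accessions(lines):
    """Return the set of accession IDs from an iterable of text lines.

    Blank lines and lines starting with '#' are ignored.
    """
    stripped = (line.strip() for line in lines)
    return {line for line in stripped if line and not line.startswith("#")}


def compare_accessions(expected, downloaded):
    """Report accessions missing from, or unexpectedly present in, a download."""
    expected, downloaded = set(expected), set(downloaded)
    return {
        "missing": sorted(expected - downloaded),
        "extra": sorted(downloaded - expected),
    }


# Hypothetical accession IDs, for illustration only.
expected = read_accessions(["GCF_000000001.1\n", "GCF_000000002.1\n", "\n"])
downloaded = read_accessions(["GCF_000000002.1\n", "GCF_000000003.1\n"])

report = compare_accessions(expected, downloaded)
print(report)
```

If both lists in the report are empty, the download matches the expected set of accessions.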

Repository Files

Here you can find a list of all supplementary files provided in this repository. The current set of subfolders includes:

Supplementary file 1: Generate figures using Python and R. Directory containing all data, Python and R scripts to generate figures for this manuscript. (93MB)

Supplementary file 2: Raw 16S rRNA public databases. Directory containing four separate .txt files with sequence IDs for public 16S rRNA databases used in this manuscript, and an additional .txt file with Greengenes sequence taxonomy information, and a python script used to map taxonomy information to sequences found in Greengenes v13.5. (82.2MB)

Supplementary file 3: Filtration of 16S rRNA public databases. Directory containing python script used for filtration of the raw databases, and generated outputs. (84.2MB)

Supplementary file 4: Cleaning of the filtered 16S rRNA local databases. Directory containing all bash and Python scripts used to clean the full-length 16S rRNA local databases by removing redundant and poor quality 16S rRNA sequences. (109.8MB)

Supplementary file 5: Sequence Clustering. Directory containing a bash script used to cluster the cleaned full-length 16S rRNA Streptomyces local databases at various thresholds, and .txt files with accessions for representative sequences and cluster members for each clustering threshold. (471.7MB)

Supplementary file 6: Analysis of taxonomic composition for each clustering threshold. Directory containing Python scripts, NCBI taxonomy input and all outputs generated to determine the taxonomic composition for each clustering threshold. (52.7MB)

Supplementary file 7: Cluster sizes. Empirical cumulative plot showing cluster size generated for all clustering thresholds. (PDF 44KB)

Supplementary file 8: Cluster taxID abundance. Empirical cumulative plot for unique number of taxID present for all clustering thresholds. (PDF 9KB)

Supplementary file 9: MSA. Directory containing all python and bash scripts, and additional data needed to generate and clean MSA for phylogenetic analysis. (69.2MB)

Supplementary file 10: Phylogenetic reconstruction. Directory containing bash scripts used for phylogenetic reconstruction, and all generated outputs and log files. (76MB)

Supplementary file 11: Collapse branches. Directory containing a Jupyter notebook used for collapsing branches with the same species names, and the collapsed tree in Newick format. (3.5MB)

Supplementary file 12: Phylogenetic tree. PDF file showing collapsed phylogenetic tree with marked branches with transfer bootstrap expectation support of >= 50%. (PDF 224KB)

Supplementary file 13: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of Streptomyces albus and Streptomyces griseus. (PDF 229KB)

Supplementary file 14: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of Streptomyces albulus, Streptomyces lydicus and Streptomyces venezuelae. (PDF 228KB)

Supplementary file 15: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of Streptomyces clavuligerus and Streptomyces coelicolor. (PDF 227KB)

Supplementary file 16: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of Streptomyces lavendulae, Streptomyces rimosus and Streptomyces scabiei. (PDF 228KB)

Supplementary file 17: Streptomyces genomes. Directory containing bash scripts used to download Streptomyces genomes, and Python scripts used to check assembly status. The directory also contains two separate .txt files with Streptomyces genomes used in this manuscript: one file with all initial candidates, and a second file with replaced genomes. (20.2MB)

Supplementary file 18: Extraction of full-length and ambiguity free 16S rRNA sequences from Streptomyces genomes. Directory containing all Python and bash scripts used to extract full-length sequences from the filtered Streptomyces genomes. A single FASTA file with all extracted 16S rRNA sequences, and a single FASTA file with filtered sequences. A .txt file with accession of genomes retained in the analysis. (28.2MB)

Supplementary file 19: ANI analysis among Streptomyces genomes with identical 16S rRNA sequences. Directory containing all bash and Python scripts used to determine taxonomic boundaries among Streptomyces genomes sharing identical full-length 16S rRNA sequences. All output and pyANI log files. (52.8MB)

Supplementary file 20: Network analysis of genomes based on shared 16S sequences. Directory containing a Jupyter notebook with NetworkX analysis and all associated output files, including a bash script for pyANI analysis runs on all connected components and all associated matrices, heatmaps and log files. (106.9MB)

Supplementary file 21: Interactive network graph. HTML file containing interactive network graph of genomes sharing common full-length 16S sequences with each node colour corresponding to the number of connections/degrees. (HTML 4.7MB)

Supplementary file 22: Interactive network graph. HTML file containing interactive network graph of genomes sharing common full-length 16S sequences showing clique (blue) and non-clique (green) components. (HTML 4.7MB)

Supplementary file 23: Interactive network graph. HTML file containing interactive network graph of genomes sharing common full-length 16S sequences showing number of unique genera within each connected component. Each candidate genus is represented as a single node colour within a connected component. (HTML 4.7MB)

Supplementary file 24: Interactive network graph. HTML file containing interactive network graph of genomes sharing common full-length 16S rRNA sequences showing number of unique species within each connected component. Each candidate species is represented as a single node colour within a connected component. (HTML 4.7MB)

Supplementary file 25: Interactive network graph. HTML file containing interactive network graph of genomes sharing common full-length 16S rRNA sequences showing number of unique NCBI names within each connected component. Each NCBI assigned name is represented as a single node colour within a connected component. Gray nodes represent genomes currently lacking assigned species names. (HTML 4.7MB)

Supplementary file 26: Intragenomic 16S rRNA heterogeneity within 1,369 Streptomyces genomes which exclusively contain only full-length and ambiguity symbol-free 16S rRNA sequences. A total of 811 genomes containing single 16S rRNA sequences are not shown. (PDF 8KB)

Supplementary file 27: Distribution of 16S copies per genome with a distinction between unique and total copies for genomes at assembly level complete and chromosome. (PDF 7KB)

Supplementary file 28: Schematic workflow for construction of the full-length 16S rRNA Streptomyces phylogeny. Each arrow represents a process and is annotated with script used and corresponding supplementary file. Output/data files, and the number of remaining sequences after each step, are indicated by rectangles. The green shading represents a single processing step of collecting and collating 16S database sequences. (PDF 91KB)

Supplementary file 29: Schematic representation of the pipeline used to filter publicly available Streptomyces genomes. (PDF 59KB)

Supplementary file 30: Sankey plot showing counts of taxonomic names in source databases, assigned at ranks from phylum to genus, to sequences identified with a key word ‘Streptomyces’ in the taxonomy field. Note that Actinobacteria and Actinobacteriota are synonyms in LPSN for the correct Phylum name Actinomycetota, but that Actinomycetales and Streptomycetales are not taxonomic synonyms for each other. Streptomycetales is synonymous in LPSN with the correct name Kitasatosporales; Actinomycetales is a distinct taxonomic Order. The parent order of the Family Streptomycetaceae in LPSN is Kitasatosporales. (PDF 64KB)

Supplementary file 31: Rectangular phylogram of the comprehensive maximum-likelihood tree of the genus Streptomyces based on the 16S sequence diversity of all 5,064 full-length 16S rRNA sequences with 100 TBE values. (PDF 194KB)

Supplementary file 32: Genomes sharing identical 16S rRNA sequences are assigned different names in NCBI. A total of 1,030 singleton clusters are not shown. (PDF 8KB)

Supplementary file 33: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of members of the novel Actinacidiphila genus. (PDF 228KB)

Supplementary file 34: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of members of the novel Phaeacidiphilus genus. (PDF 228KB)

Supplementary file 35: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of members of the novel Mangrovactinospora genus. (PDF 228KB)

Supplementary file 36: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of members of the novel Wenjunlia genus. (PDF 228KB)

Supplementary file 37: Phylogenetic tree. PDF file showing collapsed phylogenetic tree showing distribution of members of the novel Streptantibioticus genus. (PDF 228KB)

Reproducing analyses (QUICK Guide)

Analysis of 16S sequences from SILVA, Greengenes, RDP and NCBI

To reproduce the analyses and phylogenetic tree using 16S sequences downloaded from SILVA, Greengenes, RDP and NCBI, please run the following scripts in this order:

gg_map_taxonomy.py - (supplementary file 2) assign taxonomy to Greengenes sequences
get_complete_strep_seq.py - (supplementary file 3) extract full-length (1200bp or more) Streptomyces 16S sequences, and standardise base coding to thymine rather than uracil
check_nomenclature_hierarchy.py - (supplementary file 3; OPTIONAL) generate data providing nomenclature at ranks from phylum to genus
remove_redundancy.sh - (supplementary file 4) remove redundant sequences
get_ambiguity_stats_info.py - (supplementary file 4; OPTIONAL) calculate the variance and mean count of ambiguity symbols per sequence to choose an appropriate model
get_paraeters_for_qnbinom.py - (supplementary file 4; OPTIONAL) get parameters for the qnbinom() function in R
qnbinom.R - (supplementary file 4; OPTIONAL) determine the threshold for excluding sequences that contain an excessive number of ambiguity symbols
remove_ambiguity.py - (supplementary file 4) remove sequences with more than 153 ambiguity symbols
remove_chiemras.bash - (supplementary file 4) remove chimeric sequences
cluster_sequences.sh - (supplementary file 5) cluster sequences at thresholds between 98% and 100% in steps of 0.1%
get_LSPN_status_and_NCBI_taxID.py - (supplementary file 6) validate nomenclature at species level (LPSN and NCBI taxID)
cluster_composiion_analysis.py - (supplementary file 6) investigate taxonomic composition for each cluster generated at each clustering threshold
add_outgroup.py - (supplementary file 9) add outgroups
align_seq_nextalign.sh - (supplementary file 9) align sequences
trim_alignment.sh - (supplementary file 9) trim the alignment
alignment_dereplication.sh - (supplementary file 9) dereplicate the alignment
raxml_step.sh - (supplementary file 10) calculate the ML tree
raxml_bootstrap.sh - (supplementary file 10) calculate 100 bootstraps
raxml_tbe.sh - (supplementary file 10) get TBE values
collapse_branches_AK.ipynb - (supplementary file 11) collapse branches with the same names
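
As a rough illustration of the sequence-level criteria named in the list (full-length meaning 1200bp or more, thymine rather than uracil coding, and no more than 153 ambiguity symbols), the Python sketch below applies them to in-memory sequences. It is not the code of get_complete_strep_seq.py or remove_ambiguity.py: the function names are hypothetical, the two steps are compressed into one check, and the set of IUPAC ambiguity codes counted here is an assumption about what the scripts treat as ambiguous.

```python
# IUPAC nucleotide ambiguity codes (assumed definition of "ambiguity symbol").
AMBIGUITY_SYMBOLS = set("RYSWKMBDHVN")


def normalise(seq):
    """Upper-case a sequence and recode uracil (U) as thymine (T)."""
    return seq.upper().replace("U", "T")


def count_ambiguity(seq):
    """Count IUPAC ambiguity symbols in a normalised sequence."""
    return sum(1 for base in seq if base in AMBIGUITY_SYMBOLS)


def keep_sequence(seq, min_length=1200, max_ambiguity=153):
    """Return True if a sequence is full-length and not overly ambiguous."""
    seq = normalise(seq)
    return len(seq) >= min_length and count_ambiguity(seq) <= max_ambiguity


# Toy records, for illustration only.
records = {
    "full_length_rna": "ACGU" * 300,             # 1200 bases, uracil coded
    "too_short": "ACGT" * 299,                   # 1196 bases
    "too_ambiguous": "N" * 154 + "ACGT" * 300,   # 154 ambiguity symbols
}

kept = [name for name, seq in records.items() if keep_sequence(seq)]
print(kept)
```

The thresholds are exposed as keyword arguments so that a different cut-off, for example one derived from qnbinom.R, can be tried without editing the function.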

Analysis of 16S sequences from Streptomyces genomes

To reproduce the analyses using 16S sequences extracted from Streptomyces genomes, please run the following scripts in this order:

download_genomes.sh - (supplementary file 17) download Streptomyces genomes. NOTE: To ensure reproducibility, we strongly recommend verifying the accession IDs provided in supplementary file 17 to confirm that no extra sequences have been included or excluded before conducting analyses.
check_genome_status.py - (supplementary file 17) check assembly status. HERE: suppressed genomes were excluded, and replaced genomes were manually downloaded from NCBI and provided in supplementary file 17
extract_16S.py - (supplementary file 18) extract 16S sequences from downloaded genomes
filter_16S_seq.py - (supplementary file 18) filter 16S sequences to retain genomes that contain only full-length and ambiguity symbol-free 16S sequences
align_sequences_with_nextalign.sh - (supplementary file 18) align the retained sequences with nextalign
trin_alignment.sh - (supplementary file 18) trim alignments with trimAl
get_input_genomes_for_pyani.py - (supplementary file 19) get genome clusters, i.e. genomes that share identical 16S sequences
pyani_analysis.sh - (supplementary file 19) run pyANI analysis on the generated clusters
get_pyani_hetamaps.sh - (supplementary file 19; OPTIONAL) get pyANI heatmaps
get_pyani_comparision_matrix.sh - (supplementary file 20; OPTIONAL) get pyANI matrices for each comparison
generate_viz_data_pyani_comparisions.py - (supplementary file 19; OPTIONAL) generate a dataframe for pyANI comparisons that can later be used to generate scatterplots
genome_16S_NetworkX.ipynb - (supplementary file 20) generate a network to visually represent connections between all 1,369 genomes that contain only full-length and ambiguity symbol-free 16S rRNA sequences
pyani_analysis.sh - (supplementary file 20) determine taxonomic boundaries for all genomes found in the same connected component (i.e. connected by sharing 16S sequences)
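
The idea of a connected component, as used in the last two steps, can be sketched in plain Python: two genomes are linked if they share at least one identical 16S sequence, and a component is every genome reachable through such links. The repository's notebook uses NetworkX for this; the version below is a dependency-free illustration with hypothetical genome names and toy sequences, not the authors' implementation.

```python
from collections import defaultdict


def connected_components(genome_to_seqs):
    """Group genomes linked, directly or indirectly, by shared sequences.

    genome_to_seqs maps a genome name to the set of its 16S sequences.
    Returns a list of components, each a sorted list of genome names.
    """
    # Index genomes by the sequences they contain.
    seq_to_genomes = defaultdict(set)
    for genome, seqs in genome_to_seqs.items():
        for seq in seqs:
            seq_to_genomes[seq].add(genome)

    # Two genomes are neighbours if they share at least one sequence.
    neighbours = defaultdict(set)
    for genomes in seq_to_genomes.values():
        for genome in genomes:
            neighbours[genome] |= genomes - {genome}

    # Depth-first traversal to collect each component once.
    seen = set()
    components = []
    for genome in sorted(genome_to_seqs):
        if genome in seen:
            continue
        component = set()
        stack = [genome]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Toy data, for illustration only.
genomes = {
    "genome_1": {"AAA", "CCC"},
    "genome_2": {"CCC"},
    "genome_3": {"GGG"},
    "genome_4": {"GGG", "TTT"},
    "genome_5": {"ACG"},
}

components = connected_components(genomes)
print(components)
```

Components with a single genome correspond to genomes that share no 16S sequence with any other genome in the dataset.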

README.md - Kiepas_et_al_2024_16S: 16S taxonomy and clustering is not a proxy for taxonomy in Streptomyces
