Skip to content
Pratyay Sengupta edited this page Sep 26, 2024 · 6 revisions

Welcome to the Neobacillus wiki!

1. Genome assembly

We checked the quality of the reads using FastQC v.0.11.7 (https://github.com/s-andrews/FastQC) and pre-process using fastp v.0.20.0 (https://github.com/OpenGene/fastp)

# FastQC for quality check
fastqc JC-1475_S192_L003_R1_001.fastq.gz JC-1475_S192_L003_R2_001.fastq.gz -o Nd_fastqc_output_path/

# fastp for filtering out low-quality sequences and adapter contamination
fastp -i JC-1475_S192_L003_R1_001.fastq.gz -I JC-1475_S192_L003_R1_001.fastq.gz -o filtered_JC-1475_S192_L003_R1_001.fastq.gz -O filtered_JC-1475_S192_L003_R1_001.fastq.gz --detect_adapter_for_pe

We constructed initial Illumina assembly with SPAdes v.3.11.1 (https://github.com/ablab/spades)

spades.py -1 filtered_JC-1475_S192_L003_R1_001.fastq.gz -2 filtered_JC-1475_S192_L003_R1_001.fastq.gz -o Nd_spades_output_path

ONT sequence quality control using Porechop v.0.2.4 (https://github.com/rrwick/Porechop) and Nanofilt v.2.3.0 (https://github.com/wdecoster/nanofilt)

# Porechop
porechop -i 179-J_C42_HS_ONT.fastq -o trimmed_179-J_C42_HS_ONT.fastq

# Nanofilt
nanofilt -q 10 -l 200 < trimmed_179-J_C42_HS_ONT.fastq > filtered_179-J_C42_HS_ONT.fastq

We performed Hybrid assembly using Unicycler v.0.5.0 (https://github.com/rrwick/Unicycler) combining Illumina and ONT reads

# Hybrid assembly using Unicycler
unicycler -1 filtered_JC-1475_S192_L003_R1_001.fastq.gz -2 filtered_JC-1475_S192_L003_R1_001.fastq.gz -l filtered_179-J_C42_HS_ONT.fastq -o Nd_unicycler_output_path

We check the quality of the assembly using QUAST v.5.0.2 (https://github.com/ablab/quast) and CheckM v.1.2.2 (https://github.com/Ecogenomics/CheckM)

# Genome assembly quality check with QUAST
quast.py Nd_unicycler_output_path/Neobacillus_genome_hy.fasta -o quast_output

# Genome completeness and contamination check with CheckM
checkm lineage_wf d_unicycler_output_path/Neobacillus_genome_hy.fasta checkm_output

2. Overall Genome Relatedness Indices (OGRI) calculation

We collected all validly published representative genomes of Neobacillus from NCBI using the command line tool bit (https://github.com/AstrobioMike/bit) and NCBI Entrez-direct

# installation
conda install -c bioconda entrez-direct 
conda install -c conda-forge -c bioconda -c defaults -c astrobiomike bit 

# esearch to get the list of representative species accession IDs
esearch -query 'Neobacillus[ORGN] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > Neobacillus_repr.txt

# bit to fetch the genomes
bit-dl-ncbi-assemblies -w Neobacillus_repr.txt -f fasta -j 100

# unzipping
gunzip *.gz

We calculated the Average Nucleotide Identity of the novel species with the other type strains using FastANI v.1.34 (https://github.com/ParBLiSS/FastANI)

# get all the paths of the genomes
find ${genome_dir} -iname *.fasta > Neobacillus_Genomes_Paths.txt

# perform all vs all ANI with the genomes
fastANI --rl Neobacillus_Genomes_Paths.txt --ql Neobacillus_Genomes_Paths.txt --threads 50 --matrix -o Neobacillus_Genomes_ANI.tsv

Further, Average Amino acid Identity (AAI) values were computed using aai.rb function from the Enveomics collection toolbox (https://github.com/lmrodriguezr/enveomics)

aai.rb --seq1 <genome> --seq2 <ref_genome> --threads 50 --res AAI_output.txt

We annotated the novel strains using the widely-used command-line annotation tool Prokka (https://github.com/tseemann/prokka)

# prokka command
# From the directory of the .fasta files
for file in *.fasta; do prokka --outdir ../Prokka_Annotation/"$file" --prefix "$file" --cpus 40 "$file"; done

From the annotated genomes, 16S rRNA and DNA gyrase subunit-B (gyrB) genes were extracted for the novel species and the closest type strain, and sequence identity was estimated using Blast v.2.13.0

# 16S rRNA sequence extraction
pattern="16S ribosomal RNA"

IFS=$'\n'
for i in $(find $INPUT_FOLDER -iname "*.ffn")
do
	filename=$(basename $i | sed -e 's/ /_/g' | cut -d . -f 1-2)
	echo "Getting 16S sequence for: $i"
	# Use awk to find and print the longest sequence between headers containing the pattern
	awk -v pattern="$pattern" '
	 BEGIN { RS=">"; FS="\n" } 
	 tolower($1) ~ tolower(pattern) {
	   sequence = $0
	   gsub(/\n/, "", sequence)  # Remove newline characters
	   length_seq = length(sequence)
	   
	   # Update max_length and max_sequence if the current sequence is longer
	   if (length_seq > max_length) {
	     max_length = length_seq
	     max_sequence = ">" $0
	   }
	 }
	 END { printf "%s", max_sequence }' "$i" > "$OUTPUT_FOLDER/${filename}_16S.fasta"
done
# gyrB sequence extraction
pattern_gyrb="DNA gyrase subunit B"

echo "Getting gyrB sequences"

IFS=$'\n'
for i in $(find $INPUT_FOLDER -iname "*.ffn")
do
	filename=$(basename $i | sed -e 's/ /_/g' | cut -d . -f 1-2)
	echo "Getting 16S sequence for: $i"
	# Use awk to find and print the longest sequence between headers containing the pattern
	awk -v pattern="$pattern_gyrb" '
	 BEGIN { RS=">"; FS="\n" } 
	 tolower($1) ~ tolower(pattern) {
	   sequence = $0
	   gsub(/\n/, "", sequence)  # Remove newline characters
	   length_seq = length(sequence)
	   
	   # Update max_length and max_sequence if the current sequence is longer
	   if (length_seq > max_length) {
	     max_length = length_seq
	     max_sequence = ">" $0
	   }
	 }
	 END { printf "%s", max_sequence }' "$i" > "$OUTPUT_FOLDER/${filename}_gyrB.fasta"
done
# BLAST
blastn -db 16S_ribosomal_RNA -query $i -task blastn -dust no -max_target_seqs 1 -outfmt "10 delim=, qacc sacc evalue bitscore qcovus pident staxid ssciname" -num_threads 40

3. WGS-based phylogeny

We utilised GToTree v.1.8.2 (https://github.com/AstrobioMike/GToTree) for preparing the phylogenetic trees using single-copy core genes, followed by IQTREE-2 v.2.2.0.3 (https://github.com/iqtree/iqtree2) to perform 1000 ultrafast bootstrap replicates

# GToTree
GToTree -a Neobacillus_non_novel_accession.txt -f Neobacillus_paths.txt -H Firmicutes -t -L Species,Strain -T IQ-TREE -j 50 -m GToTree_Mapping_Files/Neobacillus_GToTree_Mapping.tsv -o Neobacillus

# IQTREE-2
iqtree2 -s Neobacillus/Aligned_SCGs_mod_names.faa -spp Neobacillus/run_files/Partitions.txt -m txt -m MFP -bb 1000 -nt 40 -pre Neobacillus_IQ_Tree

4. Genome characterisation

We identified the Cluster of Orthologous Genes (COGs) using the Prokka annotated genomes. We used python-package cogclassifier https://pypi.org/project/cogclassifier/)

# cogclassfier command
# Once Prokka is over, utilise .faa files for COG identification
for file in ./*/*.faa; do cp "$file" ./../COGs/; done

# From the directory of the .faa files
for files in *.faa; do COGclassifier -i "$files" -o ./COG_Output/"$files"; done

We then predicted, annotated and analysed the secondary metabolite biosynthesis gene clusters in these novel strains using command-line antiSMASH 7.0 (https://antismash.secondarymetabolites.org/#!/about)

# antismash command
# From the directory of the .fasta files
for file in *.fasta; do antismash "$file" --taxon bacteria --cpus 20 --genefinding-tool prodigal --cb-knownclusters --output-dir ./../antiSMASH/"$file" --cc-mibig --fullhmmer; done

Further, we employed the command-line tool Resistance Gene Identifier (RGI) (https://github.com/arpcard/rgi) to predict antibiotic resistome(s) in these novel genomes. We searched these antibiotic resistance genes (ARGs) against the Comprehensive Antibiotic Resistance Database (CARD)(https://card.mcmaster.ca/)

# rgi command
# From the directory of the .fasta files
for file in *.fasta; do rgi main -i "$file" -o ./RGI_CARD/"$file" -t contig -a BLAST -n 50; done

5. Mapping metagenomics samples to novel isolates

We used fastp v.0.22.0 (https://github.com/OpenGene/fastp) to filter the raw shotgun sequences obtained from NASA cleanrooms

fastp.0.23.4 --in1 ${metagenomeFolder}/$read1 --out1 ${read1}_output.fastq --in2 ${metagenomeFolder}/{read2} --out2 ${read2}_output.fastq -w 16 --json ${metagenomeFolder}_fastp.json --html ${metagenomeFolder}_fastp.html

Then we utilised MetaCompass v.2.0 (https://github.com/marbl/MetaCompass) to align the filtered reads to newly identified genomes and determine their abundance in the NASA cleanrooms based on mapped reads

python3 ../../MetaCompass/go_metacompass.py -r $genome -1 "$read1" -2 "$read2" -l 150 -t 70 -y 50 -o Metagenomic_Mapping/${name}_${i}