-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Welcome to the Neobacillus wiki!
We checked the quality of the reads using FastQC v.0.11.7 (https://github.com/s-andrews/FastQC) and pre-process using fastp v.0.20.0 (https://github.com/OpenGene/fastp)
# FastQC for quality check
fastqc JC-1475_S192_L003_R1_001.fastq.gz JC-1475_S192_L003_R2_001.fastq.gz -o Nd_fastqc_output_path/
# fastp for filtering out low-quality sequences and adapter contamination
fastp -i JC-1475_S192_L003_R1_001.fastq.gz -I JC-1475_S192_L003_R1_001.fastq.gz -o filtered_JC-1475_S192_L003_R1_001.fastq.gz -O filtered_JC-1475_S192_L003_R1_001.fastq.gz --detect_adapter_for_peWe constructed initial Illumina assembly with SPAdes v.3.11.1 (https://github.com/ablab/spades)
spades.py -1 filtered_JC-1475_S192_L003_R1_001.fastq.gz -2 filtered_JC-1475_S192_L003_R1_001.fastq.gz -o Nd_spades_output_pathONT sequence quality control using Porechop v.0.2.4 (https://github.com/rrwick/Porechop) and Nanofilt v.2.3.0 (https://github.com/wdecoster/nanofilt)
# Porechop
porechop -i 179-J_C42_HS_ONT.fastq -o trimmed_179-J_C42_HS_ONT.fastq
# Nanofilt
nanofilt -q 10 -l 200 < trimmed_179-J_C42_HS_ONT.fastq > filtered_179-J_C42_HS_ONT.fastqWe performed Hybrid assembly using Unicycler v.0.5.0 (https://github.com/rrwick/Unicycler) combining Illumina and ONT reads
# Hybrid assembly using Unicycler
unicycler -1 filtered_JC-1475_S192_L003_R1_001.fastq.gz -2 filtered_JC-1475_S192_L003_R1_001.fastq.gz -l filtered_179-J_C42_HS_ONT.fastq -o Nd_unicycler_output_pathWe check the quality of the assembly using QUAST v.5.0.2 (https://github.com/ablab/quast) and CheckM v.1.2.2 (https://github.com/Ecogenomics/CheckM)
# Genome assembly quality check with QUAST
quast.py Nd_unicycler_output_path/Neobacillus_genome_hy.fasta -o quast_output
# Genome completeness and contamination check with CheckM
checkm lineage_wf d_unicycler_output_path/Neobacillus_genome_hy.fasta checkm_outputWe collected all validly published representative genomes of Neobacillus from NCBI using the command line tool bit (https://github.com/AstrobioMike/bit) and NCBI Entrez-direct
# installation
conda install -c bioconda entrez-direct
conda install -c conda-forge -c bioconda -c defaults -c astrobiomike bit
# esearch to get the list of representative species accession IDs
esearch -query 'Neobacillus[ORGN] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > Neobacillus_repr.txt
# bit to fetch the genomes
bit-dl-ncbi-assemblies -w Neobacillus_repr.txt -f fasta -j 100
# unzipping
gunzip *.gzWe calculated the Average Nucleotide Identity of the novel species with the other type strains using FastANI v.1.34 (https://github.com/ParBLiSS/FastANI)
# get all the paths of the genomes
find ${genome_dir} -iname *.fasta > Neobacillus_Genomes_Paths.txt
# perform all vs all ANI with the genomes
fastANI --rl Neobacillus_Genomes_Paths.txt --ql Neobacillus_Genomes_Paths.txt --threads 50 --matrix -o Neobacillus_Genomes_ANI.tsv
Further, Average Amino acid Identity (AAI) values were computed using aai.rb function from the Enveomics collection toolbox (https://github.com/lmrodriguezr/enveomics)
aai.rb --seq1 <genome> --seq2 <ref_genome> --threads 50 --res AAI_output.txtWe annotated the novel strains using the widely-used command-line annotation tool Prokka (https://github.com/tseemann/prokka)
# prokka command
# From the directory of the .fasta files
for file in *.fasta; do prokka --outdir ../Prokka_Annotation/"$file" --prefix "$file" --cpus 40 "$file"; done
From the annotated genomes, 16S rRNA and DNA gyrase subunit-B (gyrB) genes were extracted for the novel species and the closest type strain, and sequence identity was estimated using Blast v.2.13.0
# 16S rRNA sequence extraction
pattern="16S ribosomal RNA"
IFS=$'\n'
for i in $(find $INPUT_FOLDER -iname "*.ffn")
do
filename=$(basename $i | sed -e 's/ /_/g' | cut -d . -f 1-2)
echo "Getting 16S sequence for: $i"
# Use awk to find and print the longest sequence between headers containing the pattern
awk -v pattern="$pattern" '
BEGIN { RS=">"; FS="\n" }
tolower($1) ~ tolower(pattern) {
sequence = $0
gsub(/\n/, "", sequence) # Remove newline characters
length_seq = length(sequence)
# Update max_length and max_sequence if the current sequence is longer
if (length_seq > max_length) {
max_length = length_seq
max_sequence = ">" $0
}
}
END { printf "%s", max_sequence }' "$i" > "$OUTPUT_FOLDER/${filename}_16S.fasta"
done# gyrB sequence extraction
pattern_gyrb="DNA gyrase subunit B"
echo "Getting gyrB sequences"
IFS=$'\n'
for i in $(find $INPUT_FOLDER -iname "*.ffn")
do
filename=$(basename $i | sed -e 's/ /_/g' | cut -d . -f 1-2)
echo "Getting 16S sequence for: $i"
# Use awk to find and print the longest sequence between headers containing the pattern
awk -v pattern="$pattern_gyrb" '
BEGIN { RS=">"; FS="\n" }
tolower($1) ~ tolower(pattern) {
sequence = $0
gsub(/\n/, "", sequence) # Remove newline characters
length_seq = length(sequence)
# Update max_length and max_sequence if the current sequence is longer
if (length_seq > max_length) {
max_length = length_seq
max_sequence = ">" $0
}
}
END { printf "%s", max_sequence }' "$i" > "$OUTPUT_FOLDER/${filename}_gyrB.fasta"
done# BLAST
blastn -db 16S_ribosomal_RNA -query $i -task blastn -dust no -max_target_seqs 1 -outfmt "10 delim=, qacc sacc evalue bitscore qcovus pident staxid ssciname" -num_threads 40
We utilised GToTree v.1.8.2 (https://github.com/AstrobioMike/GToTree) for preparing the phylogenetic trees using single-copy core genes, followed by IQTREE-2 v.2.2.0.3 (https://github.com/iqtree/iqtree2) to perform 1000 ultrafast bootstrap replicates
# GToTree
GToTree -a Neobacillus_non_novel_accession.txt -f Neobacillus_paths.txt -H Firmicutes -t -L Species,Strain -T IQ-TREE -j 50 -m GToTree_Mapping_Files/Neobacillus_GToTree_Mapping.tsv -o Neobacillus
# IQTREE-2
iqtree2 -s Neobacillus/Aligned_SCGs_mod_names.faa -spp Neobacillus/run_files/Partitions.txt -m txt -m MFP -bb 1000 -nt 40 -pre Neobacillus_IQ_Tree
We identified the Cluster of Orthologous Genes (COGs) using the Prokka annotated genomes. We used python-package cogclassifier https://pypi.org/project/cogclassifier/)
# cogclassfier command
# Once Prokka is over, utilise .faa files for COG identification
for file in ./*/*.faa; do cp "$file" ./../COGs/; done
# From the directory of the .faa files
for files in *.faa; do COGclassifier -i "$files" -o ./COG_Output/"$files"; done
We then predicted, annotated and analysed the secondary metabolite biosynthesis gene clusters in these novel strains using command-line antiSMASH 7.0 (https://antismash.secondarymetabolites.org/#!/about)
# antismash command
# From the directory of the .fasta files
for file in *.fasta; do antismash "$file" --taxon bacteria --cpus 20 --genefinding-tool prodigal --cb-knownclusters --output-dir ./../antiSMASH/"$file" --cc-mibig --fullhmmer; done
Further, we employed the command-line tool Resistance Gene Identifier (RGI) (https://github.com/arpcard/rgi) to predict antibiotic resistome(s) in these novel genomes. We searched these antibiotic resistance genes (ARGs) against the Comprehensive Antibiotic Resistance Database (CARD)(https://card.mcmaster.ca/)
# rgi command
# From the directory of the .fasta files
for file in *.fasta; do rgi main -i "$file" -o ./RGI_CARD/"$file" -t contig -a BLAST -n 50; done
We used fastp v.0.22.0 (https://github.com/OpenGene/fastp) to filter the raw shotgun sequences obtained from NASA cleanrooms
fastp.0.23.4 --in1 ${metagenomeFolder}/$read1 --out1 ${read1}_output.fastq --in2 ${metagenomeFolder}/{read2} --out2 ${read2}_output.fastq -w 16 --json ${metagenomeFolder}_fastp.json --html ${metagenomeFolder}_fastp.htmlThen we utilised MetaCompass v.2.0 (https://github.com/marbl/MetaCompass) to align the filtered reads to newly identified genomes and determine their abundance in the NASA cleanrooms based on mapped reads
python3 ../../MetaCompass/go_metacompass.py -r $genome -1 "$read1" -2 "$read2" -l 150 -t 70 -y 50 -o Metagenomic_Mapping/${name}_${i}