-
Notifications
You must be signed in to change notification settings - Fork 21
Creating a local refseq blast db
Blasting online sequence databases is a way to retrieve orthologs for a protein of interest. These results can be parsed further. However using the remote blast service can be slow. Using a local version is much faster. The example here is for creating a refseq protein db for bacterial genomes.
see http://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/ for further info
Download the /refseq/bacteria/assembly_summary.txt file (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt)
- List the FTP path (column 20) for the assemblies of interest, in this case those that have "Complete Genome" assembly_level (column 12) and "latest" version_status (column 11). One way to do this would be using the following awk command:
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths
- Append the filename of interest, in this case "*_protein.faa.gz" to the FTP directory names. One way to do this would be using the following awk command:
awk 'BEGIN{FS=OFS="/";filesuffix="protein.faa.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths
- Use curl or wget to download the data file for each FTP path in the list, e.g:
wget -i ftpfilepaths
gunzip *.gz
cat *.faa > bac_refseq.fa
Make the local blast database:
makeblastdb -in bac_refseq.fa -out bacterial_refseq -dbtype prot
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary_refseq.txt > ftpdirpaths
awk 'BEGIN{FS=OFS="/";filesuffix="protein.faa.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths
wget -i ftpfilepaths
gunzip *.gz
cat *.faa > bac_refseq.fa
makeblastdb -in bac_refseq.fa -out bacterial_refseq -dbtype prot
The same steps as above except we use the following file:
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt
localdb = 'bacterial_refseq'
blr = analysis.getOrthologs(proteinseq, db=localdb)
alnrows, aln = analysis.alignBlastResults(blr)
The aligned sequences can then be passed to the following method for epitope analysis:
c = analysis.epitopeConservation(seqs, alnrows=alnrows)