Skip to content

Core genome comparison pipeline by Dr Zhemin Zhou

Notifications You must be signed in to change notification settings

msenghore/core_ope

Repository files navigation

The program "remap_qualifiedASM.pl" in it is used to remap the reads back to scaffolds/contigs and relace all the bases that is in low qualities. 
USAGE: perl /usr/local/bioinf/core_ope/remap_qualifiedASM.pl <scaffold file> <read 1> <read 2> <new file> <times of remapping (use 3)>

The program nucmer_filter.pl is used to compare scaffolds/finished genomes with a reference.
USAGE: perl /usr/local/bioinf/core_ope/nucmer_filter.pl <reference> <scaffold> <mapped sequences>

Here are three extra scripts. 
/usr/local/bioinf/core_ope/core_genome_build.pl
/usr/local/bioinf/core_ope/core_remove_region.pl
/usr/local/bioinf/core_ope/core_map2snp.pl
You have got series of files from previous steps. 
Then merge all the files that you are going to analyze together. Using:
cat <reference> <other files generated by nucmer_filter.pl> ... > <alignment file>
Note: put the reference be the first. 

To remove plasmids use (choosefasta):
choosefasta.pl <fasta> <list> [<remain>] > <output>
To confirm sequence lengths use (fastalength):
Synopsis:
--------
fastalength <path>

1) perl core_genome_build.pl <alignment file> > <core_genome file>
This will generate a file for an alignment of core genome. 
2) Identify repeat regions in reference file.
These included: 1) multiple segments of >50 bp dispersed throughout ATCC 9150 with a BLASTn hit of >94% identity when ATCC 9150 was blasted against itself; 2) VNTRs identified by Tandem Repeat Finder 4.04 [65]; 3) CRISPR1 and CRISPR2 regions. 
Save your results into a text file in the following format:
<region 1> <start site> <end site>
<region 2> <start site> <end site>
3) perl core_remove_region.pl <core_genome file> <repeat file> > <non repetitive core genome file>
4) perl core_map2snp.pl <non repetitive core genome file> > <non repetitive core SNP file>
Then you can get the core SNP results.
PS: That are most of the scripts I have written for the core genome analysis.

1/ If your reference has plasmids and your assemblies has similar plasmids, their alignment will be kept in the mapped file as well. For core genome analysis, you need only chromosome so you need to remove these plasmids and put the chromosomes together. 
2/ To remove "-" and "." in the mapped sequences, you can try using sed command in linux. 
USAGE: sed '/^[^>]/ s/[.-]//g' <input file> > <output file>
if you are going to remove other characters, just put them in between "[" and "]" in the USAGE command. 
However, it will always be better to build blast database based on original assemblies after the remapping process rather than this mapped sequences. These generated files contains only sequences that are similar to the referece. 

To convert SNP table to multifasta alignment:
It is: /usr/local/bioinf/core_ope/core_snp2fas.pl
USAGE: perl /usr/local/bioinf/core_ope/core_snp2fas.pl <snp file> > <aligned fasta file>
Note: use the file generated by map2snp.pl. The generated fasta file can be read in MEGA or Clustalx. You can use these programs to change its format. 
For nucleotide diversity in the tree, you can get it from MEGA. 
nucleotide diversity in specific clades? I think you can do it in MEGA as well, by determining particular groups. 
determine the ancestral node for the different clades
You can do this in either PAML or a program "ACE" in a R package named "APE”.

Tree Building

nohup raxmlHPC -s New_snp_alignment.phy -t RAxML_result.NewwStaphtree -m GTRGAMMA -f e -n Tree_with_branch_length &

Detecting Homoplasy

madikay@Esau:~$ Rscript /usr/local/bioinf/core_ope/RHMM_snp_allocate.R
Rscript RHMM_snp_allocate.R <file.SNP> <file.tree> <branch.rate> <file.rec> <rec.density> 

branch.rate is rate of mutations. #of mutations sites / size of core genome

About

Core genome comparison pipeline by Dr Zhemin Zhou

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages