VCF2Dis: A new, simple, and efficient software tool to calculate a p-distance matrix and construct a phylogeny from Variant Call Format (VCF) input
-
- Install
New versions are updated and maintained at hewm2008/VCF2Dis; please download the latest version from there.
Run sh make.sh to compile. The executable VCF2Dis can then be found at bin/VCF2Dis.
For Linux/Unix and macOS:
tar -zxvf VCF2DisXXX.tar.gz
cd VCF2DisXXX
sh make.sh
./bin/VCF2Dis
Note: If linking fails, try re-installing the zlib library and copying it into the library directory VCF2Dis-xx/src/include/zlib.
Note: R with the ape, dplyr, and ggtree packages is recommended.
-
- Main parameter description:
Usage: VCF2Dis -InPut <in.vcf> -OutPut <p_dis.mat>
-InPut <str> Input one or muti GATK VCF genotype File
-OutPut <str> OutPut Sample p-Distance matrix
-InList <str> Input GATK muti-chr VCF Path List
-SubPop <str> SubGroup SampleList of VCF File [ALLsample]
-Rand <float> Probability (0-1] for each site to join Calculation [1]
-help Show more help [hewm2008 v1.53s]
For more details, please use -help and see the example
-InFormat <str> Input File is [VCF/FA/PHY] Format,defaut: [VCF]
-InSampleGroup <str> InFile of sample Group info,format(sample groupA)
-TreeMethod <int> Construct Tree Method,1:NJ-tree 2:UPGMA-tree [1]
-KeepMF Keep the Middle File diff & Use matrix
Three examples are provided in the directory example/Example*
-
- Create the p_distance matrix and construct an NJ tree in Newick format
# 1.1) To create the p_distance matrix and Newick tree for all samples in the VCF, run VCF2Dis directly
./bin/VCF2Dis -InPut in.vcf.gz -OutPut p_dis.mat
# ./bin/VCF2Dis -InPut in.fa.gz -OutPut p_dis.mat -InFormat FA
# 1.2) To create the p_distance matrix and Newick tree for a subgroup of samples, put their sample names into the file sample.list
./bin/VCF2Dis -InPut chr1.vcf.gz chr2.vcf.gz -OutPut p_dis.mat -SubPop sample.list
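The names placed in sample.list need to match the sample names in the VCF header. A minimal sketch for listing them is given here, assuming a standard VCF whose #CHROM header line carries the sample names from the tenth column onward. It also assumes that sample.list holds one name per line, which the help text above does not confirm. `vcf_samples` is a hypothetical helper name, not part of VCF2Dis.

```shell
#!/bin/bash
# vcf_samples FILE: print one sample name per line from a plain or gzipped VCF.
# Hypothetical helper for illustration; not part of VCF2Dis.
vcf_samples() {
  local vcf="$1"
  local reader="cat"
  # gzip -cd decompresses to stdout on both Linux and macOS
  case "$vcf" in
    *.gz) reader="gzip -cd" ;;
  esac
  $reader "$vcf" | awk -F'\t' '
    /^#CHROM/ {
      # columns 1-9 are the fixed VCF fields; sample names start at column 10
      for (i = 10; i <= NF; i++) print $i
      exit
    }'
}
```

For example, `vcf_samples in.vcf.gz > all_samples.list` writes every sample name to a file, which can then be trimmed down to the subgroup of interest and passed with -SubPop.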
-
- Simple tree visualization (for advanced tree display and annotation, please refer to iTOL, Evolview, or MEGA)
After running VCF2Dis, you will obtain the tree file p_dis.nwk and the neighbor-joining tree in PDF format, p_dis.pdf.
Note: If you cannot get the p_dis.nwk tree file but do have p_dis.mat, here are 3 methods to get the tree file.
-
- Run multiple times using a method of sampling with replacement. Users can randomly select a subset of the sites [-Rand] and construct a new NJ tree as above, repeating this NN times [NN=100 recommended], with X = 1, 2, ..., NN.
#!/bin/bash
NN=100
if [ "$#" -eq 1 ]; then
NN=$1
fi
for X in $(seq 1 $NN)
do
./bin/VCF2Dis -InPut in.vcf.gz -OutPut p_dis_${X}.mat -Rand 0.25
# PHYLIPNEW-3.69.650/bin/fneighbor -datafile p_dis_${X}.mat -outfile tree.out1_${X}.txt -matrixtype s -treetype n -outtreefile tree.out2_${X}.tre
done
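To make the effect of -Rand more concrete: the help text describes it as the probability for each site to join the calculation. The sketch below illustrates that reading on an uncompressed VCF stream using awk. It is only an illustration of per-site random selection, not the actual VCF2Dis implementation, and both `subsample_sites` and its seed argument are hypothetical.

```shell
#!/bin/bash
# subsample_sites PROB [SEED]: read an uncompressed VCF on stdin, always keep
# header lines (starting with '#'), and keep each site line with probability PROB.
# Hypothetical helper for illustration; not part of VCF2Dis.
subsample_sites() {
  local prob="$1"
  local seed="${2:-1}"
  awk -v p="$prob" -v s="$seed" '
    BEGIN { srand(s) }
    /^#/  { print; next }        # header lines are never dropped
    rand() < p { print }         # rand() is in [0,1), so p=1 keeps every site
  '
}
```

For example, `gzip -cd in.vcf.gz | subsample_sites 0.25 7 | grep -vc '^#'` reports how many sites survive one draw at probability 0.25 with seed 7.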
-
- Merge all the NJ trees, then construct and display a bootstrap NJ tree. (For advanced tree display and annotation, please refer to iTOL, Evolview, and MEGA)
#!/bin/bash
NN=100
if [ "$#" -eq 1 ]; then
NN=$1
fi
cat p_*.nwk > alltree_merge.tre # cat tree*.tre > alltree_merge.tre
PHYLIPNEW-3.69.650/bin/fconsense -intreefile alltree_merge.tre -outfile out -treeprint Y
perl ./bin/percentageboostrapTree.pl alltree_merge.treefile $NN Final_boostrap.tre # NN is the input number
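Because percentageboostrapTree.pl takes NN as an argument, it may be worth confirming that the merged file really contains NN trees before building the consensus. A small sketch follows, assuming each Newick tree is terminated by a semicolon and that no semicolons occur inside quoted labels. `count_trees` is a hypothetical helper, not part of VCF2Dis or PHYLIPNEW.

```shell
#!/bin/bash
# count_trees FILE: count the Newick trees in FILE by counting the
# semicolons that terminate them.
# Hypothetical helper for illustration.
count_trees() {
  tr -cd ';' < "$1" | wc -c | tr -d ' '
}
```

A check such as `[ "$(count_trees alltree_merge.tre)" -eq "$NN" ] || echo "tree count differs from NN" >&2` could be placed right after the cat step.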
For how to install PHYLIPNEW, please see the linked instructions (also available in Chinese).
The formula for calculating the p-distance between individuals from VCF SNP datasets is listed below:
D_ij = (1/L) * sum_l( d(l)_ij )
Where L is the length of the regions in which SNPs can be identified. Given that the alleles at position l are A/C between sample i and sample j:
d(l)_ij=0.0 if the genotypes of the two individuals were AA and AA;
d(l)_ij=0.5 if the genotypes of the two individuals were AA and AC;
d(l)_ij=0.0 if the genotypes of the two individuals were AC and AC;
d(l)_ij=1.0 if the genotypes of the two individuals were AA and CC;
d(l)_ij=0.0 if the genotypes of the two individuals were CC and CC;
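As a worked illustration of the formula, the sketch below computes D_ij for two samples from their per-site genotypes. It assumes biallelic sites with no missing calls and takes L to be the number of sites compared. Each site is scored as half the absolute difference in alternate-allele counts, which reproduces the five cases listed above and, by symmetry, gives 0.5 for AC versus CC (a case the list does not spell out). `pdist` is a hypothetical helper, not VCF2Dis code.

```shell
#!/bin/bash
# pdist GT_LIST_I GT_LIST_J
# Each argument is a space-separated list of biallelic genotypes
# (0/0, 0/1, 1/1), one per site, in the same site order for both samples.
# Hypothetical helper for illustration; not part of VCF2Dis.
pdist() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, gi, " "); m = split(b, gj, " ")
    if (n != m || n == 0) { print "NA"; exit 1 }
    sum = 0
    for (l = 1; l <= n; l++) {
      # gsub returns how many "1" characters it matched, i.e. the
      # alternate-allele count: 0/0 -> 0, 0/1 -> 1, 1/1 -> 2
      ci = gsub(/1/, "1", gi[l]); cj = gsub(/1/, "1", gj[l])
      d = (ci > cj ? ci - cj : cj - ci) / 2
      sum += d
    }
    printf "%.4f\n", sum / n
  }'
}

# Four sites with per-site distances 0, 0.5, 0 and 1: D_ij = 1.5 / 4
pdist "0/0 0/0 0/1 1/1" "0/0 0/1 0/1 0/0"   # prints 0.3750
```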
To learn more about how the p_distance matrix is computed from a VCF file, please refer to this website.
VCF2Dis has been cited more than 150 times according to a Google Scholar search.
Below are some NJ-tree images that I drew for earlier papers.
- 50 Rices NBT
- 31 soybeans NG
Tree displayed with MEGA after running the test data: VCF2Dis -i ALL.chr*.genotypes.vcf.gz -SubPop subsample203.list -InSampleGroup pop.info
- 📧 [email protected] / [email protected]
- Join the QQ group: 125293663
######################swimming in the sky and flying in the sea ########################### ##