
step-tutorials.md

Step-by-Step tutorial:

Step01: specify the set of strains
Load the strain file from the run directory. It contains a list of NCBI RefSeq accession numbers or the names of your own GenBank files (without the file extension).
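The strain file can be read with a few lines of Python. This is a minimal sketch, not the pipeline's own loader: it assumes one entry per line, treats lines starting with `#` as comments, and the function name `load_strain_list` is hypothetical.

```python
from pathlib import Path


def load_strain_list(path):
    """Return the strain identifiers listed in a strain file, one per line."""
    strains = []
    for line in Path(path).read_text().splitlines():
        name = line.strip()
        # Assumed convention: blank lines and '#' comments are ignored.
        if not name or name.startswith("#"):
            continue
        # Entries are expected without a file ending; drop ".gbk" if present.
        if name.endswith(".gbk"):
            name = name[: -len(".gbk")]
        strains.append(name)
    return strains
```

Stripping a stray `.gbk` ending is a convenience added here so that both `NC_01` and `NC_01.gbk` resolve to the same strain name.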

Step03: extract gene sequences from GenBank (*.gbk) file
Extract the genes from each GenBank (*.gbk) file as nucleotide and amino acid sequences.

  • Input:
    In folder ./data/TestSet/:
    *.gbk file
  • Output:
    In folder ./data/TestSet/nucleotide_fna:
    *.fna file (nucleotide sequences)
    In folder ./data/TestSet/protein_faa:
    *.faa file (amino acid sequences)

Step04: extract metadata from GenBank (*.gbk) file (Alternative: provide manually curated metadata table)
Extract metadata (e.g., country, collection_date, host, strain) from the GenBank files, or provide your own tab-separated values (TSV) file, for example:

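| strain | location    | host age | serotype | benzylpenicillin MIC (ug/mL) | ... |
|--------|-------------|----------|----------|------------------------------|-----|
| NC_01  | Germany     | 35       | 23A      | 0.016                        | ... |
| NC_02  | Switzerland | 66       | 23B      | 4                            | ... |
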
  • Input:
    In folder ./data/TestSet/:
    *.gbk file
  • Output:
    In folder ./data/TestSet/:
    metainfo.tsv (metadata for visualization)

User-provided metadata:

  • -mi --metainfo_fpath

    Absolute path to the metadata file (e.g., /path/meta.out)
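Before passing a hand-curated table with -mi, it can help to check that it parses as TSV and has a strain column. The sketch below uses only the standard library; the function name `read_metainfo` is hypothetical, and the requirement of a column named `strain` is inferred from the example table above rather than from the pipeline's code.

```python
import csv


def read_metainfo(path):
    """Map each strain name to its remaining metadata columns."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Assumption: the first row is a header that includes 'strain'.
        if "strain" not in (reader.fieldnames or []):
            raise ValueError("metadata table needs a 'strain' column")
        table = {}
        for row in reader:
            strain = row.pop("strain")
            table[strain] = row
    return table
```

All values are returned as strings, so numeric columns such as an MIC need an explicit conversion before use.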

Step05: compute gene clusters
Compare all protein sequences against each other with DIAMOND, then cluster the genes with MCL.

  • Input:
    In folder ./data/TestSet/protein_faa/:
    *.faa file
  • Output:
    In folder ./data/TestSet/protein_faa/diamond_matches/:
    allclusters.cpk (dictionary for gene clusters)
    diamond_geneCluster_dt: {clusterID: [count_strains, [memb1, ...], count_genes]}
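The dictionary layout above can be illustrated with a small example. This is a sketch under stated assumptions: member IDs are taken to look like `<strain>|<locus_tag>`, the cluster ID scheme is made up for illustration, and `.cpk` is assumed to be a pickled Python object. The helper `build_cluster_dict` is hypothetical.

```python
import pickle


def build_cluster_dict(clusters):
    """Build {clusterID: [count_strains, [members], count_genes]}."""
    cluster_dt = {}
    for index, members in enumerate(clusters, start=1):
        # Assumed member ID format: "<strain>|<locus_tag>".
        strains = {member.split("|")[0] for member in members}
        cluster_id = f"GC{index:08d}"  # hypothetical ID scheme
        cluster_dt[cluster_id] = [len(strains), list(members), len(members)]
    return cluster_dt


clusters = [
    ["NC_01|g1", "NC_02|g7"],              # one gene in each of two strains
    ["NC_01|g2", "NC_01|g3", "NC_02|g9"],  # NC_01 carries two copies
]
cluster_dt = build_cluster_dict(clusters)

# Round-trip through pickle, assuming .cpk stores a pickled dictionary.
restored = pickle.loads(pickle.dumps(cluster_dt))
```

In the second cluster count_strains (2) is smaller than count_genes (3), which is how a duplicated gene shows up in this structure.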

Step06: build alignments, gene trees from gene clusters and run phylogeny-based post-processing
Load the nucleotide sequences of each gene cluster, construct nucleotide and amino acid alignments, build a gene tree from the nucleotide alignment, split paralogs, and export the gene tree as a JSON file for visualization.

  • Input:
    In folder ./data/TestSet/protein_faa/diamond_matches/:
    allclusters.cpk file
  • Output:
    In folder ./data/TestSet/protein_faa/diamond_matches/:
    allclusters_final.tsv (final gene clusters)
    In folder ./data/TestSet/geneCluster/:
    GC*.fna (nucleotide fasta)
    GC*_na_aln.fa (nucleotide alignment)
    GC*.faa (amino acid fasta)
    GC*_aa_aln.fa (amino acid alignment)
    GC*_tree.json (gene tree in json file)

Step07: construct core gene SNP matrix
Call SNPs in strictly core genes (those without gene duplication) and build the SNP matrix used for the strain tree.

  • Output:
    In folder ./data/TestSet/geneCluster/:
    SNP_whole_matrix.aln (SNP matrix as pseudo alignment)
    snp_pos.cpk (snp positions)
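The idea of a SNP matrix as a pseudo-alignment can be sketched as follows: keep only the alignment columns in which the strains differ. This is an illustration for a single aligned gene, not the pipeline's implementation. The function name `snp_matrix` is hypothetical, positions are 0-based, and skipping columns that contain gaps or ambiguous bases is an assumption.

```python
def snp_matrix(alignment):
    """Return (variable positions, {strain: SNP string}) for aligned sequences."""
    strains = sorted(alignment)
    lengths = {len(alignment[s]) for s in strains}
    if len(lengths) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    positions = []
    for i in range(lengths.pop()):
        column = {alignment[s][i] for s in strains}
        # Assumption: only columns made purely of A/C/G/T count as SNPs.
        if len(column) > 1 and column <= set("ACGT"):
            positions.append(i)
    matrix = {s: "".join(alignment[s][i] for i in positions) for s in strains}
    return positions, matrix


alignment = {
    "NC_01": "ATGCCA",
    "NC_02": "ATGTCA",
    "NC_03": "ACGTC-",
}
positions, matrix = snp_matrix(alignment)
# positions -> [1, 3]
# matrix    -> {"NC_01": "TC", "NC_02": "TT", "NC_03": "CT"}
```

The last column differs between strains only because of a gap, so it is left out of the matrix under the assumption stated above.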

Step08: build the strain tree using core gene SNPs
Use FastTree to build the core-genome phylogeny, then refine it with RAxML.

  • Input:
    In folder ./data/TestSet/geneCluster/:
    SNP_whole_matrix.aln
  • Output:
    In folder ./data/TestSet/geneCluster/:
    strain_tree.nwk

Step09: infer gene gain and loss events
Use the ancestral reconstruction algorithm in TreeTime to infer gene gain and loss events.

  • Output:
    In folder ./data/TestSet/geneCluster/:
    genePresence.aln (gene presence and absence pattern)
    GC000*_patterns.json (gene gain/loss pattern for each gene cluster)
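The gene presence and absence pattern that serves as input to the reconstruction can be pictured as one binary string per strain, with one character per gene cluster. The sketch below only builds that pattern and does not perform the ancestral inference. The `1`/`0` encoding, the `<strain>|<locus_tag>` member format, and the function name `presence_patterns` are assumptions.

```python
def presence_patterns(cluster_members, strains):
    """Return {strain: '10...'} with one character per gene cluster."""
    cluster_ids = sorted(cluster_members)  # fixed column order
    patterns = {}
    for strain in strains:
        bits = []
        for cid in cluster_ids:
            # Assumed member ID format: "<strain>|<locus_tag>".
            present = any(
                member.split("|")[0] == strain
                for member in cluster_members[cid]
            )
            bits.append("1" if present else "0")
        patterns[strain] = "".join(bits)
    return patterns


cluster_members = {
    "GC00000001": ["NC_01|g1", "NC_02|g7"],
    "GC00000002": ["NC_01|g2"],
}
patterns = presence_patterns(cluster_members, ["NC_01", "NC_02"])
# patterns -> {"NC_01": "11", "NC_02": "10"}
```

Treating these strings as aligned sequences lets a tree-based reconstruction method handle gene presence like any other character state.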

Step10: export gene cluster json file
Export the JSON file used for the gene cluster datatable visualization.

  • Output:
    In folder ./data/TestSet/geneCluster/:
    geneCluster.json (gene cluster json for datatable visualization)

Step11: export tree and metadata json file
Export the JSON files used for the strain tree and metadata visualization.

  • Input:
    In folder ./data/TestSet/:
    metainfo.tsv (metadata table)
    In folder ./data/TestSet/geneCluster/:
    strain_tree.nwk (strain tree)
  • Output:
    In folder ./data/TestSet/geneCluster/:
    coreGenomeTree.json (strain tree visualization)
    strainMetainfo.json (strain metadata table visualization)
  • Data collection for visualization (sending data to server):
    In folder ./data/TestSet/vis/:
    geneCluster.json
    coreGenomeTree.json
    strainMetainfo.json
    In folder ./data/TestSet/vis/geneCluster/:
    GC000*_na_aln.fa
    GC000*_aa_aln.fa
    GC000*_tree.json
    GC000*_patterns.json
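The data collection step amounts to gathering the exported files into the vis/ layout listed above. The following sketch shows one way to do that with the standard library. It assumes the files are copied rather than moved, skips any file that is missing, and the function name `collect_for_visualization` is hypothetical.

```python
import glob
import os
import shutil


def collect_for_visualization(run_dir):
    """Copy exported files into <run_dir>/vis/ and <run_dir>/vis/geneCluster/."""
    src = os.path.join(run_dir, "geneCluster")
    vis = os.path.join(run_dir, "vis")
    vis_gc = os.path.join(vis, "geneCluster")
    os.makedirs(vis_gc, exist_ok=True)

    copied = []
    # Top-level JSON files for the datatable, strain tree, and metadata.
    for name in ("geneCluster.json", "coreGenomeTree.json", "strainMetainfo.json"):
        path = os.path.join(src, name)
        if os.path.exists(path):
            copied.append(shutil.copy(path, vis))

    # Per-cluster alignments, gene trees, and gain/loss patterns.
    for suffix in ("_na_aln.fa", "_aa_aln.fa", "_tree.json", "_patterns.json"):
        for path in sorted(glob.glob(os.path.join(src, "GC000*" + suffix))):
            copied.append(shutil.copy(path, vis_gc))
    return copied
```

For the test set this would be called as `collect_for_visualization("./data/TestSet")`; the returned list of destination paths makes it easy to check what was actually collected.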