Monte-Carlo generation of synthetic multiple sequence alignments along phylogenetic trees using a protein language model

Bitbol-Lab/Phylogeny-ESM2

Lab Immersion at EPFL
Lab: Bitbol Lab – Laboratory of Computational Biology and Theoretical Biophysics
Professor: Anne-Florence Bitbol
Supervisors: Umberto Lupo, Damiano Sgarbossa, Cyril Antoine Malbranke

Table of Contents
  1. Description
  2. Getting Started
  3. Usage
  4. License

Description

This project infers a phylogenetic tree from a natural multiple sequence alignment (MSA) using either FastTree or IQTree. It then generates a synthetic MSA along that tree with a Metropolis–Hastings Markov chain Monte Carlo (MCMC) sampler that uses the probabilities given by the ESM2 protein language model. The goal is to produce synthetic data for fine-tuning the MSA Transformer.
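
To make the sampling idea concrete, here is a minimal sketch of a Metropolis–Hastings chain over a single sequence. It is not the code in this repository: `toy_log_prob` is a hypothetical stand-in for a score that would come from ESM2, and the single-site uniform proposal is an assumption chosen for simplicity. How the project ties the chain to tree branches is not shown here.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_log_prob(seq):
    """Hypothetical stand-in for a language-model log-probability.

    Toy rule: every alanine adds log(2) to the unnormalised score.
    """
    return seq.count("A") * math.log(2.0)


def metropolis_hastings(seq, n_steps, log_prob=toy_log_prob, rng=None):
    """Run n_steps of single-site Metropolis-Hastings starting from seq."""
    rng = rng or random.Random()
    current = list(seq)
    current_lp = log_prob("".join(current))
    for _ in range(n_steps):
        # Propose a new residue at one random position (symmetric proposal).
        pos = rng.randrange(len(current))
        proposal = current.copy()
        proposal[pos] = rng.choice(AMINO_ACIDS)
        proposal_lp = log_prob("".join(proposal))
        # Accept with probability min(1, p(proposal) / p(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
    return "".join(current)


if __name__ == "__main__":
    print(metropolis_hastings("MKTAYIAKQR", n_steps=500, rng=random.Random(0)))
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of target probabilities, which is why only the difference of log-probabilities appears.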

Built With

  • Python
  • pandas
  • NumPy
  • PyTorch
  • Biopython

Getting Started

Prerequisites

  • Python 3.11
  • FastTree
  • IQTree
  • MAFFT
  • HMMER

To install MAFFT:

conda install -c bioconda mafft
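
Before running the pipeline, it can help to confirm that the external tools are on your `PATH`. The sketch below is not part of the repository. The executable names in `REQUIRED_TOOLS` are assumptions and differ between installs (for example `iqtree` versus `iqtree2`), so adjust them to match your system.

```python
import shutil

# Assumed executable names; change these to match your installation.
REQUIRED_TOOLS = ("FastTree", "iqtree2", "mafft", "hmmsearch")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
```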

Installation

  1. Clone the repo
    git clone https://github.com/Bitbol-Lab/Phylogeny-ESM2.git
  2. Install the requirements
    pip install -r requirements.txt

Usage

  1. You can create a synthetic MSA for a single alignment using:

    python main.py -f <natural_msa_path>

    You can see all available command line arguments with:

    python main.py -h
  2. To create synthetic MSAs for multiple alignments:

    python run.py

    If your MSA input files use a different extension, change the check at run.py:51:

    if f.endswith('.fasta'):

    You can see all available command line arguments with:

    python run.py -h
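
If your alignments use several extensions, one option is to pass a tuple to `str.endswith` instead of editing the check for each dataset. This is a sketch of that idea rather than the code in `run.py`; the extension list and the helper names `is_msa_file` and `list_msa_files` are assumptions.

```python
import os

# Assumed set of extensions; extend it to match your input files.
MSA_EXTENSIONS = (".fasta", ".fa", ".a2m")


def is_msa_file(filename, extensions=MSA_EXTENSIONS):
    """Return True if filename ends with one of the accepted extensions."""
    # str.endswith accepts a tuple and matches any of its entries.
    return filename.lower().endswith(extensions)


def list_msa_files(directory, extensions=MSA_EXTENSIONS):
    """Return the sorted MSA file names found directly in `directory`."""
    return sorted(f for f in os.listdir(directory) if is_msa_file(f, extensions))
```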

Results

  1. To obtain the Hamming distance correlation results:

    python results.py -f <msa_natural_dir> -m <method> -o <msa_synthetic_dir>

    You can see all available command line arguments with:

    python results.py -h
  2. To obtain HMMER scores and their violin plots:

    python hmmer_scores.py -m <method> -s <hmm_profile_dir> -o <msa_dir> -r <output_dir>
    python violin_plot.py <synthetic_scores_dir> <natural_scores_dir>
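
As an illustration of what a Hamming distance correlation can measure, the sketch below computes all pairwise normalised Hamming distances within a natural MSA and within a synthetic MSA, then takes their Pearson correlation. It is not the code in `results.py`. It assumes that both MSAs have the same number of aligned sequences and that sequence i in one corresponds to sequence i in the other. All function names are hypothetical.

```python
import math
from itertools import combinations


def hamming(a, b):
    """Fraction of aligned positions at which sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_hamming(msa):
    """Hamming distances for all unordered sequence pairs, in a fixed order."""
    return [hamming(a, b) for a, b in combinations(msa, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def hamming_correlation(natural_msa, synthetic_msa):
    """Correlate the pairwise distance patterns of two matched MSAs."""
    return pearson(pairwise_hamming(natural_msa), pairwise_hamming(synthetic_msa))
```

A value close to 1 would indicate that pairs of sequences that are far apart in the natural MSA are also far apart in the synthetic one.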

License

Apache License 2.0
