Monte-Carlo generation of synthetic multiple sequence alignments along phylogenetic trees using a protein language model
Lab Immersion at EPFL
Lab: Bitbol Lab – Laboratory of Computational Biology and Theoretical Biophysics
Professor: Anne-Florence Bitbol
Supervisors: Umberto Lupo, Damiano Sgarbossa, Cyril Antoine Malbranke
This project infers a phylogenetic tree from a natural multiple sequence alignment (MSA) using either FastTree or IQTree. It then generates a synthetic MSA along this tree with a Metropolis–Hastings Markov chain Monte Carlo (MCMC) sampler that uses the probabilities given by the ESM2 model. The goal is to produce synthetic data for fine-tuning the MSA Transformer.
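To make the sampling idea concrete, here is a minimal sketch of a Metropolis–Hastings chain over a single protein sequence. It is not the code in this repository: `toy_log_prob` is a hypothetical stand-in for the ESM2 log-probabilities, the single-site uniform proposal is an assumption, and the way the chain is tied to the tree (for example, running more steps on longer branches) is left out.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_log_prob(seq):
    """Hypothetical score standing in for a language-model log-probability.

    It simply rewards hydrophobic residues; a real run would query ESM2.
    """
    return sum(0.5 for aa in seq if aa in "AILMFVW")


def metropolis_hastings(seq, log_prob, n_steps, rng):
    """Run n_steps of single-site Metropolis-Hastings starting from seq."""
    current = list(seq)
    current_lp = log_prob(current)
    for _ in range(n_steps):
        # Propose a substitution at a uniformly chosen position.
        pos = rng.randrange(len(current))
        proposal = current.copy()
        proposal[pos] = rng.choice(AMINO_ACIDS)
        proposal_lp = log_prob(proposal)
        # The proposal is symmetric, so the acceptance probability is
        # min(1, p(proposal) / p(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
    return "".join(current)


if __name__ == "__main__":
    rng = random.Random(0)
    print(metropolis_hastings("ACDEFGHIKL", toy_log_prob, 200, rng))
```

Under this assumption, a child sequence in the tree would be obtained by starting the chain from its parent sequence, so that closely related leaves stay similar.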
- Python 3.11
- FastTree
- IQTree
- MAFFT
- HMMER
To install MAFFT:
conda install -c bioconda mafft
- Clone the repo
git clone https://github.com/Bitbol-Lab/Phylogeny-ESM2.git
- Install the requirements
pip install -r requirements.txt
- You can create a synthetic MSA for a single alignment using:
python main.py -f <natural_msa_path>
You can see all available command line arguments with:
python main.py -h
- To create synthetic MSAs for multiple alignments:
python run.py
You may need to change the file extension that this script expects for the input MSAs. To do so, edit the following condition at run.py:51:
if f.endswith('.fasta'):
You can see all available command line arguments with:
python run.py -h
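If your input directory mixes several extensions, one option is to pass a tuple to `str.endswith`, which accepts multiple suffixes. The sketch below is illustrative only: `MSA_EXTENSIONS` and `list_msa_files` are hypothetical names and do not come from run.py.

```python
import os

# Hypothetical set of accepted extensions; adjust it to your input files.
MSA_EXTENSIONS = (".fasta", ".fa", ".a3m")


def list_msa_files(directory):
    """Return the sorted file names in directory that look like MSA files."""
    return sorted(
        f for f in os.listdir(directory) if f.endswith(MSA_EXTENSIONS)
    )
```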
- To obtain the Hamming distance correlation results:
python results.py -f <msa_natural_dir> -m <method> -o <msa_synthetic_dir>
You can see all available command line arguments with:
python results.py -h
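As an illustration of what such a comparison can look like, the sketch below computes normalized pairwise Hamming distances within a natural and a synthetic MSA and then their Pearson correlation. This is an assumption about the analysis, not a description of results.py: it presumes that both alignments list corresponding sequences in the same order, and all function names are hypothetical.

```python
import math
from itertools import combinations


def hamming(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_hamming(msa):
    """Hamming distances for all unordered pairs, in a fixed pair order."""
    return [hamming(a, b) for a, b in combinations(msa, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy alignments; row i of each MSA is assumed to be the same tree leaf.
natural = ["ACDE", "ACDF", "GCDF"]
synthetic = ["ACDE", "ACDY", "GCKY"]
r = pearson(pairwise_hamming(natural), pairwise_hamming(synthetic))
```

A correlation close to 1 would indicate that the synthetic MSA preserves the pairwise distance structure of the natural one.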
- To obtain HMMER scores and their violin plots:
python hmmer_scores.py -m <method> -s <hmm_profile_dir> -o <msa_dir> -r <output_dir>
python violin_plot.py <synthetic_scores_dir> <natural_scores_dir>
Apache License 2.0