Supporting repository for "Protein language models trained on multiple sequence alignments learn phylogenetic relationships" (Lupo, Sgarbossa, and Bitbol, 2022). The MSA Transformer model used here was introduced in (Rao et al., 2021).
Clone this repository on your local machine by running
git clone git@github.com:Bitbol-Lab/Phylogeny-MSA-Transformer.git
and move inside the root folder.
We recommend creating and activating a dedicated conda or virtualenv Python virtual environment.
Then, install the required libraries:
python -m pip install -U -r requirements.txt
To run the notebooks, the following Python packages are required:
- tqdm
- jupyter
- matplotlib
- statsmodels
- biopython
- swalign
- esm==0.4.0
prody and HMMER are required to run the Python script data/Pfam_Seed/fetch_seed_MSAs.py, if you wish to create new Pfam full MSAs instead of using the ones provided.
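
After installation, a quick check that the requirements listed above are visible from the active environment can save a failed notebook run. The following is a minimal sketch and not part of this repository. The import names (for example `Bio` for biopython) and the choice of `hmmbuild` as a stand-in for a working HMMER installation are assumptions, and package versions such as esm==0.4.0 are not verified.

```python
import importlib.util
import shutil

# Assumed mapping from the package names listed above to importable module names.
REQUIRED_MODULES = {
    "tqdm": "tqdm",
    "jupyter": "jupyter",
    "matplotlib": "matplotlib",
    "statsmodels": "statsmodels",
    "biopython": "Bio",
    "swalign": "swalign",
    "esm": "esm",
}

# Only needed for data/Pfam_Seed/fetch_seed_MSAs.py.
OPTIONAL_MODULES = {"prody": "prody"}
# Assumption: finding `hmmbuild` on PATH indicates that HMMER is installed.
OPTIONAL_EXECUTABLES = ["hmmbuild"]


def missing_modules(modules):
    """Return the package names whose module cannot be located for import."""
    return [
        name
        for name, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


def missing_executables(executables):
    """Return the executables that are not found on PATH."""
    return [exe for exe in executables if shutil.which(exe) is None]


def main():
    print("Missing required packages:", missing_modules(REQUIRED_MODULES))
    print("Missing optional packages:", missing_modules(OPTIONAL_MODULES))
    print("Missing optional executables:", missing_executables(OPTIONAL_EXECUTABLES))


if __name__ == "__main__":
    main()
```

An empty list on each line means everything in that group was found. The optional entries matter only if you plan to build new MSAs with the script mentioned above.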
Our work can be cited using the following BibTeX entry:
@article{lupo2022protein,
title={Protein language models trained on multiple sequence alignments learn phylogenetic relationships},
author={Lupo, Umberto and Sgarbossa, Damiano and Bitbol, Anne-Florence},
year={2022},
volume={13},
number={6298},
journal={Nat. Commun.},
doi={10.1038/s41467-022-34032-y}
}