This is the official implementation of the paper "Latent-based Directed Evolution for Protein Sequence Design".
Our repository is structured as follows:
.
├── active_optimize.sh   # inference + active learning
├── environment.yml
├── exps                 # experiment results
├── optimize.sh          # inference
├── preprocessed_data
├── README.md
├── scripts              # main executable scripts
├── src
│   ├── common           # common utilities
│   ├── dataio           # dataloader
│   └── models
├── train.sh             # training script
└── visualize_latent.sh  # visualize trained latent
You should have Python 3.10 or higher. We highly recommend creating a virtual environment, e.g., with conda. If so, run the command below to install the dependencies:
conda env create -f environment.yml
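After creation, activate the environment before running any scripts. The environment name used below (lde) is only a placeholder; use the name defined at the top of environment.yml:
conda activate lde   # replace `lde` with the environment name set in environment.yml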
Download the oracle landscape models with the following commands (using the script provided in the scripts folder):
cd scripts
bash download_landscape.sh
To train the VAE model for each benchmark dataset, go to the root directory and execute the train.sh script. Taking avGFP as an example, run the following command:
bash train.sh ./scripts/configs/rnn_template.py 0 template avGFP 20 256
Checkpoints will be saved in the exps/ckpts/ folder. Details of the passed arguments can be found in train.sh.
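For convenience, here is one possible reading of the positional arguments, inferred only from the avGFP example above; the exact meanings are assumptions and should be checked against train.sh:
# bash train.sh <config> <gpu_id> <exp_name> <dataset> <arg5> <arg6>
#   <config>   : model config file, e.g. ./scripts/configs/rnn_template.py
#   <gpu_id>   : GPU index to train on, e.g. 0
#   <exp_name> : experiment/config name, e.g. template
#   <dataset>  : benchmark dataset name, e.g. avGFP
#   <arg5>, <arg6> : training hyperparameters used by the script (20 and 256 in the example)
bash train.sh ./scripts/configs/rnn_template.py 0 template avGFP 20 256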
To perform optimization, go to the root directory and execute the optimize.sh script. Taking avGFP as an example, run the following command:
bash optimize.sh avGFP 0 template <model_ckpt_path> <oracle_ckpt_path> 1 rnn
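Likewise, a tentative reading of the positional arguments of optimize.sh, inferred from the example command above; treat these as assumptions and verify them in optimize.sh:
# bash optimize.sh <dataset> <gpu_id> <exp_name> <model_ckpt_path> <oracle_ckpt_path> <arg6> <arg7>
#   <dataset>          : benchmark dataset name, e.g. avGFP
#   <gpu_id>           : GPU index, e.g. 0
#   <exp_name>         : experiment/config name, e.g. template
#   <model_ckpt_path>  : path to the VAE checkpoint produced by train.sh (under exps/ckpts/)
#   <oracle_ckpt_path> : path to the downloaded oracle landscape checkpoint
#   <arg6>, <arg7>     : remaining options used by the script (1 and rnn in the example)
bash optimize.sh avGFP 0 template <model_ckpt_path> <oracle_ckpt_path> 1 rnn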
Similarly, to perform active learning alongside optimization, see the details of the passed arguments in the active_optimize.sh file.
Results will be saved in the exps/results_no_active and exps/results folders.
To average the results over 5 seeds, check calculate.py.
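A minimal sketch of how one might generate results for several seeds before averaging, assuming (this is only an assumption) that the sixth positional argument of optimize.sh is the random seed; check optimize.sh and calculate.py for the actual interface:
# ASSUMPTION: the 6th positional argument of optimize.sh is the random seed.
for seed in 1 2 3 4 5; do
    bash optimize.sh avGFP 0 template <model_ckpt_path> <oracle_ckpt_path> $seed rnn
done
# Afterwards, aggregate the per-seed results with calculate.py (see that script for its expected inputs).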
If you find our work useful for your research, please cite:
@article{Tran_2025,
doi = {10.1088/2632-2153/adc2e2},
url = {https://dx.doi.org/10.1088/2632-2153/adc2e2},
year = {2025},
month = {mar},
publisher = {IOP Publishing},
volume = {6},
number = {1},
pages = {015070},
author = {Tran, Thanh V T and Khang Ngo, Nhat and Thanh Duy Nguyen, Viet and Hy, Truong-Son},
title = {LatentDE: latent-based directed evolution for protein sequence design},
journal = {Machine Learning: Science and Technology},
abstract = {Directed evolution (DE) has been the most effective method for protein engineering that optimizes biological functionalities through a resource-intensive process of screening or selecting among a vast range of mutations. To mitigate this extensive procedure, recent advancements in machine learning-guided methodologies center around the establishment of a surrogate sequence-function model. In this paper, we propose latent-based DE (LDE), an evolutionary algorithm designed to prioritize the exploration of high-fitness mutants in the latent space. At its core, LDE is a regularized variational autoencoder (VAE), harnessing the capabilities of the state-of-the-art protein language model, ESM-2, to construct a meaningful latent space of sequences. From this encoded representation, we present a novel approach for efficient traversal on the fitness landscape, employing a combination of gradient-based methods and DE. Experimental evaluations conducted on eight protein sequence design tasks demonstrate the superior performance of our proposed LDE over previous baseline algorithms.}
}