This is the official implementation of the paper "Latent-based Directed Evolution for Protein Sequence Design".
Our repository is structured as follows:
.
├── active_optimize.sh   # inference + active learning
├── environment.yml
├── exps                 # experiment results
├── optimize.sh          # inference
├── preprocessed_data
├── README.md
├── scripts              # main executable scripts
├── src
│   ├── common           # common utilities
│   ├── dataio           # dataloader
│   └── models
├── train.sh             # training script
└── visualize_latent.sh  # visualize trained latent
You should have Python 3.10 or higher. We highly recommend creating a virtual environment, e.g., with conda. If so, run the command below to install the dependencies:
conda env create -f environment.yml
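After creation, activate the environment before running any scripts. The environment name used below (lde) is only a placeholder; use the name defined at the top of environment.yml:
conda activate lde   # replace `lde` with the environment name set in environment.yml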
Download the oracle landscape models with the following commands (using the script provided in the scripts folder):
cd scripts
bash download_landscape.sh
To train the VAE model for each benchmark dataset, go to the root directory and execute the train.sh script. Taking avGFP as an example, run the following command:
bash train.sh ./scripts/configs/rnn_template.py 0 template avGFP 20 256
Checkpoints will be saved in the exps/ckpts/ folder. Details of the passed arguments can be found in train.sh.
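For convenience, here is one possible reading of the positional arguments, inferred only from the avGFP example above; the exact meanings are assumptions and should be checked against train.sh:
# bash train.sh <config> <gpu_id> <exp_name> <dataset> <arg5> <arg6>
#   <config>   : model config file, e.g. ./scripts/configs/rnn_template.py
#   <gpu_id>   : GPU index to train on, e.g. 0
#   <exp_name> : experiment/config name, e.g. template
#   <dataset>  : benchmark dataset name, e.g. avGFP
#   <arg5>, <arg6> : training hyperparameters used by the script (20 and 256 in the example)
bash train.sh ./scripts/configs/rnn_template.py 0 template avGFP 20 256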
To perform optimization, go to the root directory and execute the optimize.sh script. Taking avGFP as an example, run the following command:
bash optimize.sh avGFP 0 template <model_ckpt_path> <oracle_ckpt_path> 1 rnn
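Likewise, a tentative reading of the positional arguments of optimize.sh, inferred from the example command above; treat these as assumptions and verify them in optimize.sh:
# bash optimize.sh <dataset> <gpu_id> <exp_name> <model_ckpt_path> <oracle_ckpt_path> <arg6> <arg7>
#   <dataset>          : benchmark dataset name, e.g. avGFP
#   <gpu_id>           : GPU index, e.g. 0
#   <exp_name>         : experiment/config name, e.g. template
#   <model_ckpt_path>  : path to the VAE checkpoint produced by train.sh (under exps/ckpts/)
#   <oracle_ckpt_path> : path to the downloaded oracle landscape checkpoint
#   <arg6>, <arg7>     : remaining options used by the script (1 and rnn in the example)
bash optimize.sh avGFP 0 template <model_ckpt_path> <oracle_ckpt_path> 1 rnn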
Similarly, to perform active learning alongside optimization, see the details of the passed arguments in the active_optimize.sh file.
Results will be saved in the exps/results_no_active and exps/results folders.
To average the results over 5 seeds, check calculate.py.
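A minimal sketch of how one might generate results for several seeds before averaging, assuming (this is only an assumption) that the sixth positional argument of optimize.sh is the random seed; check optimize.sh and calculate.py for the actual interface:
# ASSUMPTION: the 6th positional argument of optimize.sh is the random seed.
for seed in 1 2 3 4 5; do
    bash optimize.sh avGFP 0 template <model_ckpt_path> <oracle_ckpt_path> $seed rnn
done
# Afterwards, aggregate the per-seed results with calculate.py (see that script for its expected inputs).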
If you find our work useful for your research, please cite:
@article{Tran_2025,
doi = {10.1088/2632-2153/adc2e2},
url = {https://dx.doi.org/10.1088/2632-2153/adc2e2},
year = {2025},
month = {mar},
publisher = {IOP Publishing},
volume = {6},
number = {1},
pages = {015070},
author = {Tran, Thanh V T and Khang Ngo, Nhat and Thanh Duy Nguyen, Viet and Hy, Truong-Son},
title = {LatentDE: latent-based directed evolution for protein sequence design},
journal = {Machine Learning: Science and Technology},
abstract = {Directed evolution (DE) has been the most effective method for protein engineering that optimizes biological functionalities through a resource-intensive process of screening or selecting among a vast range of mutations. To mitigate this extensive procedure, recent advancements in machine learning-guided methodologies center around the establishment of a surrogate sequence-function model. In this paper, we propose latent-based DE (LDE), an evolutionary algorithm designed to prioritize the exploration of high-fitness mutants in the latent space. At its core, LDE is a regularized variational autoencoder (VAE), harnessing the capabilities of the state-of-the-art protein language model, ESM-2, to construct a meaningful latent space of sequences. From this encoded representation, we present a novel approach for efficient traversal on the fitness landscape, employing a combination of gradient-based methods and DE. Experimental evaluations conducted on eight protein sequence design tasks demonstrate the superior performance of our proposed LDE over previous baseline algorithms.}
}