Skip to content

LPDI-EPFL/DrugFlow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DrugFlow & FlexFlow

Code repository for "Multi-domain Distribution Learning for De Novo Drug Design" by Arne Schneuing*, Ilia Igashov*, Adrian W. Dobbelstein, Thomas Castiglione, Michael M. Bronstein, and Bruno Correia

Abstract

We introduce DrugFlow, a generative model for structure-based drug design that integrates continuous flow matching with discrete Markov bridges, demonstrating state-of-the-art performance in learning chemical, geometric, and physical aspects of three-dimensional protein-ligand data. We endow DrugFlow with an uncertainty estimate that is able to detect out-of-distribution samples. To further enhance the sampling process towards distribution regions with desirable metric values, we propose a joint preference alignment scheme applicable to both flow matching and Markov bridge frameworks. Furthermore, we extend our model to also explore the conformational landscape of the protein by jointly sampling side chain angles and molecules.

Setup

Conda Environment

Create a conda/mamba environment

conda env create -f environment.yaml -n drugflow
conda activate drugflow

and add the Gnina executable for docking score computation

wget https://github.com/gnina/gnina/releases/download/v1.1/gnina -O $CONDA_PREFIX/bin/gnina
chmod +x $CONDA_PREFIX/bin/gnina

Docker Container

A pre-built Docker container is available on DockerHub:

docker pull igashov/drugflow:0.0.3

Basic Usage

To sample molecules for a protein target:

# Download a model
wget -P checkpoints/ https://zenodo.org/records/14919171/files/drugflow.ckpt

# Generate molecules
python src/generate.py \
  --protein examples/kras.pdb \
  --ref_ligand examples/kras_ref_ligand.sdf \
  --checkpoint checkpoints/drugflow.ckpt \
  --output examples/samples.sdf

For more options, see

python src/generate.py --help

Models

Please find model checkpoints here or use the download links below:

Dataset preparation

Pre-processed dataset

The preprocessed dataset is available on Zenodo

wget https://zenodo.org/records/14919171/files/processed_crossdocked.zip
unzip processed_crossdocked.zip

(Optional) running pre-processing locally

To process the raw dataset locally, first download and extract the CrossDocked dataset as described by the authors of Pocket2Mol: https://github.com/pengxingang/Pocket2Mol/tree/main/data.

Specify input and output directories

CROSSDOCKED_DATA=...  # location at which the dataset was extracted
PROCESSED_DATA=...  # location at which the processed dataset will be stored

Then, preprocess the data for DrugFlow

python src/data/process_crossdocked.py $CROSSDOCKED_DATA \
       --outdir $PROCESSED_DATA \
       --flex

(Optional) Pre-processing of a custom preference alignment dataset

To create a dataset for preference alignment (PA), first, download the preprocessed dataset.

Then, sample a synthetic dataset using a pretrained reference model and evaluate the samples by first specifying input and output directories and evaluate the samples.

PREPROCESSED_DATA=...  # Location of the preprocessed data directory
SAMPLES_DIR=...  # Location where the sampled dataset is stored
EVALUATED_DATA=...  # Directory for evaluation output

Specify input and output directories for the PA dataset:

PROCESSED_DATA=...  # Location at which the processed dataset will be stored
METRICS_PATH=$EVALUATED_DATA/metrics_detailed.csv
CRITERION=...  # Preference alignment criterion ('reos.all', 'medchem.sa', 'medchem.qed', 'gnina.vina_efficiency', or 'combined')

Finally, preprocess the data for DrugFlow-PA:

python src/data/process_dpo_dataset.py \
       --smplsdir $SAMPLES_DIR \
       --basedir $PROCESSED_DATA \
       --datadir $PREPROCESSED_DATA \
       --dpo-criterion $CRITERION \
       --metrics-detailed $METRICS_PATH \
       --ignore-missing-scores

Training

Example config files are provided for:

  • DrugFlow: CONFIG=configs/training/drugflow.yml
  • FlexFlow: CONFIG=configs/training/flexflow.yml
  • Preference alignment: CONFIG=configs/training/preference_alignment.yml

Create a symlink to the processed dataset and for the output directory

LOGDIR=...  # where checkpoints, and validation outputs will be saved
ln -s $PROCESSED_DATA processed_crossdocked
ln -s $LOGDIR runs

Alternatively, you can change the corresponding paths in the config files.

To launch the training job for the DrugFlow base model, for example, run

python src/train.py --config $CONFIG

Inference

Checkpoints

Pretrained checkpoints can be downloaded from Zenodo with

# Base DrugFlow model
wget -P checkpoints/ https://zenodo.org/records/14919171/files/drugflow.ckpt

# DrugFlow + confidence head
wget -P checkpoints/ https://zenodo.org/records/14919171/files/drugflow_ood.ckpt

# FlexFlow
wget -P checkpoints/ https://zenodo.org/records/14919171/files/flexflow.ckpt

# DrugFlow after preference alignment
wget -P checkpoints/ https://zenodo.org/records/14919171/files/drugflow_pa_comb.ckpt

Sampling for all proteins in the test set

The selected checkpoint, e.g. checkpoints/drugflow.ckpt, must be specified in configs/sampling/sample_and_maybe_evaluate.yml. To sample with your own trained model, simply provide a custom checkpoint path instead.

Furthermore, you need to update the sample_outdir parameter in the sampling config file or link the desired output location

SAMPLE_OUTDIR=...  # where samples will be saved
ln -s $SAMPLE_OUTDIR samples

For sampling, run

python src/sample_and_evaluate.py --config configs/sampling/sample_and_maybe_evaluate.yml

which supports parallelization across target pockets by specifying --job_id and --n_jobs. To also evaluate the results, set evaluate: True in the sampling config file.

Evaluating samples

We provide evaluators for metrics used in our paper. To evaluate samples, specify:

SAMPLES_DIR=...  # Location where the sampled dataset is stored
EVALUATED_DATA_ALL=...  # Temporary directory for evaluation output
EVALUATED_DATA=...  # Evaluation output

Run the evaluation:

python scripts/python/evaluate_baselines.py \
       --in_dir $SAMPLES_DIR \
       --out_dir $EVALUATED_DATA_ALL

python scripts/python/postprocess_metrics.py \
       --in_dir $EVALUATED_DATA_ALL \
       --out_dir $EVALUATED_DATA

Per-sample evaluation results will be stored in EVALUATED_DATA/metrics_detailed.csv and aggregated metrics will be stored in EVALUATED_DATA/metrics_aggregated.csv.

Samples

DrugFlow and baseline samples are available on Zenodo:

Reference

@inproceedings{
  schneuing2025multidomain,
  title={Multi-domain Distribution Learning for De Novo Drug Design},
  author={Arne Schneuing and Ilia Igashov and Adrian W. Dobbelstein and Thomas Castiglione and Michael M. Bronstein and Bruno Correia},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=g3VCIM94ke}
}