Code repository for "Multi-domain Distribution Learning for De Novo Drug Design" by Arne Schneuing*, Ilia Igashov*, Adrian W. Dobbelstein, Thomas Castiglione, Michael M. Bronstein, and Bruno Correia
We introduce DrugFlow, a generative model for structure-based drug design that integrates continuous flow matching with discrete Markov bridges, demonstrating state-of-the-art performance in learning chemical, geometric, and physical aspects of three-dimensional protein-ligand data. We endow DrugFlow with an uncertainty estimate that is able to detect out-of-distribution samples. To steer the sampling process towards regions of the distribution with desirable metric values, we propose a joint preference alignment scheme applicable to both flow matching and Markov bridge frameworks. Furthermore, we extend our model to also explore the conformational landscape of the protein by jointly sampling side chain angles and molecules.
Create a conda/mamba environment
conda env create -f environment.yaml -n drugflow
conda activate drugflow
and add the Gnina executable for docking score computation
wget https://github.com/gnina/gnina/releases/download/v1.1/gnina -O $CONDA_PREFIX/bin/gnina
chmod +x $CONDA_PREFIX/bin/gnina
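Before running anything that computes docking scores, it can help to confirm that the Gnina binary is actually visible from the active environment. The snippet below is a minimal sketch using only the Python standard library; the helper name `find_executable` is hypothetical and not part of this repository.

```python
import shutil


def find_executable(name="gnina"):
    """Return the absolute path of `name` if it is an executable on PATH, else None.

    Hypothetical helper for illustration; not part of the DrugFlow code base.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable("gnina")
    if path is None:
        # Most likely the conda environment is not active or the download failed.
        print("gnina not found on PATH")
    else:
        print(f"gnina found at {path}")
```

If the check reports that the binary is missing, re-activating the `drugflow` environment and repeating the `wget` and `chmod` steps is the first thing to try.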
A pre-built Docker container is available on DockerHub:
docker pull igashov/drugflow:0.0.3
To sample molecules for a protein target:
# Download a model
wget -P checkpoints/ https://zenodo.org/records/14919171/files/drugflow.ckpt
# Generate molecules
python src/generate.py \
--protein examples/kras.pdb \
--ref_ligand examples/kras_ref_ligand.sdf \
--checkpoint checkpoints/drugflow.ckpt \
--output examples/samples.sdf
For more options, see
python src/generate.py --help
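As a quick sanity check after generation, one can count how many molecules ended up in the output file. The sketch below relies only on the SDF convention that every record ends with a `$$$$` line; the function names are hypothetical, and the path `examples/samples.sdf` is simply the `--output` value from the command above.

```python
def count_sdf_records_in_text(text):
    """Count SDF records in a string by counting '$$$$' terminator lines."""
    return sum(1 for line in text.splitlines() if line.strip() == "$$$$")


def count_sdf_records(sdf_path):
    """Count SDF records in a file (hypothetical helper, not part of the repo)."""
    with open(sdf_path, "r") as handle:
        return count_sdf_records_in_text(handle.read())


if __name__ == "__main__":
    import os

    # Path taken from the --output argument in the example command.
    path = "examples/samples.sdf"
    if os.path.exists(path):
        print(f"{path} contains {count_sdf_records(path)} molecules")
```

This does not validate the chemistry of the samples; it only confirms that the file contains the expected number of records.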
Model checkpoints can be downloaded using the links further below.
The preprocessed dataset is available on Zenodo
wget https://zenodo.org/records/14919171/files/processed_crossdocked.zip
unzip processed_crossdocked.zip
To process the raw dataset locally, first download and extract the CrossDocked dataset as described by the authors of Pocket2Mol: https://github.com/pengxingang/Pocket2Mol/tree/main/data.
Specify input and output directories
CROSSDOCKED_DATA=... # location at which the dataset was extracted
PROCESSED_DATA=... # location at which the processed dataset will be stored
Then, preprocess the data for DrugFlow
python src/data/process_crossdocked.py $CROSSDOCKED_DATA \
--outdir $PROCESSED_DATA \
--flex
To create a dataset for preference alignment (PA), first, download the preprocessed dataset.
Then, sample a synthetic dataset using a pretrained reference model and evaluate the samples. To do so, first specify the input and output directories:
PREPROCESSED_DATA=... # Location of the preprocessed data directory
SAMPLES_DIR=... # Location where the sampled dataset is stored
EVALUATED_DATA=... # Directory for evaluation output
Specify input and output directories for the PA dataset:
PROCESSED_DATA=... # Location at which the processed dataset will be stored
METRICS_PATH=$EVALUATED_DATA/metrics_detailed.csv
CRITERION=... # Preference alignment criterion ('reos.all', 'medchem.sa', 'medchem.qed', 'gnina.vina_efficiency', or 'combined')
Finally, preprocess the data for DrugFlow-PA:
python src/data/process_dpo_dataset.py \
--smplsdir $SAMPLES_DIR \
--basedir $PROCESSED_DATA \
--datadir $PREPROCESSED_DATA \
--dpo-criterion $CRITERION \
--metrics-detailed $METRICS_PATH \
--ignore-missing-scores
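Because the criterion is passed as a free-form string, a typo is easy to miss until the script runs. The following sketch assembles the same command in Python and checks the criterion against the five values listed above. The function name `build_dpo_command` is hypothetical, and whether the script itself validates the criterion is not something this sketch assumes.

```python
# Criteria listed in this README for the CRITERION variable.
ALLOWED_CRITERIA = (
    "reos.all",
    "medchem.sa",
    "medchem.qed",
    "gnina.vina_efficiency",
    "combined",
)


def build_dpo_command(samples_dir, processed_data, preprocessed_data,
                      criterion, metrics_path):
    """Return the process_dpo_dataset.py call as an argument list.

    Hypothetical helper: it mirrors the shell command shown above and
    rejects criteria that are not listed in this README.
    """
    if criterion not in ALLOWED_CRITERIA:
        raise ValueError(
            f"Unknown criterion {criterion!r}; expected one of {ALLOWED_CRITERIA}"
        )
    return [
        "python", "src/data/process_dpo_dataset.py",
        "--smplsdir", samples_dir,
        "--basedir", processed_data,
        "--datadir", preprocessed_data,
        "--dpo-criterion", criterion,
        "--metrics-detailed", metrics_path,
        "--ignore-missing-scores",
    ]
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.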
Example config files are provided for:
- DrugFlow:
CONFIG=configs/training/drugflow.yml
- FlexFlow:
CONFIG=configs/training/flexflow.yml
- Preference alignment:
CONFIG=configs/training/preference_alignment.yml
Create symlinks to the processed dataset and to the output directory
LOGDIR=... # where checkpoints and validation outputs will be saved
ln -s $PROCESSED_DATA processed_crossdocked
ln -s $LOGDIR runs
Alternatively, you can change the corresponding paths in the config files.
To launch the training job for the DrugFlow base model, for example, run
python src/train.py --config $CONFIG
Pretrained checkpoints can be downloaded from Zenodo with
# Base DrugFlow model
wget -P checkpoints/ https://zenodo.org/records/14919171/files/drugflow.ckpt
# DrugFlow + confidence head
wget -P checkpoints/ https://zenodo.org/records/14919171/files/drugflow_ood.ckpt
# FlexFlow
wget -P checkpoints/ https://zenodo.org/records/14919171/files/flexflow.ckpt
# DrugFlow after preference alignment
wget -P checkpoints/ https://zenodo.org/records/14919171/files/drugflow_pa_comb.ckpt
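For scripted setups, the same downloads can be done from Python. This is a minimal sketch using the standard library; the file names and the Zenodo base URL are taken from the `wget` commands above, while the helper names `checkpoint_url` and `download_checkpoint` are hypothetical.

```python
import os
import urllib.request

# Base URL and file names as they appear in the wget commands above.
ZENODO_BASE = "https://zenodo.org/records/14919171/files"
CHECKPOINTS = {
    "drugflow.ckpt": "Base DrugFlow model",
    "drugflow_ood.ckpt": "DrugFlow + confidence head",
    "flexflow.ckpt": "FlexFlow",
    "drugflow_pa_comb.ckpt": "DrugFlow after preference alignment",
}


def checkpoint_url(filename):
    """Return the download URL for one of the listed checkpoint files."""
    if filename not in CHECKPOINTS:
        raise KeyError(f"Unknown checkpoint file: {filename}")
    return f"{ZENODO_BASE}/{filename}"


def download_checkpoint(filename, outdir="checkpoints"):
    """Download a checkpoint into `outdir` unless it is already present."""
    os.makedirs(outdir, exist_ok=True)
    target = os.path.join(outdir, filename)
    if not os.path.exists(target):
        urllib.request.urlretrieve(checkpoint_url(filename), target)
    return target
```

For example, `download_checkpoint("flexflow.ckpt")` would place the FlexFlow weights in `checkpoints/`, matching the `-P checkpoints/` option of the `wget` calls.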
The selected checkpoint, e.g. checkpoints/drugflow.ckpt, must be specified in configs/sampling/sample_and_maybe_evaluate.yml.
To sample with your own trained model, simply provide a custom checkpoint path instead.
Furthermore, you need to update the sample_outdir parameter in the sampling config file or link the desired output location:
SAMPLE_OUTDIR=... # where samples will be saved
ln -s $SAMPLE_OUTDIR samples
For sampling, run
python src/sample_and_evaluate.py --config configs/sampling/sample_and_maybe_evaluate.yml
which supports parallelization across target pockets by specifying --job_id and --n_jobs.
To also evaluate the results, set evaluate: True in the sampling config file.
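To make the `--job_id` and `--n_jobs` options concrete, the sketch below generates one sampling command per job. It assumes that job IDs are zero-based and run from 0 to `n_jobs - 1`; this README does not state the indexing convention, so check `python src/sample_and_evaluate.py --help` before relying on it. The function name `build_sampling_commands` is hypothetical.

```python
DEFAULT_CONFIG = "configs/sampling/sample_and_maybe_evaluate.yml"


def build_sampling_commands(n_jobs, config=DEFAULT_CONFIG):
    """Return one sample_and_evaluate.py command (as an argument list) per job.

    ASSUMPTION: job IDs are zero-based (0 .. n_jobs - 1). The README does
    not confirm this, so verify it against the script's --help output.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    return [
        [
            "python", "src/sample_and_evaluate.py",
            "--config", config,
            "--job_id", str(job_id),
            "--n_jobs", str(n_jobs),
        ]
        for job_id in range(n_jobs)
    ]


if __name__ == "__main__":
    # Print the commands so they can be pasted into a job scheduler script.
    for cmd in build_sampling_commands(4):
        print(" ".join(cmd))
```

Each command can then be submitted as a separate job, for instance one per GPU or per cluster array task.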
We provide evaluators for metrics used in our paper. To evaluate samples, specify:
SAMPLES_DIR=... # Location where the sampled dataset is stored
EVALUATED_DATA_ALL=... # Temporary directory for evaluation output
EVALUATED_DATA=... # Evaluation output
Run the evaluation:
python scripts/python/evaluate_baselines.py \
--in_dir $SAMPLES_DIR \
--out_dir $EVALUATED_DATA_ALL
python scripts/python/postprocess_metrics.py \
--in_dir $EVALUATED_DATA_ALL \
--out_dir $EVALUATED_DATA
Per-sample evaluation results will be stored in $EVALUATED_DATA/metrics_detailed.csv and aggregated metrics will be stored in $EVALUATED_DATA/metrics_aggregated.csv.
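The per-sample CSV can be inspected with the standard library alone. The sketch below averages one numeric column and skips empty or non-numeric cells. The column names in `metrics_detailed.csv` are not documented here, so the column is passed as a parameter, and `"score"` in the usage note is a placeholder rather than a real metric name.

```python
import csv
import io
import math


def read_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def column_mean(rows, column):
    """Average the numeric values of `column`, skipping blank, non-numeric or NaN cells.

    Returns None when no usable value is found.
    """
    values = []
    for row in rows:
        raw = (row.get(column) or "").strip()
        try:
            value = float(raw)
        except ValueError:
            continue  # blank or non-numeric cell
        if not math.isnan(value):
            values.append(value)
    if not values:
        return None
    return sum(values) / len(values)
```

For example, `column_mean(read_rows("metrics_detailed.csv"), "score")` would return the mean of a hypothetical `score` column; replace the name with an actual header from the file.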
DrugFlow and baseline samples are available on Zenodo.
@inproceedings{
schneuing2025multidomain,
title={Multi-domain Distribution Learning for De Novo Drug Design},
author={Arne Schneuing and Ilia Igashov and Adrian W. Dobbelstein and Thomas Castiglione and Michael M. Bronstein and Bruno Correia},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=g3VCIM94ke}
}