
Heterocycle Retrosynthesis

This repository complements our publication "Transfer learning for Heterocycle Synthesis Prediction": https://chemrxiv.org/engage/chemrxiv/article-details/6617d56321291e5d1d9ef449

Requirements

The specific versions used in this project were:

  • Python: 3.6.9
  • PyTorch (torch): 1.2.0
  • TorchText: 0.4.0
  • OpenNMT-py (ONMT): 1.0.0
  • RDKit: 2019.03.2

Conda Environment Setup

conda create -n het-retro python=3.6
conda activate het-retro
conda install -c rdkit rdkit=2019.03.2 -y
conda install -c pytorch pytorch=1.2.0 -y
git clone https://github.com/ewawieczorek/Het-retro.git
cd Het-retro
pip install -e .
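
To confirm that the environment matches the versions listed above, a quick check such as the following can be run (a minimal sketch; it only prints the installed package versions):

import rdkit
import torch
import torchtext

# The reported versions should match those listed under Requirements.
print("torch    :", torch.__version__)      # expected 1.2.0
print("torchtext:", torchtext.__version__)  # expected 0.4.0
print("rdkit    :", rdkit.__version__)      # expected 2019.03.2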

Quickstart

The training and evaluation were performed using OpenNMT-py. The full documentation of the OpenNMT-py library can be found at https://opennmt.net/OpenNMT-py/.

Step 1: Preprocess the data

Start by preparing the Ring and USPTO datasets as described in their respective directories.

Single data sets

This preprocessing approach is suitable for pre-training and fine-tuning:

DATADIR=data/uspto_dataset
onmt_preprocess -train_src $DATADIR/product-train.txt -train_tgt $DATADIR/reactant-train.txt -valid_src $DATADIR/product-valid.txt -valid_tgt $DATADIR/reactant-valid.txt -save_data $DATADIR/uspto -src_seq_length 3000 -tgt_seq_length 3000 -src_vocab_size 3000 -tgt_vocab_size 3000 -share_vocab
DATADIR=data/ring_dataset
onmt_preprocess -train_src $DATADIR/product-train.txt -train_tgt $DATADIR/reactant-train.txt -valid_src $DATADIR/product-valid.txt -valid_tgt $DATADIR/reactant-valid.txt -save_data $DATADIR/sequential -src_seq_length 3000 -tgt_seq_length 3000 -src_vocab_size 3000 -tgt_vocab_size 3000 -share_vocab

Multi-task data sets

This preprocessing approach is suitable for multi-task learning and mixed fine-tuning:

DATASET=data/uspto_dataset
DATASET_TRANSFER=data/ring_dataset

onmt_preprocess -train_src ${DATASET}/product-train.txt ${DATASET_TRANSFER}/product-train.txt -train_tgt ${DATASET}/reactant-train.txt ${DATASET_TRANSFER}/reactant-train.txt -train_ids uspto ring  -valid_src ${DATASET_TRANSFER}/product-valid.txt -valid_tgt ${DATASET_TRANSFER}/reactant-valid.txt -save_data ${DATASET_TRANSFER}/multi_task -src_seq_length 3000 -tgt_seq_length 3000 -src_vocab_size 3000 -tgt_vocab_size 3000 -share_vocab

The files have been tokenized beforehand using the reaction-SMILES tokenization function adapted from https://github.com/pschwllr/MolecularTransformer (sketched below).
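
For reference, the tokenizer splits a reaction SMILES into atom, bracket, bond and ring-closure tokens with a single regular expression. A minimal sketch closely following the upstream MolecularTransformer function (the copy used in this repository may differ slightly):

import re

def smi_tokenizer(smi):
    """Tokenize a SMILES molecule or reaction string."""
    pattern = "(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
    tokens = re.findall(pattern, smi)
    assert smi == "".join(tokens)  # no characters should be lost
    return " ".join(tokens)

print(smi_tokenizer("CC(=O)Oc1ccccc1C(=O)O"))
# C C ( = O ) O c 1 c c c c c 1 C ( = O ) O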

The data consist of parallel precursor (reactant) and product files containing one reaction per line, with tokens separated by single spaces:

  • reactant-train.txt
  • product-train.txt
  • reactant-valid.txt
  • product-valid.txt
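
Before preprocessing, it is worth checking that each product/reactant pair really is parallel, i.e. that the two files have the same number of lines (paths follow the commands above):

# Sanity check: source (product) and target (reactant) files must be line-aligned.
for split in ("train", "valid"):
    with open(f"data/uspto_dataset/product-{split}.txt") as f_src, \
         open(f"data/uspto_dataset/reactant-{split}.txt") as f_tgt:
        n_src = sum(1 for _ in f_src)
        n_tgt = sum(1 for _ in f_tgt)
    print(split, n_src, n_tgt)
    assert n_src == n_tgt, f"{split}: product/reactant files are not parallel"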

After running the preprocessing, the following files are generated:

  • uspto.train.pt: serialized PyTorch file containing training data
  • uspto.valid.pt: serialized PyTorch file containing validation data
  • uspto.vocab.pt: serialized PyTorch file containing vocabulary data

Internally the model never operates on the tokens themselves, but on their integer indices in the vocabulary built during preprocessing.
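
The shared vocabulary can be inspected directly from the serialized file. The snippet below is a sketch that assumes the OpenNMT-py 1.0 field layout (a dict of torchtext fields, where text fields wrap a base field holding the vocabulary); adjust the attribute access if your version differs:

import torch

# Load the fields saved by onmt_preprocess (layout assumed from OpenNMT-py 1.0).
fields = torch.load("data/uspto_dataset/uspto.vocab.pt")

for name, field in fields.items():
    base = getattr(field, "base_field", field)  # text fields wrap a base Field
    vocab = getattr(base, "vocab", None)
    if vocab is not None:
        print(name, "vocab size:", len(vocab))
        print("  first entries:", vocab.itos[:10])  # itos maps index -> token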

Step 2: Train the model

The transformer models were trained using the following hyperparameters:

Pretraining

DATADIR=data/uspto_dataset
SEED=42   # random seed used for all training runs below
onmt_train -data $DATADIR/uspto  \
        -save_model  baseline_model \
        -seed $SEED -gpu_ranks 0  \
        -train_steps 250000 -param_init 0 \
        -param_init_glorot -max_generator_batches 32 \
        -batch_size 6144 -batch_type tokens \
         -normalization tokens -max_grad_norm 0  -accum_count 4 \
        -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam  \
        -warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
        -layers 4 -rnn_size  384 -word_vec_size 384 \
        -encoder_type transformer -decoder_type transformer \
        -dropout 0.1 -position_encoding -share_embeddings  \
        -global_attention general -global_attention_function softmax \
        -self_attn_type scaled-dot -heads 8 -transformer_ff 2048

Multi-task transfer learning

DATADIR=data/ring_dataset
WEIGHT1=9   # relative sampling weight of the uspto data
WEIGHT2=1   # relative sampling weight of the ring data

onmt_train -data $DATADIR/multi_task  \
        -save_model  multi_task_model \
        -data_ids uspto ring -data_weights $WEIGHT1 $WEIGHT2 \
        -seed $SEED -gpu_ranks 0  \
        -train_steps 250000 -param_init 0 \
        -param_init_glorot -max_generator_batches 32 \
        -batch_size 6144 -batch_type tokens \
         -normalization tokens -max_grad_norm 0  -accum_count 4 \
        -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam  \
        -warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
        -layers 4 -rnn_size  384 -word_vec_size 384 \
        -encoder_type transformer -decoder_type transformer \
        -dropout 0.1 -position_encoding -share_embeddings  \
        -global_attention general -global_attention_function softmax \
        -self_attn_type scaled-dot -heads 8 -transformer_ff 2048

Fine-tuning

DATADIR=data/ring_dataset
TRAIN_STEPS=6000   # additional fine-tuning steps on top of the 250000 pretraining steps

onmt_train -data $DATADIR/sequential  \
        -train_from models/baseline_model.pt \
        -save_model  fine_tuned_model \
        -seed $SEED -gpu_ranks 0  \
        -train_steps $((250000 + TRAIN_STEPS)) -param_init 0 \
        -param_init_glorot -max_generator_batches 32 \
        -batch_size 6144 -batch_type tokens \
         -normalization tokens -max_grad_norm 0  -accum_count 4 \
        -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam  \
        -warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
        -layers 4 -rnn_size  384 -word_vec_size 384 \
        -encoder_type transformer -decoder_type transformer \
        -dropout 0.1 -position_encoding -share_embeddings  \
        -global_attention general -global_attention_function softmax \
        -self_attn_type scaled-dot -heads 8 -transformer_ff 2048

Mixed fine-tuning

DATADIR=data/ring_dataset
TRAIN_STEPS=6000   # additional fine-tuning steps on top of the 250000 pretraining steps

onmt_train -data $DATADIR/multi_task  \
        -train_from models/baseline_model.pt \
        -save_model  mixed_fine_tuned_model \
        -seed $SEED -gpu_ranks 0  \
        -train_steps $((250000 + TRAIN_STEPS)) -param_init 0 \
        -param_init_glorot -max_generator_batches 32 \
        -batch_size 6144 -batch_type tokens \
         -normalization tokens -max_grad_norm 0  -accum_count 4 \
        -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam  \
        -warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
        -layers 4 -rnn_size  384 -word_vec_size 384 \
        -encoder_type transformer -decoder_type transformer \
        -dropout 0.1 -position_encoding -share_embeddings  \
        -global_attention general -global_attention_function softmax \
        -self_attn_type scaled-dot -heads 8 -transformer_ff 2048

Step 3: Chemical reaction prediction

To test the model on new reactions, run:

DATADIR=data/ring_dataset
onmt_translate -model models/mixed_fine_tuned_model.pt -src $DATADIR/product-test.txt -output predictions.txt  -n_best 1 -beam_size 5 -max_length 300 -batch_size 64 

To perform ensemble decoding, run:

DATADIR=data/ring_dataset
onmt_translate -model models/baseline_model.pt models/fine_tuned_model.pt -src $DATADIR/product-test.txt -output ensemble_predictions.txt  -n_best 1 -beam_size 5 -max_length 300 -batch_size 64
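
To score the predictions, one common approach is to canonicalize both the predicted and the ground-truth reactants with RDKit and count exact matches. The sketch below assumes a reactant-test.txt file with the ground-truth reactants alongside product-test.txt; fragments are canonicalized individually and sorted so that their order does not affect the comparison:

from rdkit import Chem

def canonicalize(tokenized_smiles):
    """Strip tokenization spaces and return a canonical, fragment-sorted SMILES (None if invalid)."""
    smi = tokenized_smiles.strip().replace(" ", "")
    frags = []
    for frag in smi.split("."):
        mol = Chem.MolFromSmiles(frag)
        if mol is None:
            return None
        frags.append(Chem.MolToSmiles(mol))
    return ".".join(sorted(frags))

with open("predictions.txt") as f_pred, open("data/ring_dataset/reactant-test.txt") as f_true:
    pairs = [(canonicalize(p), canonicalize(t)) for p, t in zip(f_pred, f_true)]

correct = sum(1 for pred, true in pairs if pred is not None and pred == true)
print("Top-1 accuracy: {:.1f}%".format(100.0 * correct / len(pairs)))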
 

Models

The models must be downloaded from https://doi.org/10.6084/m9.figshare.25723818 and placed in a models folder (matching the -model paths used in the commands above). The models provided are:

  • pretrained (baseline) retrosynthesis prediction model
  • forward reaction prediction multi-task model (used for round-trip accuracy calculation)
  • retrosynthesis prediction multi-task, fine-tuned and mixed fine-tuned models

Citation

@misc{wieczorek_transfer_2024,
	title = {Transfer learning for {Heterocycle} {Synthesis} {Prediction}},
	url = {https://chemrxiv.org/engage/chemrxiv/article-details/6617d56321291e5d1d9ef449},
	doi = {10.26434/chemrxiv-2024-ngqqg},
	publisher = {ChemRxiv},
	author = {Wieczorek, Ewa and Sin, Joshua W. and Holland, Matthew T. O. and Wilbraham, Liam and Perez, Victor S. and Bradley, Anthony and Miketa, Dominik and Brennan, Paul E. and Duarte, Fernanda},
	month = may,
	year = {2024}
}


This work is based on OpenNMT-py; if you reuse this code, please also cite the underlying framework.

OpenNMT: Neural Machine Translation Toolkit

OpenNMT technical report

@inproceedings{opennmt,
  author    = {Guillaume Klein and
               Yoon Kim and
               Yuntian Deng and
               Jean Senellart and
               Alexander M. Rush},
  title     = {Open{NMT}: Open-Source Toolkit for Neural Machine Translation},
  booktitle = {Proc. ACL},
  year      = {2017},
  url       = {https://doi.org/10.18653/v1/P17-4012},
  doi       = {10.18653/v1/P17-4012}
}