
Heterocycle Retrosynthesis

This repository complements our publication "Transfer learning for Heterocycle Synthesis Prediction": https://chemrxiv.org/engage/chemrxiv/article-details/6617d56321291e5d1d9ef449

Requirements

The specific versions used in this project were:

  • Python: 3.6.9
  • PyTorch (torch): 1.2.0
  • TorchText: 0.4.0
  • OpenNMT-py (ONMT): 1.0.0
  • RDKit: 2019.03.2

Conda Environment Setup

conda create -n het-retro python=3.6
conda activate het-retro
conda install -c rdkit rdkit=2019.03.2 -y
conda install -c pytorch pytorch=1.2.0 -y
git clone https://github.com/ewawieczorek/Het-retro.git
cd Het-retro
pip install -e .
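
To confirm that the environment matches the versions listed above, a quick check such as the following can be run (a minimal sketch; it only prints the installed package versions):

import rdkit
import torch
import torchtext

# The reported versions should match those listed under Requirements.
print("torch    :", torch.__version__)      # expected 1.2.0
print("torchtext:", torchtext.__version__)  # expected 0.4.0
print("rdkit    :", rdkit.__version__)      # expected 2019.03.2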

Quickstart

The training and evaluation were performed using OpenNMT-py. The full documentation of the OpenNMT-py library can be found at https://opennmt.net/OpenNMT-py/.

Step 1: Preprocess the data

Start by preparing the Ring and USPTO datasets as described in their respective directories.

Single data sets

This preprocessing approach is suitable for pre-training and fine-tuning:

DATADIR=data/uspto_dataset
onmt_preprocess -train_src $DATADIR/product-train.txt -train_tgt $DATADIR/reactant-train.txt -valid_src $DATADIR/product-valid.txt -valid_tgt $DATADIR/reactant-valid.txt -save_data $DATADIR/uspto -src_seq_length 3000 -tgt_seq_length 3000 -src_vocab_size 3000 -tgt_vocab_size 3000 -share_vocab
DATADIR=data/ring_dataset
onmt_preprocess -train_src $DATADIR/product-train.txt -train_tgt $DATADIR/reactant-train.txt -valid_src $DATADIR/product-valid.txt -valid_tgt $DATADIR/reactant-valid.txt -save_data $DATADIR/sequential -src_seq_length 3000 -tgt_seq_length 3000 -src_vocab_size 3000 -tgt_vocab_size 3000 -share_vocab

Multi-task data sets

This preprocessing approach is suitable for multi-task learning and mixed fine-tuning:

DATASET=data/uspto_dataset
DATASET_TRANSFER=data/ring_dataset

onmt_preprocess -train_src ${DATASET}/product-train.txt ${DATASET_TRANSFER}/product-train.txt -train_tgt ${DATASET}/reactant-train.txt ${DATASET_TRANSFER}/reactant-train.txt -train_ids uspto ring  -valid_src ${DATASET_TRANSFER}/product-valid.txt -valid_tgt ${DATASET_TRANSFER}/reactant-valid.txt -save_data ${DATASET_TRANSFER}/multi_task -src_seq_length 3000 -tgt_seq_length 3000 -src_vocab_size 3000 -tgt_vocab_size 3000 -share_vocab

The files have been tokenized beforehand using the reaction-SMILES tokenization function adapted from https://github.com/pschwllr/MolecularTransformer (sketched below).
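
For reference, the tokenizer splits a reaction SMILES into atom, bracket, bond and ring-closure tokens with a single regular expression. A minimal sketch closely following the upstream MolecularTransformer function (the copy used in this repository may differ slightly):

import re

def smi_tokenizer(smi):
    """Tokenize a SMILES molecule or reaction string."""
    pattern = "(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
    tokens = re.findall(pattern, smi)
    assert smi == "".join(tokens)  # no characters should be lost
    return " ".join(tokens)

print(smi_tokenizer("CC(=O)Oc1ccccc1C(=O)O"))
# C C ( = O ) O c 1 c c c c c 1 C ( = O ) O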

The data consist of parallel precursor (reactant) and product files containing one reaction per line, with tokens separated by single spaces:

  • reactant-train.txt
  • product-train.txt
  • reactant-valid.txt
  • product-valid.txt
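
Before preprocessing, it is worth checking that each product/reactant pair really is parallel, i.e. that the two files have the same number of lines (paths follow the commands above):

# Sanity check: source (product) and target (reactant) files must be line-aligned.
for split in ("train", "valid"):
    with open(f"data/uspto_dataset/product-{split}.txt") as f_src, \
         open(f"data/uspto_dataset/reactant-{split}.txt") as f_tgt:
        n_src = sum(1 for _ in f_src)
        n_tgt = sum(1 for _ in f_tgt)
    print(split, n_src, n_tgt)
    assert n_src == n_tgt, f"{split}: product/reactant files are not parallel"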

After running the preprocessing, the following files are generated:

  • uspto.train.pt: serialized PyTorch file containing training data
  • uspto.valid.pt: serialized PyTorch file containing validation data
  • uspto.vocab.pt: serialized PyTorch file containing vocabulary data

Internally the model never operates on the tokens themselves, but on their integer indices in the vocabulary built during preprocessing.
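
The shared vocabulary can be inspected directly from the serialized file. The snippet below is a sketch that assumes the OpenNMT-py 1.0 field layout (a dict of torchtext fields, where text fields wrap a base field holding the vocabulary); adjust the attribute access if your version differs:

import torch

# Load the fields saved by onmt_preprocess (layout assumed from OpenNMT-py 1.0).
fields = torch.load("data/uspto_dataset/uspto.vocab.pt")

for name, field in fields.items():
    base = getattr(field, "base_field", field)  # text fields wrap a base Field
    vocab = getattr(base, "vocab", None)
    if vocab is not None:
        print(name, "vocab size:", len(vocab))
        print("  first entries:", vocab.itos[:10])  # itos maps index -> token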

Step 2: Train the model

The transformer models were trained using the following hyperparameters:

Pretraining

DATADIR=data/uspto_dataset
SEED=42   # random seed used for all training runs below
onmt_train -data $DATADIR/uspto  \
        -save_model  baseline_model \
        -seed $SEED -gpu_ranks 0  \
        -train_steps 250000 -param_init 0 \
        -param_init_glorot -max_generator_batches 32 \
        -batch_size 6144 -batch_type tokens \
         -normalization tokens -max_grad_norm 0  -accum_count 4 \
        -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam  \
        -warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
        -layers 4 -rnn_size  384 -word_vec_size 384 \
        -encoder_type transformer -decoder_type transformer \
        -dropout 0.1 -position_encoding -share_embeddings  \
        -global_attention general -global_attention_function softmax \
        -self_attn_type scaled-dot -heads 8 -transformer_ff 2048

Multi-task transfer learning

DATADIR=data/ring_dataset
WEIGHT1=9   # relative sampling weight of the uspto data
WEIGHT2=1   # relative sampling weight of the ring data

onmt_train -data $DATADIR/multi_task  \
        -save_model  multi_task_model \
        -data_ids uspto ring -data_weights $WEIGHT1 $WEIGHT2 \
        -seed $SEED -gpu_ranks 0  \
        -train_steps 250000 -param_init 0 \
        -param_init_glorot -max_generator_batches 32 \
        -batch_size 6144 -batch_type tokens \
         -normalization tokens -max_grad_norm 0  -accum_count 4 \
        -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam  \
        -warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
        -layers 4 -rnn_size  384 -word_vec_size 384 \
        -encoder_type transformer -decoder_type transformer \
        -dropout 0.1 -position_encoding -share_embeddings  \
        -global_attention general -global_attention_function softmax \
        -self_attn_type scaled-dot -heads 8 -transformer_ff 2048

Fine-tuning

DATADIR=data/ring_dataset
TRAIN_STEPS=6000   # additional fine-tuning steps on top of the 250000 pretraining steps

onmt_train -data $DATADIR/sequential  \
        -train_from models/baseline_model.pt \
        -save_model  fine_tuned_model \
        -seed $SEED -gpu_ranks 0  \
        -train_steps $((250000 + TRAIN_STEPS)) -param_init 0 \
        -param_init_glorot -max_generator_batches 32 \
        -batch_size 6144 -batch_type tokens \
         -normalization tokens -max_grad_norm 0  -accum_count 4 \
        -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam  \
        -warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
        -layers 4 -rnn_size  384 -word_vec_size 384 \
        -encoder_type transformer -decoder_type transformer \
        -dropout 0.1 -position_encoding -share_embeddings  \
        -global_attention general -global_attention_function softmax \
        -self_attn_type scaled-dot -heads 8 -transformer_ff 2048

Mixed fine-tuning

DATADIR=data/ring_dataset
TRAIN_STEPS=6000   # additional fine-tuning steps on top of the 250000 pretraining steps

onmt_train -data $DATADIR/multi_task  \
        -train_from models/baseline_model.pt \
        -save_model  mixed_fine_tuned_model \
        -seed $SEED -gpu_ranks 0  \
        -train_steps $((250000 + TRAIN_STEPS)) -param_init 0 \
        -param_init_glorot -max_generator_batches 32 \
        -batch_size 6144 -batch_type tokens \
         -normalization tokens -max_grad_norm 0  -accum_count 4 \
        -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam  \
        -warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
        -layers 4 -rnn_size  384 -word_vec_size 384 \
        -encoder_type transformer -decoder_type transformer \
        -dropout 0.1 -position_encoding -share_embeddings  \
        -global_attention general -global_attention_function softmax \
        -self_attn_type scaled-dot -heads 8 -transformer_ff 2048

Step 3: Chemical reaction prediction

To test the model on new reactions, run:

DATADIR=data/ring_dataset
onmt_translate -model models/mixed_fine_tuned_model.pt -src $DATADIR/product-test.txt -output predictions.txt  -n_best 1 -beam_size 5 -max_length 300 -batch_size 64 

To perform ensemble decoding, run:

DATADIR=data/ring_dataset
onmt_translate -model models/baseline_model.pt models/fine_tuned_model.pt -src $DATADIR/product-test.txt -output ensemble_predictions.txt  -n_best 1 -beam_size 5 -max_length 300 -batch_size 64
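
To score the predictions, one common approach is to canonicalize both the predicted and the ground-truth reactants with RDKit and count exact matches. The sketch below assumes a reactant-test.txt file with the ground-truth reactants alongside product-test.txt; fragments are canonicalized individually and sorted so that their order does not affect the comparison:

from rdkit import Chem

def canonicalize(tokenized_smiles):
    """Strip tokenization spaces and return a canonical, fragment-sorted SMILES (None if invalid)."""
    smi = tokenized_smiles.strip().replace(" ", "")
    frags = []
    for frag in smi.split("."):
        mol = Chem.MolFromSmiles(frag)
        if mol is None:
            return None
        frags.append(Chem.MolToSmiles(mol))
    return ".".join(sorted(frags))

with open("predictions.txt") as f_pred, open("data/ring_dataset/reactant-test.txt") as f_true:
    pairs = [(canonicalize(p), canonicalize(t)) for p, t in zip(f_pred, f_true)]

correct = sum(1 for pred, true in pairs if pred is not None and pred == true)
print("Top-1 accuracy: {:.1f}%".format(100.0 * correct / len(pairs)))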
 

Models

The models must be downloaded from https://doi.org/10.6084/m9.figshare.25723818 and placed in a models folder (matching the -model paths used in the commands above). The models provided are:

  • pretrained (baseline) retrosynthesis prediction model
  • forward reaction prediction multi-task model (used for round-trip accuracy calculation)
  • retrosynthesis prediction multi-task, fine-tuned and mixed fine-tuned models

Citation

@misc{wieczorek_transfer_2024,
	title = {Transfer learning for {Heterocycle} {Synthesis} {Prediction}},
	url = {https://chemrxiv.org/engage/chemrxiv/article-details/6617d56321291e5d1d9ef449},
	doi = {10.26434/chemrxiv-2024-ngqqg},
	publisher = {ChemRxiv},
	author = {Wieczorek, Ewa and Sin, Joshua W. and Holland, Matthew T. O. and Wilbraham, Liam and Perez, Victor S. and Bradley, Anthony and Miketa, Dominik and Brennan, Paul E. and Duarte, Fernanda},
	month = may,
	year = {2024}
}


This work is based on OpenNMT-py; if you reuse this code, please also cite the underlying framework.

OpenNMT: Neural Machine Translation Toolkit

OpenNMT technical report

@inproceedings{opennmt,
  author    = {Guillaume Klein and
               Yoon Kim and
               Yuntian Deng and
               Jean Senellart and
               Alexander M. Rush},
  title     = {Open{NMT}: Open-Source Toolkit for Neural Machine Translation},
  booktitle = {Proc. ACL},
  year      = {2017},
  url       = {https://doi.org/10.18653/v1/P17-4012},
  doi       = {10.18653/v1/P17-4012}
}