From Molecules to Mixtures: Learning Representations of Olfactory Mixture Similarity using Inductive Biases
Create a conda environment:
conda env create -f environment.yml
conda activate pommix
The datasets
folder contains the olfactory properties of single molecules, obtained from the OpenPOM version of the GoodScents-Leffingwell (GS-LF) dataset. This folder also contains the olfactory similarity metrics for mixtures, contained in mixtures_combined.csv
obtained from Snitz et al. (2013), Bushdid et al. (2014) and Ravia et al. (2020). The SMILES strings of the components in the multimolecular mixtures can be found in mixture_smi_definitions_clean_found_duplicates.csv
.
The datasets
folder also contains the pre-determined random splits (prefix random
), cross-validation splits (prefix random_cv
), splitting by number of components in a mixture (in Figure 5a, prefix ablate_components
), and the leave-molecules-out splits (LMO, in Figure 5b, prefix lso_molecules
)
The datasets
folder contains the features used for the RDKit baselines mixture_rdkit_definitions_clean_found_duplicates.csv
and the principal odor map (POM) embeddings used for the XGBoost models mixture_pom_embeddings.npz
. To run the baselines, use the scripts and splits contained in the scripts/baseline
folder.
scripts/pom
contains the code to generate olfactory molecular embeddings by predicting the odor descriptors for a molecule based on the GS-LF dataset.
# run pretraining
# parameters determined from hyperparameter optimization
# model saved in `gs-lf_models/pretrained_pom`
python run_pretraining.py --depth=4 --hidden_dim=320 --dropout=0.1 --lr=0.0001 --tag='pretrained_pom'
scripts/chemix
contains the code to take mixture embeddings to predict the olfactory similarity of two mixtures. Pretrained POM is pulled from the previous step.
# run chemix with pom embeddings
# parameters are from default.yaml, which are determined from hyperparameter optimization
# model saved in `results/{split_type}/`
python run_chemix.py --split={split_type}
scripts/chemix/data_utilities
contain scripts used to generate the mixture features and splits.
scripts/pommix
contains the code that trains both POM and CheMix end-to-end to improve the predictive performance of molecular embeddings generated from the POM. The POM model pulled from first step. The CheMix model is pulled from the same splits used as previous step.
# run POMMix
# parameters generated from previous steps
# model saved in `results/{split_type}/`
python run_pommix.py --split={split_type}
scripts/baseline
contains baseline models used for comparison. This includes Snitz baseline, and XGBoost with RDKit features.
To perform Snitz baseline, the three steps as detailed in the appendix are separated into 3 scripts:
python run_snitz_step1.py
python run_snitz_step2.py
python run_snitz_step3.py # final features are saved in `snitz_step3`
python run_snitz.py # results saved in `snitz_similarity`
To perform XGBoost models for either RDKit or POM features:
python run_xgb.py --feat_type={"rdkit2d", "pom_embeddings", "molt5embeddings"} # results saved in `xgb_{feat_type}`. Molt5 will need to be generated
All figures generated in the paper can be found using scripts found in the figures
folder -- most of them are in figures.ipynb
.
To further study the performance of the POM embeddings, we also test the use of MolT5 (Edwards et al. 2022) embeddings. The embeddings can be generated using the molfeat package. The script to generate the embeddings are created in the scripts/chemix/data_utilities/make_molt5.py
.
We use the embeddings with XGB in scripts/baseline/
and also CheMix scripts/chemix/
.
To test the GNN archiecture, we also perform experiments with the Graphormer and the GPS models, which have performed SOTA on larger datasets.
We direct the user to use the repository for Graphromer, we use the default of Graphromer-slim. For GPS, the script is found in scripts/pom/gpsconv.py
.
The model of CheMix has been modified with different prediction heads. The results are found in scripts/chemix/results/random_cv/
. We finally use the scaled cosine distance, which is reflected in the final model architecture.