Activity Differences - Quantitative Structure Activity Relationship
Park, G.J., Kang, N.S.
Journal of Computer-Aided Molecular Design 37, 435–451 (2023).
This code was tested in Python 3.11.
A YAML file listing all requirements is provided.
The environment can be set up with conda:
conda env create -f ADis-QSAR-env.yaml
conda activate ADis-QSAR-env
- Prepare
Check the compounds collected from ChEMBL and classify them as active or inactive
Compounds are automatically classified using the criteria below:
active (IC50, Ki, Kd <= 100nM), inactive (IC50, Ki, Kd >= 1000nM, %Inhibition <= 20%)
The input data format is based on ChEMBL ('-chembl' option)
To include %Inhibition assay results, add the '-i' option
This code can only be applied to raw ChEMBL data
Outputs : active, inactive and total compounds
python Prepare.py -d raw_data_path -o output_path -i -chembl
For example:
python Prepare.py -d Dataset/ChEMBL/ALK/ALK_raw.tsv -o Dataset/ChEMBL/ALK -i -chembl
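The classification criteria above can be sketched in a few lines of Python. This is only an illustration of the thresholds, not the code in Prepare.py: the function name `classify`, the assay label "Inhibition", and the choice to leave compounds between 100 nM and 1000 nM unlabeled are all assumptions, since the README does not state how those cases are handled.

```python
from typing import Optional

ACTIVE_MAX_NM = 100.0            # active: IC50, Ki, Kd <= 100 nM
INACTIVE_MIN_NM = 1000.0         # inactive: IC50, Ki, Kd >= 1000 nM
INACTIVE_MAX_INHIBITION = 20.0   # inactive: %Inhibition <= 20%

POTENCY_TYPES = {"IC50", "Ki", "Kd"}


def classify(assay_type: str, value: float, use_inhibition: bool = False) -> Optional[str]:
    """Return 'active', 'inactive', or None for a single measurement.

    Potency values are assumed to be in nM, inhibition values in percent.
    """
    if assay_type in POTENCY_TYPES:
        if value <= ACTIVE_MAX_NM:
            return "active"
        if value >= INACTIVE_MIN_NM:
            return "inactive"
        return None  # between the thresholds: left unlabeled in this sketch
    # %Inhibition results are only considered when requested (cf. the '-i' option)
    if use_inhibition and assay_type == "Inhibition" and value <= INACTIVE_MAX_INHIBITION:
        return "inactive"
    return None
```

For instance, `classify("IC50", 50)` falls under the active threshold, while `classify("Inhibition", 10)` is ignored unless `use_inhibition=True` is passed.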
- Preprocess
Selecting central structures (50 compounds) and generating descriptors using a pair system
Fingerprint types : radius_size (2: ECFP4, 3: ECFP6), number_of_bits (256, 512)
The scaler can be chosen from three options: Standard, MinMax and Robust
Afterwards, the compounds are divided into training (train), validation (valid) and test (test) sets
If you want to generate the test set from other data, use the '-t' option
The default value for the validation set size is 0.2, but it can be changed ('-v' option)
The numbers of active and inactive compounds are automatically adjusted to a ratio of 1:1.5 in each set
Outputs : g1 (50 compounds), train, valid and test sets
python Preprocessing.py -a active_path -i inactive_path -o output_path -v valid_size -r radius_size -b number_of_bits -s scaler_type -core num_cores -t
For example:
python Preprocessing.py -a Dataset/ChEMBL/ALK/ALK_prepare/ALK_active.tsv -i Dataset/ChEMBL/ALK/ALK_prepare/ALK_inactive.tsv -o Dataset/ChEMBL/ALK -v 0.2 -r 2 -b 256 -s Standard -core 12
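To make the 1:1.5 ratio and the validation split concrete, here is a minimal sketch using only the standard library. It assumes that balancing is done by random subsampling and that the split is a plain random split; the README does not say how Preprocessing.py implements either step, and the names `balance` and `split_valid` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def balance(actives: Sequence[str], inactives: Sequence[str],
            ratio: float = 1.5, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Subsample both classes so that active:inactive is at most 1:ratio."""
    rng = random.Random(seed)
    # Largest number of actives that can still be paired with enough inactives
    n_act = min(len(actives), int(len(inactives) / ratio))
    n_inact = min(len(inactives), int(n_act * ratio))
    return rng.sample(list(actives), n_act), rng.sample(list(inactives), n_inact)


def split_valid(items: Sequence[str], valid_size: float = 0.2,
                seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle and split into (train, valid); valid_size mirrors the '-v' option."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_size)
    return shuffled[n_valid:], shuffled[:n_valid]
```

With 100 actives and 300 inactives, `balance` keeps all 100 actives and 150 inactives, and `split_valid` with the default size of 0.2 then moves 20 of the actives into the validation set.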
- ADis_QSAR
Start model training
Select a model type such as SVM, MLP, RF or XGB ('-m' option)
If a test set is available, pass it with the '-test' option
The test set is not used in training or validation
Prediction results can be obtained directly from the generated model
Outputs : model, log files
python ADis_QSAR.py -train train_path -valid valid_path -test test_path -m model_type -o output_path -core num_cores
For example:
python ADis_QSAR.py -train Dataset/ChEMBL/ALK/ALK_preprocessing/ALK_train_vector.tsv -valid Dataset/ChEMBL/ALK/ALK_preprocessing/ALK_valid_vector.tsv -test Dataset/ChEMBL/ALK/ALK_preprocessing/ALK_test_vector.tsv -m SVM -o Dataset/ChEMBL/ALK/ALK_preprocessing -core 12
- Predict
Predict an external dataset with the generated model
To apply the trained model to an external dataset, use the following command
Outputs : predict log file
python Predict.py -m model_path -e external_path -n external_name -o output_path -ev
For example:
python Predict.py -m Dataset/ChEMBL/ALK/ALK_preprocessing/ALK_model/SVM/ALK_SVM_model.pkl -e Dataset/ChEMBL/ALK/ALK_preprocessing/ALK_test_vector.tsv -n ext -o Dataset/ChEMBL/ALK -core 12 -ev
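The four steps can also be chained from one Python driver. The sketch below only assembles the example commands shown in this README and optionally runs them in order. The helper names `pipeline_commands` and `run_pipeline` are hypothetical, and the assumption that other targets follow the same path pattern as the ALK example is not confirmed by the README.

```python
import subprocess
from typing import List


def pipeline_commands(target: str = "ALK", root: str = "Dataset/ChEMBL",
                      model: str = "SVM", cores: int = 12) -> List[List[str]]:
    """Build the Prepare -> Preprocessing -> ADis_QSAR -> Predict commands."""
    base = f"{root}/{target}"
    prep = f"{base}/{target}_prepare"          # path pattern taken from the ALK example
    pre = f"{base}/{target}_preprocessing"     # path pattern taken from the ALK example
    return [
        ["python", "Prepare.py", "-d", f"{base}/{target}_raw.tsv",
         "-o", base, "-i", "-chembl"],
        ["python", "Preprocessing.py", "-a", f"{prep}/{target}_active.tsv",
         "-i", f"{prep}/{target}_inactive.tsv", "-o", base,
         "-v", "0.2", "-r", "2", "-b", "256", "-s", "Standard", "-core", str(cores)],
        ["python", "ADis_QSAR.py", "-train", f"{pre}/{target}_train_vector.tsv",
         "-valid", f"{pre}/{target}_valid_vector.tsv",
         "-test", f"{pre}/{target}_test_vector.tsv",
         "-m", model, "-o", pre, "-core", str(cores)],
        ["python", "Predict.py",
         "-m", f"{pre}/{target}_model/{model}/{target}_{model}_model.pkl",
         "-e", f"{pre}/{target}_test_vector.tsv", "-n", "ext", "-o", base,
         "-core", str(cores), "-ev"],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each step in order and stop at the first one that fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(pipeline_commands())` from the repository root would run the ALK example end to end, provided the raw data file exists at the path shown above.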
This code generates a baseline model for comparison with the performance of ADis-QSAR
Training a model using raw fingerprints (binary data) with the ADis-QSAR approach
The entire process for performing ADis-QSAR is executed automatically
python Baseline_run.py
This is the execution code for comparing the performance of ADis-QSAR based on various parameter changes
The parameters that are being modified are as follows:
a. number of center structures (g1) : [20, 50, 80]
b. vary the radius size : [ECFP4, ECFP6]
c. vary the number of bits : [256, 512]
d. vary the scaler : [Standard, MinMax, Robust]
The entire process of performing ADis-QSAR, including parameter switching, is performed automatically
python Vary_params_run.py
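As an illustration of the parameter space, the combinations can be enumerated with `itertools.product`. This is a sketch rather than the logic of Vary_params_run.py: the README does not say whether the script sweeps the full grid or varies one parameter at a time, and the scaler choices here are taken from the Preprocess section (Standard, MinMax, Robust), which is an assumption about what the script varies. The name `parameter_grid` is hypothetical.

```python
from itertools import product
from typing import Dict, List, Union

G1_SIZES = [20, 50, 80]                     # number of center structures (g1)
RADII = {"ECFP4": 2, "ECFP6": 3}            # fingerprint name -> radius_size
N_BITS = [256, 512]                         # number_of_bits
SCALERS = ["Standard", "MinMax", "Robust"]  # assumed from the Preprocess options


def parameter_grid() -> List[Dict[str, Union[int, str]]]:
    """Enumerate every combination of the four parameters (full grid)."""
    return [
        {"g1": g1, "radius": radius, "bits": bits, "scaler": scaler}
        for g1, radius, bits, scaler in product(G1_SIZES, RADII.values(), N_BITS, SCALERS)
    ]
```

Under these assumptions the full grid has 3 x 2 x 2 x 3 = 36 combinations.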
Please submit a GitHub issue or contact me [email protected]
Thanks to our laboratory.
If you find this code useful, please consider citing my work.