A library to build QSAR models quickly.
Install LazyQSAR from source:
git clone https://github.com/ersilia-os/lazy-qsar.git
cd lazy-qsar
python -m pip install -e .
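After installing, it can be useful to confirm that the package is visible to the interpreter you plan to use. The sketch below is only an illustration: `is_importable` is a hypothetical helper, not part of LazyQSAR, and the import name `lazyqsar` is taken from the examples in this README.

```python
import importlib.util


def is_importable(package_name):
    """Return True if a top-level package can be located by this interpreter."""
    return importlib.util.find_spec(package_name) is not None


# Reports whether the editable install is visible in the current environment
print("lazyqsar importable:", is_importable("lazyqsar"))
```

If this prints `False`, the install most likely went into a different Python environment than the one running the check.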
- Choose one of the available descriptors of small molecules.
- Fit a model using AutoML. LazyQSAR will search over several hyperparameters.
- Validate the model on the test set.
You can find example data in the fantastic Therapeutic Data Commons portal.
from tdc.single_pred import Tox
data = Tox(name="hERG")
split = data.get_split()
Here we are selecting the hERG blockade toxicity dataset. Let's reformat the data for convenience.
# convert the fetched data to plain Python lists
smiles_train = list(split["train"]["Drug"])
y_train = list(split["train"]["Y"])
smiles_valid = list(split["valid"]["Drug"])
y_valid = list(split["valid"]["Y"])
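The lines above repeat the same pattern for each partition. A minimal sketch of a helper that wraps this pattern follows. `unpack_split` is a hypothetical name (it is not part of LazyQSAR or TDC), and `toy_split` is a made-up stand-in for the object returned by `data.get_split()`, assuming it maps partition names to tables with "Drug" and "Y" columns as in the snippet above.

```python
def unpack_split(split, part, smiles_col="Drug", label_col="Y"):
    """Return (smiles, labels) as plain lists for one partition of a split.

    Hypothetical helper: works with any mapping whose values support
    column access by name (a pandas DataFrame or a dict of lists).
    """
    table = split[part]
    smiles = list(table[smiles_col])
    labels = list(table[label_col])
    if len(smiles) != len(labels):
        raise ValueError(f"{part}: {len(smiles)} SMILES but {len(labels)} labels")
    return smiles, labels


# Toy stand-in for the TDC split (made-up molecules and labels, illustration only)
toy_split = {
    "train": {"Drug": ["CCO", "c1ccccc1", "CC(=O)O"], "Y": [0, 1, 0]},
    "valid": {"Drug": ["CCN", "CCCC"], "Y": [1, 0]},
}

toy_smiles_train, toy_y_train = unpack_split(toy_split, "train")
toy_smiles_valid, toy_y_valid = unpack_split(toy_split, "valid")
```

The length check is a cheap guard against mismatched inputs before they reach the model.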
Now we can train a model based on Morgan fingerprints.
import lazyqsar as lq
model = lq.LazyBinaryQSAR(descriptor_type="morgan", model_type="xgboost")
model.fit(smiles_list=smiles_train, y=y_train)
model.save_model(model_dir="my_model")
from sklearn.metrics import roc_curve, auc
# keep column 1 of the predicted probabilities (the positive class)
y_hat = model.predict_proba(smiles_valid)[:, 1]
fpr, tpr, _ = roc_curve(y_valid, y_hat)
print("AUROC", auc(fpr, tpr))
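To make the reported number easier to interpret: AUROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python on made-up labels and scores. `auroc_pairwise` is a hypothetical helper for illustration only; it is not how scikit-learn or LazyQSAR computes the metric internally, although the resulting value agrees with the ROC-curve area.

```python
def auroc_pairwise(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair. The cost is
    O(n_pos * n_neg), so this is only meant for small examples.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print("AUROC", auroc_pairwise(toy_y, toy_scores))  # prints: AUROC 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.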
In the current version of LazyQSAR, regression is not yet implemented...
You can find example data in the fantastic Therapeutic Data Commons portal.
from tdc.single_pred import Tox
data = Tox(name="LD50_Zhu")
split = data.get_split()
Here we are selecting the Acute Toxicity dataset. Let's reformat the data for convenience.
# convert the fetched data to plain Python lists
smiles_train = list(split["train"]["Drug"])
y_train = list(split["train"]["Y"])
smiles_valid = list(split["valid"]["Drug"])
y_valid = list(split["valid"]["Y"])
Now we can train a model based on Morgan fingerprints.
import lazyqsar as lq
model = lq.MorganRegressor()
# time_budget (in seconds) and estimator_list can be passed as parameters of the regressor.
# They default to 20 s and to all the estimators available in FLAML.
model.fit(smiles_train, y_train)
from sklearn.metrics import mean_absolute_error, r2_score
y_hat = model.predict(smiles_valid)
mae = mean_absolute_error(y_valid, y_hat)
r2 = r2_score(y_valid, y_hat)
print("MAE", mae, "R2", r2)
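For reference, the two metrics printed above have simple definitions: MAE is the mean of the absolute errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below spells these out in plain Python on made-up numbers. `mae_plain` and `r2_plain` are hypothetical helper names used only for illustration.

```python
def mae_plain(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_plain(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]
print("MAE", mae_plain(toy_true, toy_pred), "R2", r2_plain(toy_true, toy_pred))
```

A model that always predicts the mean of the true values gets an R2 of 0, so R2 can be read as the improvement over that baseline, while MAE stays in the units of the measured endpoint.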
The pipeline has been validated using the Therapeutic Data Commons ADMET datasets. Detailed results can be found in the /benchmark folder.
This library is only intended for quick-and-dirty QSAR modeling. For more complete automated QSAR modeling, please refer to Zaira Chem.
Learn about the Ersilia Open Source Initiative!