
Commit 4082161

jannisborn, eovchinn, and Ovchinnikova Katja (ko20g613) authored
Merging PaccMannV2 and improved documentation (#19)
* wip: Bumping paccmann training script to new pytoda
* feat: ContextAttention - sequence averaging optional
* feat: PaccMannV2 with ContextAttention on genes added
* feat: Dose response PaccMann model and training script
* dose added at test, predictions and labels ravel
* minor fix for computing pearsonr for dataset.device == cuda mode
* minor fixes for using gpu and default version of torch, params changed
* test_paccmann_dose.py added for applying trained model to new data
* shuffle and drop_last set to False for validation and testing
* fix related to torch version update
* minor output path fix
* smiles_language for test_dataset changed to test_smiles_language
* save gene attention
* Confidence in dose model (#11)
* refactor: set up logging in model classes and adapt confidence logic to tuples
* feat: Add confidence computation in dose model
* confidence added in the testing script (Co-authored-by: Ovchinnikova Katja (ko20g613) <[email protected]>)
* script for finetuning a model added
* minor fix
* feat: KNN predictor for DrugResponse (#12)
* refactor: KNN operates on list instead of DF
* script to run knn model
* knn_dose model added
* minor fix
* out of loop calc of cell and drug distances
* index col added
* code optimized, dose loop replaced with vectorization
* default k=3
* minor output fix
* logging improved
* fixed sorting bug
* add normalization
* script for testing IC50 model (Co-authored-by: eovchinn <[email protected]>, Ovchinnikova Katja (ko20g613) <[email protected]>)
* wip: KNN modularization (FP and chirality type)
* Revert "refactor: remove additional examples" (reverts commit 77ddae9)
* feat: Testing script for latest paccmann models
* wip: newest pytoda eval script
* Example (#18)
* doc: Beautify README
* ci: set up GA (#15)
* doc: ascii compatible bibtex [skip ci]
* rebase to dev
* wip: remove dose specific stuff
* wip: remove kinase files and fix CI
* fix: update model factory

Co-authored-by: eovchinn <[email protected]>
Co-authored-by: Ovchinnikova Katja (ko20g613) <[email protected]>
1 parent 907dc06 commit 4082161

File tree

16 files changed: +1670 -180 lines changed

.github/workflows/build.yml

Lines changed: 0 additions & 1 deletion
```diff
@@ -57,7 +57,6 @@ jobs:
       - name: IC50 - Install dependencies and run tests
         run: |
           python3 -m pip install --upgrade pip
-          pip3 install --no-cache-dir -r examples/IC50/requirements.txt
           pip3 install --no-deps .
           python3 -c "import paccmann_predictor"
           python3 examples/IC50/train_paccmann.py -h
```

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -118,3 +118,6 @@ ENV/
 
 # trained models
 /models
+
+# IDE
+.vscode/
```

README.md

Lines changed: 28 additions & 23 deletions
````diff
@@ -10,44 +10,49 @@ anticancer drug sensitivity prediction and drug target affinity prediction. Plea
 
 - [_Toward explainable anticancer compound sensitivity prediction via multimodal attention-based convolutional encoders_](https://doi.org/10.1021/acs.molpharmaceut.9b00520) (*Molecular Pharmaceutics*, 2019). This is the original paper on IC50 prediction using drug properties and tissue-specific cell line information (gene expression profiles). While the original code was written in `tensorflow` and is available [here](https://github.com/drugilsberg/paccmann), this is the `pytorch` implementation of the best PaccMann architecture (multiscale convolutional encoder).
 
-- [Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2](https://iopscience.iop.org/article/10.1088/2632-2153/abe808) (_Machine Learning: Science and Technology_, 2021). In there, we propose a slightly modified version to predict drug-target binding affinities based on protein sequences and SMILES
-
-
-*NOTE*: PaccMann acronyms "Prediction of AntiCancer Compound sensitivity with Multi-modal Attention-based Neural Networks".
 
 **PaccMann for affinity prediction:**
-![Graphical abstract](https://github.com/PaccMann/paccmann_predictor/blob/master/assets/paccmann.png "Graphical abstract")
-
-
-## Requirements
+- [Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2](https://iopscience.iop.org/article/10.1088/2632-2153/abe808) (_Machine Learning: Science and Technology_, 2021). In there, we propose a slightly modified version to predict drug-target binding affinities based on protein sequences and SMILES
 
-- `conda>=3.7`
+![Graphical abstract](https://github.com/PaccMann/paccmann_predictor/blob/master/assets/paccmann.png "Graphical abstract")
 
 ## Installation
-
 The library itself has few dependencies (see [setup.py](setup.py)) with loose requirements.
-To run the example training script we provide environment files under `examples/IC50/`.
-
-Create a conda environment:
-
+First, set up the environment as follows:
 ```sh
 conda env create -f examples/IC50/conda.yml
-```
-
-Activate the environment:
-
-```sh
 conda activate paccmann_predictor
+pip install -e .
 ```
 
-Install in editable mode for development:
 
-```sh
-pip install -e .
+## Evaluate pretrained drug sensitivty model on your own data
+First, please consider using our public [PaccMann webservice](https://ibm.biz/paccmann-aas) as described in the [NAR paper](https://academic.oup.com/nar/article/48/W1/W502/5836770).
+
+To use our pretrained model, please download the model from: https://ibm.biz/paccmann-data (just download `models/single_pytorch_model`).
+For example, assuming that you downloaded this model in a directory called `single_pytorch_model`, the data from https://ibm.box.com/v/paccmann-pytoda-data in folders `data` and `splitted_data` the following command should work:
+```console
+(paccmann_predictor) $ python examples/IC50/test_paccmann.py \
+    splitted_data/gdsc_cell_line_ic50_test_fraction_0.1_id_997_seed_42.csv \
+    data/gene_expression/gdsc-rnaseq_gene-expression.csv \
+    data/smiles/gdsc.smi \
+    data/2128_genes.pkl \
+    single_pytorch_model/smiles_language \
+    single_pytorch_model/weights/best_mse_paccmann_v2.pt \
+    results \
+    single_pytorch_model/model_params.json
 ```
+*NOTE*: If you bring your own data, please make sure to provide the omic data for the 2128 genes specified in `data/2128_genes.pkl`. Your omic data (here it is `data/gene_expression/gdsc-rnaseq_gene-expression.csv`) can contain more columns and it does not need to follow the order of the pickled gene list. But please dont change this pickle file. Also note that this is PaccMannV2 which is slightly improved compared to the paper version (context attention on both modalities).
 
-## Example usage
+## Finetuning on your own data
+You can also **finetune** our pretrained model on your data instead of training a model from scratch. For that, please follow the instruction below for training on scratch and just set:
+- `model_path` --> directory where the `single_pytorch_model` is stored
+- `training_name` --> this should be `single_pytorch_model`
+- `params_filepath` --> `base_path/single_pytorch_model/model_params.json`
 
+
+## Training a model from scratch
+To run the example training script we provide environment files under `examples/IC50/`.
 In the `examples` directory is a training script [train_paccmann.py](./examples/IC50/train_paccmann.py) that makes use
 of `paccmann_predictor`.
 
````
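The *NOTE* in the diff above stresses that your own omic data must cover the 2128 genes listed in `data/2128_genes.pkl` (extra columns and a different column order are fine). Below is a small pre-flight check, as a sketch only: the file name `my_gene-expression.csv` is a placeholder, and the genes-as-columns layout mirrors the provided `gdsc-rnaseq_gene-expression.csv`.

```python
# Sketch (not part of the repo): verify that your own gene expression file
# covers all genes required by the pretrained model before running
# examples/IC50/test_paccmann.py. 'my_gene-expression.csv' is a placeholder.
import pickle

import pandas as pd

with open('data/2128_genes.pkl', 'rb') as f:
    gene_list = pickle.load(f)  # list of required gene identifiers

gep = pd.read_csv('my_gene-expression.csv', index_col=0)  # samples x genes
missing = set(gene_list) - set(gep.columns)
print(f'{len(missing)} of {len(gene_list)} required genes are missing')
```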

examples/IC50/conda.yml

Lines changed: 5 additions & 2 deletions
```diff
@@ -6,7 +6,10 @@ dependencies:
   - python>=3.6,<3.8
   - pip>=19.1
   - pip:
-    - pytoda @ git+https://github.com/PaccMann/paccmann_datasets@0.0.3
+    - pytoda==1.0.0
     - numpy>=1.14.3
     - scipy>=1.3.1
-    - torch==1.0.1
+    - torch>=1.7.1
+    - tqdm
+    - pandas
+
```
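Because this bump changes the pinned `pytoda` version and the minimum `torch` version, an environment created from the old `conda.yml` will not match the new requirements. A quick sketch (not part of the repo) for checking what is actually installed in the active environment:

```python
# Sketch: confirm the active environment matches the updated pins in
# examples/IC50/conda.yml (pytoda==1.0.0, torch>=1.7.1).
import pkg_resources

for package in ('pytoda', 'torch', 'numpy', 'scipy', 'tqdm', 'pandas'):
    try:
        print(package, pkg_resources.get_distribution(package).version)
    except pkg_resources.DistributionNotFound:
        print(package, 'NOT INSTALLED')
```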
examples/IC50/paccmann_v2_params.json

Lines changed: 57 additions & 0 deletions
New file:

```json
{
    "drug_sensitivity_min_max": true,
    "augment_smiles": true,
    "smiles_start_stop_token": true,
    "number_of_genes": 2128,
    "smiles_padding_length": 512,
    "stacked_dense_hidden_sizes": [1024, 512],
    "activation_fn": "relu",
    "dropout": 0.5,
    "batch_norm": true,
    "filters": [64, 64, 64],
    "molecule_heads": [4, 4, 4, 4],
    "gene_heads": [2, 2, 2, 2],
    "smiles_embedding_size": 16,
    "kernel_sizes": [[3, 16], [5, 16], [11, 16]],
    "smiles_attention_size": 64,
    "gene_attention_size": 1,
    "embed_scale_grad": false,
    "final_activation": true,
    "batch_size": 256,
    "lr": 0.01,
    "optimizer": "adam",
    "loss_fn": "mse",
    "epochs": 10,
    "save_model": 25,
    "dataset_device": "cpu"
}
```
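These hyperparameters are what the training and testing scripts pass straight to the model factory. Below is a minimal sketch (not part of the repository) of building a model from this file, mirroring how `examples/IC50/test_paccmann.py` does it; the `smiles_vocabulary_size` key and the `single_pytorch_model/smiles_language` path are assumptions based on the scripts and README in this commit.

```python
# Minimal sketch: instantiate a model from the new parameter file, following
# examples/IC50/test_paccmann.py. The scripts add data-dependent entries to
# `params` (notably the SMILES vocabulary size) before construction, so a
# pretrained smiles_language directory is assumed to be available here.
import json

from paccmann_predictor.models import MODEL_FACTORY
from paccmann_predictor.utils.utils import get_device
from pytoda.smiles.smiles_language import SMILESTokenizer

with open('examples/IC50/paccmann_v2_params.json') as fp:
    params = json.load(fp)

# Assumed path: the smiles_language folder shipped with single_pytorch_model.
smiles_language = SMILESTokenizer.from_pretrained('single_pytorch_model/smiles_language')
params['smiles_vocabulary_size'] = smiles_language.number_of_tokens  # assumed key

device = get_device()
model = MODEL_FACTORY[params.get('model_fn', 'paccmann')](params).to(device)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), 'trainable parameters')
```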

examples/IC50/requirements.txt

Lines changed: 0 additions & 4 deletions
This file was deleted.

examples/IC50/test_paccmann.py

Lines changed: 219 additions & 0 deletions
New file:

```python
#!/usr/bin/env python3
"""Test PaccMann predictor."""
import argparse
import json
import logging
import os
import pickle
import sys
from copy import deepcopy

import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from paccmann_predictor.models import MODEL_FACTORY
from paccmann_predictor.utils.hyperparams import OPTIMIZER_FACTORY
from paccmann_predictor.utils.utils import get_device
from pytoda.datasets import DrugSensitivityDataset
from pytoda.smiles.smiles_language import SMILESTokenizer
from scipy.stats import pearsonr

# setup logging
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument(
    'test_sensitivity_filepath', type=str,
    help='Path to the drug sensitivity (IC50) data.'
)
parser.add_argument(
    'gep_filepath', type=str,
    help='Path to the gene expression profile data.'
)
parser.add_argument(
    'smi_filepath', type=str,
    help='Path to the SMILES data.'
)
parser.add_argument(
    'gene_filepath', type=str,
    help='Path to a pickle object containing list of genes.'
)
parser.add_argument(
    'smiles_language_filepath', type=str,
    help='Path to a folder with SMILES language .json files.'
)
parser.add_argument(
    'model_filepath', type=str,
    help='Path to the stored model.'
)
parser.add_argument(
    'predictions_filepath', type=str,
    help='Path to the predictions.'
)
parser.add_argument(
    'params_filepath', type=str,
    help='Path to the parameter file.'
)
# yapf: enable


def main(
    test_sensitivity_filepath, gep_filepath,
    smi_filepath, gene_filepath, smiles_language_filepath, model_filepath,
    predictions_filepath, params_filepath
):

    logger = logging.getLogger('test')
    # Process parameter file:
    params = {}
    with open(params_filepath) as fp:
        params.update(json.load(fp))

    # Prepare the dataset
    logger.info("Start data preprocessing...")

    # Load SMILES language
    smiles_language = SMILESTokenizer.from_pretrained(smiles_language_filepath)
    smiles_language.set_encoding_transforms(
        add_start_and_stop=params.get('add_start_and_stop', True),
        padding=params.get('padding', True),
        padding_length=params.get('smiles_padding_length', None)
    )
    test_smiles_language = deepcopy(smiles_language)
    smiles_language.set_smiles_transforms(
        augment=params.get('augment_smiles', False),
        canonical=params.get('smiles_canonical', False),
        kekulize=params.get('smiles_kekulize', False),
        all_bonds_explicit=params.get('smiles_bonds_explicit', False),
        all_hs_explicit=params.get('smiles_all_hs_explicit', False),
        remove_bonddir=params.get('smiles_remove_bonddir', False),
        remove_chirality=params.get('smiles_remove_chirality', False),
        selfies=params.get('selfies', False),
        sanitize=params.get('selfies', False)
    )
    test_smiles_language.set_smiles_transforms(
        augment=False,
        canonical=params.get('test_smiles_canonical', False),
        kekulize=params.get('smiles_kekulize', False),
        all_bonds_explicit=params.get('smiles_bonds_explicit', False),
        all_hs_explicit=params.get('smiles_all_hs_explicit', False),
        remove_bonddir=params.get('smiles_remove_bonddir', False),
        remove_chirality=params.get('smiles_remove_chirality', False),
        selfies=params.get('selfies', False),
        sanitize=params.get('selfies', False)
    )

    # Load the gene list
    with open(gene_filepath, 'rb') as f:
        gene_list = pickle.load(f)

    # Assemble test dataset
    test_dataset = DrugSensitivityDataset(
        drug_sensitivity_filepath=test_sensitivity_filepath,
        smi_filepath=smi_filepath,
        gene_expression_filepath=gep_filepath,
        smiles_language=test_smiles_language,
        gene_list=gene_list,
        drug_sensitivity_min_max=params.get('drug_sensitivity_min_max', True),
        gene_expression_standardize=params.get(
            'gene_expression_standardize', True
        ),
        gene_expression_min_max=params.get('gene_expression_min_max', False),
        gene_expression_processing_parameters=params.get(
            'gene_expression_processing_parameters', {}
        ),
        device=torch.device(params.get('dataset_device', 'cpu')),
        iterate_dataset=False
    )
    test_loader = torch.utils.data.DataLoader(
        dataset=test_dataset,
        batch_size=params['batch_size'],
        shuffle=False,
        drop_last=False,
        num_workers=params.get('num_workers', 0)
    )
    logger.info(
        f'Test dataset has {len(test_dataset)} samples with {len(test_loader)} batches'
    )

    device = get_device()
    logger.info(
        f'Device for data loader is {test_dataset.device} and for '
        f'model is {device}'
    )

    model_name = params.get('model_fn', 'paccmann')
    model = MODEL_FACTORY[model_name](params).to(device)
    model._associate_language(smiles_language)
    try:
        logger.info(f'Attempting to restore model from {model_filepath}...')
        model.load(model_filepath, map_location=device)
    except Exception:
        raise ValueError(f'Error in restoring model from {model_filepath}!')

    # Define optimizer
    optimizer = (
        OPTIMIZER_FACTORY[params.get('optimizer', 'Adam')]
        (model.parameters(), lr=params.get('lr', 0.01))
    )

    num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    params.update({'number_of_parameters': num_params})
    logger.info(f'Number of parameters {num_params}')

    # Start testing
    logger.info('Testing about to start... \n')
    model.eval()

    with torch.no_grad():
        test_loss = 0
        predictions = []
        # gene_attentions = []
        # epistemic_confs = []
        # aleatoric_confs = []
        labels = []
        for ind, (smiles, gep, y) in tqdm(enumerate(test_loader)):
            y_hat, pred_dict = model(
                torch.squeeze(smiles.to(device)), gep.to(device), confidence=False
            )
            predictions.extend(list(y_hat.detach().cpu().squeeze().numpy()))
            # gene_attentions.append(pred_dict['gene_attention'])
            # epistemic_confs.append(pred_dict['epistemic_confidence'])
            # aleatoric_confs.append(pred_dict['aleatoric_confidence'])
            labels.extend(list(y.detach().cpu().squeeze().numpy()))
            loss = model.loss(y_hat, y.to(device))
            test_loss += loss.item()

    # gene_attentions = np.array([a.cpu().numpy() for atts in gene_attentions for a in atts])
    # epistemic_confs = np.array([c.cpu().numpy() for conf in epistemic_confs for c in conf]).ravel()
    # aleatoric_confs = np.array([c.cpu().numpy() for conf in aleatoric_confs for c in conf]).ravel()
    predictions = np.array(predictions)
    labels = np.array(labels)

    pearson = pearsonr(predictions, labels)[0]
    rmse = np.sqrt(np.mean((predictions - labels)**2))
    loss = test_loss / len(test_loader)
    logger.info(
        f"\t**RESULT**\t loss:{loss:.5f}, Pearson: {pearson:.3f}, RMSE: {rmse:.3f}"
    )

    df = test_dataset.drug_sensitivity_df
    df['prediction'] = predictions
    df.to_csv(predictions_filepath + '.csv')

    # np.save(predictions_filepath+'_gene_attention.npy', gene_attentions)
    # np.save(predictions_filepath+'_epistemic_confidence.npy', epistemic_confs)
    # np.save(predictions_filepath+'_aleatoric_confidence.npy', aleatoric_confs)


if __name__ == '__main__':
    # parse arguments
    args = parser.parse_args()
    # run the testing
    main(
        args.test_sensitivity_filepath,
        args.gep_filepath, args.smi_filepath, args.gene_filepath,
        args.smiles_language_filepath, args.model_filepath,
        args.predictions_filepath, args.params_filepath
    )
```
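The script writes the drug sensitivity dataframe back out with an added `prediction` column to `<predictions_filepath>.csv`. A small sketch (not part of the repo) for inspecting that output follows; `results.csv` corresponds to passing `results` as `predictions_filepath` in the README example, and the `IC50` label column name is an assumption that depends on your sensitivity file.

```python
# Sketch: load the predictions written by examples/IC50/test_paccmann.py and
# recompute the reported metrics. 'results.csv' and the 'IC50' column name are
# assumptions (they depend on the predictions_filepath argument and on your
# drug sensitivity CSV, respectively).
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv('results.csv', index_col=0)
print(df.head())

labels = df['IC50'].to_numpy()             # hypothetical label column
predictions = df['prediction'].to_numpy()  # column added by the test script
print('Pearson r:', pearsonr(predictions, labels)[0])
print('RMSE:', np.sqrt(np.mean((predictions - labels) ** 2)))
```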
