
Commit 4082161

jannisborn, eovchinn, and Ovchinnikova Katja (ko20g613) authored
Merging PaccMannV2 and improved documentation (#19)
* wip: Bumping paccmann training script to new pytoda
* feat: ContextAttention - sequence averaging optional
* feat: PaccMannV2 with ContextAttention on genes added
* feat: Dose response PaccMann model and training script
* dose added at test, predictions and labels ravel
* minor fix for computing pearsonr for dataset.device == cuda mode
* minor fixes for using gpu and default version of torch, params changed
* test_paccmann_dose.py added for applying trained model to new data
* shuffle and drop_last set to False for validation and testing
* fix related to torch version update
* minor output path fix
* smiles_language for test_dataset changed to test_smiles_language
* save gene attention
* Confidence in dose model (#11)
* refactor: set up logging in model classes and adapt confidence logic to tuples
* feat: Add confidence computation in dose model
* confidence added in the testing script (Co-authored-by: Ovchinnikova Katja (ko20g613) <[email protected]>)
* script for finetuning a model added
* minor fix
* feat: KNN predictor for DrugResponse (#12)
* refactor: KNN operates on list instead of DF
* script to run knn model
* knn_dose model added
* minor fix
* out of loop calc of cell and drug distances
* index col added
* code optimized, dose loop replaced with vectorization
* default k=3
* minor output fix
* logging improved
* fixed sorting bug
* add normalization
* script for testing IC50 model (Co-authored-by: eovchinn <[email protected]>, Ovchinnikova Katja (ko20g613) <[email protected]>)
* wip: KNN modularization (FP and chirality type)
* Revert "refactor: remove additional examples" (reverts commit 77ddae9)
* feat: Testing script for latest paccmann models
* wip: newest pytoda eval script
* Example (#18)
* doc: Beautify README
* ci: set up GA (#15)
* doc: ascii compatible bibtex [skip ci]
* rebase to dev
* wip: remove dose specific stuff
* wip: remove kinase files and fix CI
* fix: update model factory

Co-authored-by: eovchinn <[email protected]>
Co-authored-by: Ovchinnikova Katja (ko20g613) <[email protected]>
1 parent 907dc06 commit 4082161

File tree

16 files changed: +1670 -180 lines changed

.github/workflows/build.yml

Lines changed: 0 additions & 1 deletion
```diff
@@ -57,7 +57,6 @@ jobs:
       - name: IC50 - Install dependencies and run tests
         run: |
           python3 -m pip install --upgrade pip
-          pip3 install --no-cache-dir -r examples/IC50/requirements.txt
           pip3 install --no-deps .
           python3 -c "import paccmann_predictor"
           python3 examples/IC50/train_paccmann.py -h
```

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -118,3 +118,6 @@ ENV/
 
 # trained models
 /models
+
+# IDE
+.vscode/
```

README.md

Lines changed: 28 additions & 23 deletions
````diff
@@ -10,44 +10,49 @@ anticancer drug sensitivity prediction and drug target affinity prediction. Plea
 
 - [_Toward explainable anticancer compound sensitivity prediction via multimodal attention-based convolutional encoders_](https://doi.org/10.1021/acs.molpharmaceut.9b00520) (*Molecular Pharmaceutics*, 2019). This is the original paper on IC50 prediction using drug properties and tissue-specific cell line information (gene expression profiles). While the original code was written in `tensorflow` and is available [here](https://github.com/drugilsberg/paccmann), this is the `pytorch` implementation of the best PaccMann architecture (multiscale convolutional encoder).
 
-- [Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2](https://iopscience.iop.org/article/10.1088/2632-2153/abe808) (_Machine Learning: Science and Technology_, 2021). In there, we propose a slightly modified version to predict drug-target binding affinities based on protein sequences and SMILES
-
-
-*NOTE*: PaccMann acronyms "Prediction of AntiCancer Compound sensitivity with Multi-modal Attention-based Neural Networks".
 
 **PaccMann for affinity prediction:**
-![Graphical abstract](https://github.com/PaccMann/paccmann_predictor/blob/master/assets/paccmann.png "Graphical abstract")
-
-
-## Requirements
+- [Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2](https://iopscience.iop.org/article/10.1088/2632-2153/abe808) (_Machine Learning: Science and Technology_, 2021). In there, we propose a slightly modified version to predict drug-target binding affinities based on protein sequences and SMILES
 
-- `conda>=3.7`
+![Graphical abstract](https://github.com/PaccMann/paccmann_predictor/blob/master/assets/paccmann.png "Graphical abstract")
 
 ## Installation
-
 The library itself has few dependencies (see [setup.py](setup.py)) with loose requirements.
-To run the example training script we provide environment files under `examples/IC50/`.
-
-Create a conda environment:
-
+First, set up the environment as follows:
 ```sh
 conda env create -f examples/IC50/conda.yml
-```
-
-Activate the environment:
-
-```sh
 conda activate paccmann_predictor
+pip install -e .
 ```
 
-Install in editable mode for development:
 
-```sh
-pip install -e .
+## Evaluate pretrained drug sensitivty model on your own data
+First, please consider using our public [PaccMann webservice](https://ibm.biz/paccmann-aas) as described in the [NAR paper](https://academic.oup.com/nar/article/48/W1/W502/5836770).
+
+To use our pretrained model, please download the model from: https://ibm.biz/paccmann-data (just download `models/single_pytorch_model`).
+For example, assuming that you downloaded this model in a directory called `single_pytorch_model`, the data from https://ibm.box.com/v/paccmann-pytoda-data in folders `data` and `splitted_data` the following command should work:
+```console
+(paccmann_predictor) $ python examples/IC50/test_paccmann.py \
+    splitted_data/gdsc_cell_line_ic50_test_fraction_0.1_id_997_seed_42.csv \
+    data/gene_expression/gdsc-rnaseq_gene-expression.csv \
+    data/smiles/gdsc.smi \
+    data/2128_genes.pkl \
+    single_pytorch_model/smiles_language \
+    single_pytorch_model/weights/best_mse_paccmann_v2.pt \
+    results \
+    single_pytorch_model/model_params.json
 ```
+*NOTE*: If you bring your own data, please make sure to provide the omic data for the 2128 genes specified in `data/2128_genes.pkl`. Your omic data (here it is `data/gene_expression/gdsc-rnaseq_gene-expression.csv`) can contain more columns and it does not need to follow the order of the pickled gene list. But please dont change this pickle file. Also note that this is PaccMannV2 which is slightly improved compared to the paper version (context attention on both modalities).
 
-## Example usage
+## Finetuning on your own data
+You can also **finetune** our pretrained model on your data instead of training a model from scratch. For that, please follow the instruction below for training on scratch and just set:
+- `model_path` --> directory where the `single_pytorch_model` is stored
+- `training_name` --> this should be `single_pytorch_model`
+- `params_filepath` --> `base_path/single_pytorch_model/model_params.json`
 
+
+## Training a model from scratch
+To run the example training script we provide environment files under `examples/IC50/`.
 In the `examples` directory is a training script [train_paccmann.py](./examples/IC50/train_paccmann.py) that makes use
 of `paccmann_predictor`.
 
````
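The *NOTE* in the diff above stresses that your own omic data must cover the 2128 genes listed in `data/2128_genes.pkl` (extra columns and a different column order are fine). Below is a small pre-flight check, as a sketch only: the file name `my_gene-expression.csv` is a placeholder, and the genes-as-columns layout mirrors the provided `gdsc-rnaseq_gene-expression.csv`.

```python
# Sketch (not part of the repo): verify that your own gene expression file
# covers all genes required by the pretrained model before running
# examples/IC50/test_paccmann.py. 'my_gene-expression.csv' is a placeholder.
import pickle

import pandas as pd

with open('data/2128_genes.pkl', 'rb') as f:
    gene_list = pickle.load(f)  # list of required gene identifiers

gep = pd.read_csv('my_gene-expression.csv', index_col=0)  # samples x genes
missing = set(gene_list) - set(gep.columns)
print(f'{len(missing)} of {len(gene_list)} required genes are missing')
```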

examples/IC50/conda.yml

Lines changed: 5 additions & 2 deletions
```diff
@@ -6,7 +6,10 @@ dependencies:
   - python>=3.6,<3.8
   - pip>=19.1
   - pip:
-    - pytoda @ git+https://github.com/PaccMann/paccmann_datasets@0.0.3
+    - pytoda==1.0.0
     - numpy>=1.14.3
     - scipy>=1.3.1
-    - torch==1.0.1
+    - torch>=1.7.1
+    - tqdm
+    - pandas
+
```
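Because this bump changes the pinned `pytoda` version and the minimum `torch` version, an environment created from the old `conda.yml` will not match the new requirements. A quick sketch (not part of the repo) for checking what is actually installed in the active environment:

```python
# Sketch: confirm the active environment matches the updated pins in
# examples/IC50/conda.yml (pytoda==1.0.0, torch>=1.7.1).
import pkg_resources

for package in ('pytoda', 'torch', 'numpy', 'scipy', 'tqdm', 'pandas'):
    try:
        print(package, pkg_resources.get_distribution(package).version)
    except pkg_resources.DistributionNotFound:
        print(package, 'NOT INSTALLED')
```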
examples/IC50/paccmann_v2_params.json

Lines changed: 57 additions & 0 deletions
New file:

```json
{
    "drug_sensitivity_min_max": true,
    "augment_smiles": true,
    "smiles_start_stop_token": true,
    "number_of_genes": 2128,
    "smiles_padding_length": 512,
    "stacked_dense_hidden_sizes": [1024, 512],
    "activation_fn": "relu",
    "dropout": 0.5,
    "batch_norm": true,
    "filters": [64, 64, 64],
    "molecule_heads": [4, 4, 4, 4],
    "gene_heads": [2, 2, 2, 2],
    "smiles_embedding_size": 16,
    "kernel_sizes": [[3, 16], [5, 16], [11, 16]],
    "smiles_attention_size": 64,
    "gene_attention_size": 1,
    "embed_scale_grad": false,
    "final_activation": true,
    "batch_size": 256,
    "lr": 0.01,
    "optimizer": "adam",
    "loss_fn": "mse",
    "epochs": 10,
    "save_model": 25,
    "dataset_device": "cpu"
}
```
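These hyperparameters are what the training and testing scripts pass straight to the model factory. Below is a minimal sketch (not part of the repository) of building a model from this file, mirroring how `examples/IC50/test_paccmann.py` does it; the `smiles_vocabulary_size` key and the `single_pytorch_model/smiles_language` path are assumptions based on the scripts and README in this commit.

```python
# Minimal sketch: instantiate a model from the new parameter file, following
# examples/IC50/test_paccmann.py. The scripts add data-dependent entries to
# `params` (notably the SMILES vocabulary size) before construction, so a
# pretrained smiles_language directory is assumed to be available here.
import json

from paccmann_predictor.models import MODEL_FACTORY
from paccmann_predictor.utils.utils import get_device
from pytoda.smiles.smiles_language import SMILESTokenizer

with open('examples/IC50/paccmann_v2_params.json') as fp:
    params = json.load(fp)

# Assumed path: the smiles_language folder shipped with single_pytorch_model.
smiles_language = SMILESTokenizer.from_pretrained('single_pytorch_model/smiles_language')
params['smiles_vocabulary_size'] = smiles_language.number_of_tokens  # assumed key

device = get_device()
model = MODEL_FACTORY[params.get('model_fn', 'paccmann')](params).to(device)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), 'trainable parameters')
```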

examples/IC50/requirements.txt

Lines changed: 0 additions & 4 deletions
This file was deleted.

examples/IC50/test_paccmann.py

Lines changed: 219 additions & 0 deletions
New file:

```python
#!/usr/bin/env python3
"""Test PaccMann predictor."""
import argparse
import json
import logging
import os
import pickle
import sys
from copy import deepcopy

import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from paccmann_predictor.models import MODEL_FACTORY
from paccmann_predictor.utils.hyperparams import OPTIMIZER_FACTORY
from paccmann_predictor.utils.utils import get_device
from pytoda.datasets import DrugSensitivityDataset
from pytoda.smiles.smiles_language import SMILESTokenizer
from scipy.stats import pearsonr

# setup logging
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument(
    'test_sensitivity_filepath', type=str,
    help='Path to the drug sensitivity (IC50) data.'
)
parser.add_argument(
    'gep_filepath', type=str,
    help='Path to the gene expression profile data.'
)
parser.add_argument(
    'smi_filepath', type=str,
    help='Path to the SMILES data.'
)
parser.add_argument(
    'gene_filepath', type=str,
    help='Path to a pickle object containing list of genes.'
)
parser.add_argument(
    'smiles_language_filepath', type=str,
    help='Path to a folder with SMILES language .json files.'
)
parser.add_argument(
    'model_filepath', type=str,
    help='Path to the stored model.'
)
parser.add_argument(
    'predictions_filepath', type=str,
    help='Path to the predictions.'
)
parser.add_argument(
    'params_filepath', type=str,
    help='Path to the parameter file.'
)
# yapf: enable


def main(
    test_sensitivity_filepath, gep_filepath,
    smi_filepath, gene_filepath, smiles_language_filepath, model_filepath,
    predictions_filepath, params_filepath
):

    logger = logging.getLogger('test')
    # Process parameter file:
    params = {}
    with open(params_filepath) as fp:
        params.update(json.load(fp))

    # Prepare the dataset
    logger.info("Start data preprocessing...")

    # Load SMILES language
    smiles_language = SMILESTokenizer.from_pretrained(smiles_language_filepath)
    smiles_language.set_encoding_transforms(
        add_start_and_stop=params.get('add_start_and_stop', True),
        padding=params.get('padding', True),
        padding_length=params.get('smiles_padding_length', None)
    )
    test_smiles_language = deepcopy(smiles_language)
    smiles_language.set_smiles_transforms(
        augment=params.get('augment_smiles', False),
        canonical=params.get('smiles_canonical', False),
        kekulize=params.get('smiles_kekulize', False),
        all_bonds_explicit=params.get('smiles_bonds_explicit', False),
        all_hs_explicit=params.get('smiles_all_hs_explicit', False),
        remove_bonddir=params.get('smiles_remove_bonddir', False),
        remove_chirality=params.get('smiles_remove_chirality', False),
        selfies=params.get('selfies', False),
        sanitize=params.get('selfies', False)
    )
    test_smiles_language.set_smiles_transforms(
        augment=False,
        canonical=params.get('test_smiles_canonical', False),
        kekulize=params.get('smiles_kekulize', False),
        all_bonds_explicit=params.get('smiles_bonds_explicit', False),
        all_hs_explicit=params.get('smiles_all_hs_explicit', False),
        remove_bonddir=params.get('smiles_remove_bonddir', False),
        remove_chirality=params.get('smiles_remove_chirality', False),
        selfies=params.get('selfies', False),
        sanitize=params.get('selfies', False)
    )

    # Load the gene list
    with open(gene_filepath, 'rb') as f:
        gene_list = pickle.load(f)

    # Assemble test dataset
    test_dataset = DrugSensitivityDataset(
        drug_sensitivity_filepath=test_sensitivity_filepath,
        smi_filepath=smi_filepath,
        gene_expression_filepath=gep_filepath,
        smiles_language=test_smiles_language,
        gene_list=gene_list,
        drug_sensitivity_min_max=params.get('drug_sensitivity_min_max', True),
        gene_expression_standardize=params.get(
            'gene_expression_standardize', True
        ),
        gene_expression_min_max=params.get('gene_expression_min_max', False),
        gene_expression_processing_parameters=params.get(
            'gene_expression_processing_parameters', {}
        ),
        device=torch.device(params.get('dataset_device', 'cpu')),
        iterate_dataset=False
    )
    test_loader = torch.utils.data.DataLoader(
        dataset=test_dataset,
        batch_size=params['batch_size'],
        shuffle=False,
        drop_last=False,
        num_workers=params.get('num_workers', 0)
    )
    logger.info(
        f'Test dataset has {len(test_dataset)} samples with {len(test_loader)} batches'
    )

    device = get_device()
    logger.info(
        f'Device for data loader is {test_dataset.device} and for '
        f'model is {device}'
    )

    model_name = params.get('model_fn', 'paccmann')
    model = MODEL_FACTORY[model_name](params).to(device)
    model._associate_language(smiles_language)
    try:
        logger.info(f'Attempting to restore model from {model_filepath}...')
        model.load(model_filepath, map_location=device)
    except Exception:
        raise ValueError(f'Error in restoring model from {model_filepath}!')

    # Define optimizer
    optimizer = (
        OPTIMIZER_FACTORY[params.get('optimizer', 'Adam')]
        (model.parameters(), lr=params.get('lr', 0.01))
    )

    num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    params.update({'number_of_parameters': num_params})
    logger.info(f'Number of parameters {num_params}')

    # Start testing
    logger.info('Testing about to start... \n')
    model.eval()

    with torch.no_grad():
        test_loss = 0
        predictions = []
        # gene_attentions = []
        # epistemic_confs = []
        # aleatoric_confs = []
        labels = []
        for ind, (smiles, gep, y) in tqdm(enumerate(test_loader)):
            y_hat, pred_dict = model(
                torch.squeeze(smiles.to(device)), gep.to(device), confidence=False
            )
            predictions.extend(list(y_hat.detach().cpu().squeeze().numpy()))
            # gene_attentions.append(pred_dict['gene_attention'])
            # epistemic_confs.append(pred_dict['epistemic_confidence'])
            # aleatoric_confs.append(pred_dict['aleatoric_confidence'])
            labels.extend(list(y.detach().cpu().squeeze().numpy()))
            loss = model.loss(y_hat, y.to(device))
            test_loss += loss.item()

    # gene_attentions = np.array([a.cpu().numpy() for atts in gene_attentions for a in atts])
    # epistemic_confs = np.array([c.cpu().numpy() for conf in epistemic_confs for c in conf]).ravel()
    # aleatoric_confs = np.array([c.cpu().numpy() for conf in aleatoric_confs for c in conf]).ravel()
    predictions = np.array(predictions)
    labels = np.array(labels)

    pearson = pearsonr(predictions, labels)[0]
    rmse = np.sqrt(np.mean((predictions - labels)**2))
    loss = test_loss / len(test_loader)
    logger.info(
        f"\t**RESULT**\t loss:{loss:.5f}, Pearson: {pearson:.3f}, RMSE: {rmse:.3f}"
    )

    df = test_dataset.drug_sensitivity_df
    df['prediction'] = predictions
    df.to_csv(predictions_filepath + '.csv')

    # np.save(predictions_filepath+'_gene_attention.npy', gene_attentions)
    # np.save(predictions_filepath+'_epistemic_confidence.npy', epistemic_confs)
    # np.save(predictions_filepath+'_aleatoric_confidence.npy', aleatoric_confs)


if __name__ == '__main__':
    # parse arguments
    args = parser.parse_args()
    # run the testing
    main(
        args.test_sensitivity_filepath,
        args.gep_filepath, args.smi_filepath, args.gene_filepath,
        args.smiles_language_filepath, args.model_filepath,
        args.predictions_filepath, args.params_filepath
    )
```
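The script writes the drug sensitivity dataframe back out with an added `prediction` column to `<predictions_filepath>.csv`. A small sketch (not part of the repo) for inspecting that output follows; `results.csv` corresponds to passing `results` as `predictions_filepath` in the README example, and the `IC50` label column name is an assumption that depends on your sensitivity file.

```python
# Sketch: load the predictions written by examples/IC50/test_paccmann.py and
# recompute the reported metrics. 'results.csv' and the 'IC50' column name are
# assumptions (they depend on the predictions_filepath argument and on your
# drug sensitivity CSV, respectively).
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv('results.csv', index_col=0)
print(df.head())

labels = df['IC50'].to_numpy()             # hypothetical label column
predictions = df['prediction'].to_numpy()  # column added by the test script
print('Pearson r:', pearsonr(predictions, labels)[0])
print('RMSE:', np.sqrt(np.mean((predictions - labels) ** 2)))
```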
