
Commit 9739110

Committed by John Ziegler
clean up, added attention pooling, latest models, and tests

1 parent 60feb2e

16 files changed, +121 −133 lines

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -1,6 +1,9 @@
 *.pyc
 .gitignore
+__pycache__
 tests/__pycache__
+main/__pycache__
+model__pycache__
 utils/microsatellites.list
 data/generate_vectors/constants.py
 main/constants.py

README.md

Lines changed: 15 additions & 57 deletions
@@ -3,7 +3,7 @@
 A deep, multiple instance learning based classifier for identifying Microsatellite Instability in Next-Generation Sequencing Results.
 
 
-Made with :heart: and lots of :coffee: by ClinBx @ Memorial Sloan Kettering Cancer Center
+Made with :heart: and lots of :coffee: by the Clinical Bioinformatics Team @ Memorial Sloan Kettering Cancer Center
 
 Preprint: https://www.biorxiv.org/content/10.1101/2020.09.16.299925v1
 
@@ -30,31 +30,20 @@ cd mimsi #Root directory of the repository that includes the setup.py script.
 pip install .
 ```
 
-Following successfuly installation, the functions required to run MiMsi should be directly registered as command-line accessible tools in your environment,for convenience. You can either call the scripts/modules explicitly or as command-line tools. Examples for both are provided below.
-
-### Required Libraries
-
-MiMSI is implemented in (Py)Torch and has been tested on Python 2.7, 3.5 and 3.6. We've included a requirements.txt file and setup.py script for use with pip.
-
-Just note that the following packages are required:
-* (Py)Torch
-* Pandas
-* Numpy
-* Sklearn
-* Pysam
-
-
-If you wish to install using the requirements.txt provided, run:
+Following successful installation, the functions required to run MiMSI should be directly registered as command-line accessible tools in your environment.
 
+You can run our test suite via:
 ```
-pip install -r requirements.txt
+make test
 ```
 
+
+
 ## Running a Full Analysis
 
 MiMSI is comprised of two main steps. The first is an NGS Vector creation stage, where the aligned reads are encoded into a form that can be interpreted by our model. The second stage is the actual evaluation, where we input the collection of microsatellite instances for a sample into the pre-trained model to determine a classification for the collection. For convenience, we've packaged both of these stages into a python script that executes them together. If you'd like to run the steps individually (perhaps to train the model from scratch) please see the section "Running Analysis Components Separately" below.
 
-More details on the MiMSI methods are available in out pre-print.
+If you're interested in learning more about the implementation of MiMSI please check out our [pre-print](https://www.biorxiv.org/content/10.1101/2020.09.16.299925v1)!
 
 
 ### Required files
@@ -63,14 +52,14 @@ MiMSI requires two main inputs - the tumor/normal pair of ```.bam``` files for t
 
 #### Tumor & Normal .bam files
 
-If you are only analyzing one sample, the tumor and normal bam files can be specified as command line args (see "Testing an Individual Sample" below). Single sample analysis is performed only to evaluate the MSI/MSS status of the sample. Therefore, the label for that sample will be defaulted to '-1' (see below for more details regarding labels). Additionally, if you would like to process multiple samples in one batch the tool can accept a tab-separated file listing each sample with headers as shown below:
+If you are only analyzing one sample, the tumor and normal bam files can be specified as command line args (see "Testing an Individual Sample"). If you would like to process multiple samples in one batch the tool can accept a tab-separated file listing each sample with the following headers:
 
 ```
 Tumor_ID Tumor_Bam Normal_Bam Normal_ID Label
 sample-id-1 /tumor/bam/file/location /normal/bam/file/location normal1 -1
 sample-id-2 /tumor/bam/file/location2 /normal/bam/file/location2 normal2 -1
 ```
-The columns can be in any order as long as the column headers match the headers shown above. Note that the character '_' is not allowed in Tumor_ID and Normal_ID values. The Normal_ID column/values are optional. If Normal_ID is given for a tumor/normal pair, the npy array and final results will include the Normal_ID. Otherwise, only Tumor_ID will be reported. The Tumor_Bam and Normal_Bam columns specify the full filesystem path for the tumor and normal ```.bam``` files, respectively. The Label column is also optional. Default value will be set to '-1'. See below for explanations of the labels. The vectors and final classification results will be named according to the Tumor_ID, and optionally, Normal_ID columns, so be sure to use something that is unique and easily recognizable to avoid errors.
+The columns can be in any order as long as the column headers match the headers shown above. Note that the character '_' is not allowed in Tumor_ID and Normal_ID values. The Normal_ID column/values are optional. If Normal_ID is given for a tumor/normal pair, the npy array and final results will include the Normal_ID. Otherwise, only Tumor_ID will be reported. The Tumor_Bam and Normal_Bam columns specify the full filesystem path for the tumor and normal ```.bam``` files, respectively. The label field is not necessary if you are evaluating cases, but it's included in the event that you would like to train & test a custom model on your own dataset (see "Training the Model from Scratch"). The vectors and final classification results will be named according to the Tumor_ID, and optionally, Normal_ID columns, so be sure to use something that is unique and easily recognizable to avoid errors.
 
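If you script around MiMSI, a case list in the format above can be sanity-checked before launching a run. A minimal sketch, assuming pandas and a hypothetical ```case_list.txt```; this helper is illustrative and not part of MiMSI itself:

```python
import pandas as pd

# Hypothetical pre-flight check for a MiMSI case list (tab-separated).
cases = pd.read_csv("case_list.txt", sep="\t")

required = {"Tumor_ID", "Tumor_Bam", "Normal_Bam"}
missing = required - set(cases.columns)
if missing:
    raise ValueError("case list is missing required columns: {}".format(missing))

# The '_' character is not allowed in Tumor_ID / Normal_ID values.
for col in ("Tumor_ID", "Normal_ID"):
    if col in cases.columns and cases[col].astype(str).str.contains("_").any():
        raise ValueError("underscores are not allowed in {} values".format(col))

print("{} samples look OK".format(len(cases)))
```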
 #### List of Microsatellite Regions
 
@@ -81,32 +70,17 @@ A list of microsatellite regions needs to be provided as a tab-separated text fi
 To run an individual sample,
 
 ```
-cd /path/to/mimsi
-python -m analyze --tumor-bam /path/to/tumor.bam --normal-bam /path/to/normal.bam --case-id my-unique-tumor-id --norm-case-id my-unique-normal-id --microsatellites-list /path/to/microsatellites_file --save-location /path/to/save/ --model ./model/mimsi_mskcc_impact_v0_2_0.model --save
-```
-
-Or as command-line tool:
-
-```
-analyze --tumor-bam /path/to/tumor.bam --normal-bam /path/to/normal.bam --case-id my-unique-tumor-id --norm-case-id my-unique-normal-id --microsatellites-list /path/to/microsatellites_file --save-location /path/to/save/ --model ./model/mimsi_mskcc_impact_v0_2_0.model --save
+analyze --tumor-bam /path/to/tumor.bam --normal-bam /path/to/normal.bam --case-id my-unique-tumor-id --norm-case-id my-unique-normal-id --microsatellites-list ./test/microsatellites_impact_only.list --save-location /path/to/save/ --model ./model/mi_msi_v0_4_0_200x.model --save
 ```
 
 The tumor-bam and normal-bam args specify the .bam files the pipeline will use when building the input vectors. These vectors will be saved to disk in the location indicated by the ```save-location``` arg. WARNING: Existing files in this directory in the formats *_locations.npy and *_data.npy will be deleted! The format for the filename of the built vectors is ```{case-id}_data_{label}.npy``` and ```{case-id}_locations_{label}.npy```. The ```data``` file contains the N x coverage x 40 x 3 vector for the sample, where N is the number of microsatellite loci that were successfully converted to vectors. The ```locations``` file is a list of all the loci used to build the ```data``` vector, with the same ordering. These locations are saved in the event that you'd like to investigate how different loci are processed. The label is defaulted to -1 for evaluation cases, and won't be used in any reported calculations. The label assignment is only relevant for training/testing the model from scratch (see below section on Training). The results of the classifier are printed to standard output, so you can capture the raw probability score by saving the output as shown in our example (single_case_analysis.out). If you want to do further processing on results, add the ```--save``` param to automatically save the classification score to disk. The evaluation script repeats 10x for each sample, that way you can create a confidence interval for each prediction.
 
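For downstream inspection, the saved vectors can be loaded directly with numpy. A minimal sketch, where the file names are hypothetical examples following the ```{case-id}_data_{label}.npy``` convention described above:

```python
import numpy as np

# Hypothetical file names following the {case-id}_data_{label}.npy /
# {case-id}_locations_{label}.npy convention for an unlabeled (-1) case.
data = np.load("my-unique-tumor-id_data_-1.npy")
loci = np.load("my-unique-tumor-id_locations_-1.npy", allow_pickle=True)

# data should be N x coverage x 40 x 3, where N is the number of
# microsatellite loci that were successfully converted to vectors.
print("data shape:", data.shape)
print("first locus used:", loci[0])
```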
-This pipeline can be run on both GPU and CPU setups. We've also provided an example lsf submission file - ```/utils/single-sample-full.lsf```. Just keep in mind that your institution's lsf setup may differ from our example.
 
 ### Running samples in bulk
 Running a batch of samples is extremely similar, just provide a case list file rather than an individual tumor/normal pair,
 
 ```
-cd /path/to/mimsi
-python -m analyze --case-list /path/to/case_list.txt --microsatellites-list /path/to/microsatellites_file --save-location /path/to/save/ --model ./model/mimsi_mskcc_impact_v0_2_0.model --save
-```
-
-Or as command-line tool
-
-```
-analyze --case-list /path/to/case_list.txt --microsatellites-list /path/to/microsatellites_file --save-location /path/to/save/ --model ./model/mimsi_mskcc_impact_v0_2_0.model --save
+analyze --case-list /path/to/case_list.txt --microsatellites-list ./test/microsatellites_impact_only.list --save-location /path/to/save/ --model ./model/mi_msi_v0_4_0_200x.model --save
 ```
 
 
@@ -119,48 +93,32 @@ The analysis pipeline automates these two steps, but if you'd like to run them i
 ### Vector Creation
 In order to utilize our NGS alignments as inputs to a deep model they need to be converted into a vector. The full process will be outlined in our paper if you'd like to dive into the details, but for folks that are just interested in creating the data this step can be run individually via the ```data/generate_vectors/create_data.py``` script.
 
-```
-cd /path/to/mimsi
-python -m data.generate_vectors.create_data --case-list /path/to/case_list.txt --cores 16 --microsatellites-list /path/to/microsatellites_list.txt --save-location /path/to/save/vectors --coverage 100 > generate_vector.out
-```
 
-Or as command-line tool
 ```
-create_data --case-list /path/to/case_list.txt --cores 16 --microsatellites-list /path/to/microsatellites_list.txt --save-location /path/to/save --coverage 100 > generate_vector.out
+create_data --case-list /path/to/case_list.txt --cores 16 --microsatellites-list ./test/microsatellites_impact_only.list --save-location /path/to/save --coverage 100 > generate_vector.out
 ```
 
 
-The ```--is-labeled``` flag is especially important if you plan on using the built vectors for training and testing. If you're only evaluating a sample a default label of -1 will be used, but if you want to train/test you'll need to provide that information to the script so that a relevant loss can be calculated. Labels can be provided as a fourth column in the case list text file, and should take on a value of -1 for MSS cases and 1 for MSI cases. Is also worth noting that this script was designed to be run in a parallel computing environment. The microsatellites list file is broken up into chunks, and separate threads handle processing each chunk. The microsatellite instance vectors for each chunk (if any) are then aggregated to create the full sample. You can modify the number of concurrent slots by setting the ```--cores``` arg. An example lsf submission script is available in ```utils/multi-sample-vector-creation.lsf```
+The ```--is-labeled``` flag is especially important if you plan on using the built vectors for training and testing. If you're only evaluating a sample a default label of -1 will be used, but if you want to train/test you'll need to provide that information to the script so that a relevant loss can be calculated. Labels can be provided as a fourth column in the case list text file, and should take on a value of -1 for MSS cases and 1 for MSI cases. This script was designed to be run in a parallel computing environment. The microsatellites list file is broken up into chunks, and separate threads handle processing each chunk. The microsatellite instance vectors for each chunk (if any) are then aggregated to create the full sample. You can modify the number of concurrent slots by setting the ```--cores``` arg.
 
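As a schematic picture of that chunking strategy (a stand-in sketch, not MiMSI's actual implementation), something like the following divides a microsatellites list across workers and aggregates the per-chunk results:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for the real per-chunk work: building an instance vector
    # for each microsatellite locus in the chunk.
    return ["vector for " + locus for locus in chunk]

if __name__ == "__main__":
    with open("microsatellites.list") as fh:
        loci = [line.strip() for line in fh if line.strip()]

    cores = 16  # analogous to the --cores arg
    size = max(1, len(loci) // cores)
    chunks = [loci[i:i + size] for i in range(0, len(loci), size)]

    with Pool(cores) as pool:
        per_chunk = pool.map(process_chunk, chunks)

    # Aggregate the per-chunk vectors into the full sample.
    vectors = [v for chunk in per_chunk for v in chunk]
    print("{} loci processed in {} chunks".format(len(vectors), len(chunks)))
```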
 ### Evaluation
 If you have a directory of built vectors you can run just the evaluation step using the ```main/evaluate_samples.py``` script.
 
 ```
-cd /path/to/mimsi
-python -m main.evaluate_sample --saved-model ./model/mimsi_mskcc_impact_v0_2_0.model--vector-location /path/to/generated/vectors --save --name "eval_sample" > output_log.txt
-```
-
-Or as command-line tool
-```
-evaluate_sample --saved-model ./model/mimsi_mskcc_impact_v0_2_0.model --vector-location /path/to/generated/vectors --save --name "eval_sample" > output_log.txt
+evaluate_sample --saved-model ./model/mi_msi_v0_4_0_200x.model --vector-location /path/to/generated/vectors --save --name "eval_sample" > output_log.txt
 ```
 
 Just as with the ```analyze.py``` script output will be directed to standard out, and you can save the results to disk via the ```--save``` flag.
 
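Since the evaluation repeats 10x per sample, the captured raw probability scores can be summarized into a rough confidence interval. A minimal sketch with hypothetical scores:

```python
import numpy as np

# Hypothetical values: the 10 raw probability scores captured from a
# single sample's repeated evaluations.
scores = np.array([0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.90, 0.91, 0.87, 0.94])

mean = scores.mean()
# Normal-approximation 95% interval across the repeated runs.
half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
print("MSI probability: {:.3f} +/- {:.3f}".format(mean, half_width))
```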
 ## Training the Model from Scratch
 We've included a pre-trained model file in the ```model/``` directory for evaluation, but if you'd like to try to build a new version from scratch with your own data we've also included our training and testing script in ```main/mi_msi_train_test.py```
 
-```
-cd /path/to/mimsi
-python -m main.mi_msi_train_test --name 'model_name' --train-location /path/to/train/vectors/ --test-location /path/to/train/vectors/ --save 1 > train_test.log
-```
 
-Or as command-line tool
 ```
 mi_msi_train_test --name 'model_name' --train-location /path/to/train/vectors/ --test-location /path/to/train/vectors/ --save 1 > train_test.log
 ```
 
-Just note that you will need labels for any vectors utilized in training and testing. This can be accomplished by setting the ```--labeled``` flag in the Vector Creation script (see section above) and including the labels as a final column in the vector generation case list. The weights of the best performing model will be saved as a Torch ".model" file based on the name specified in the command args. Additionally, key metrics (TPR, FPR, Thresholds, AUROC) and train/test loss information are saved to disk as a numpy array. These metrics are also printed to standard out if you'd prefer to capture the results that way. Training a dataset of ~300 cases with ~1000 instances per case takes approximately 5 hours on 2 Nvidia DGX1 GPUs, see ```utils/train_test.lsf``` for an example lsf submission script.
+Just note that you will need labels for any vectors utilized in training and testing. This can be accomplished by setting the ```--labeled``` flag in the Vector Creation script (see section above) and including the labels as a final column in the vector generation case list. The weights of the best performing model will be saved as a Torch ".model" file based on the name specified in the command args. Additionally, key metrics (TPR, FPR, Thresholds, AUROC) and train/test loss information are saved to disk as a numpy array. These metrics are also printed to standard out if you'd prefer to capture the results that way.
 
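If you'd like to recompute the AUROC from the saved TPR/FPR metrics, a minimal sketch, assuming hypothetical file names for the saved arrays (the actual layout of the metrics on disk may differ):

```python
import numpy as np
from sklearn.metrics import auc

# Hypothetical file names based on the --name arg used during training.
fpr = np.load("model_name_fpr.npy")
tpr = np.load("model_name_tpr.npy")

# AUROC is the area under the ROC curve traced out by (FPR, TPR).
print("AUROC: {:.3f}".format(auc(fpr, tpr)))
```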
 ## Questions, Comments, Collaborations?
 Please feel free to reach out, I'm available via email - [email protected]
