
Commit 9739110

Committed by John Ziegler
clean up, added attention pooling, latest models, and tests

1 parent 60feb2e

16 files changed, +121 −133 lines

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -1,6 +1,9 @@
 *.pyc
 .gitignore
+__pycache__
 tests/__pycache__
+main/__pycache__
+model__pycache__
 utils/microsatellites.list
 data/generate_vectors/constants.py
 main/constants.py

README.md

Lines changed: 15 additions & 57 deletions
@@ -3,7 +3,7 @@
 A deep, multiple instance learning based classifier for identifying Microsatellite Instability in Next-Generation Sequencing Results.
 
 
-Made with :heart: and lots of :coffee: by ClinBx @ Memorial Sloan Kettering Cancer Center
+Made with :heart: and lots of :coffee: by the Clinical Bioinformatics Team @ Memorial Sloan Kettering Cancer Center
 
 Preprint: https://www.biorxiv.org/content/10.1101/2020.09.16.299925v1
 
@@ -30,31 +30,20 @@ cd mimsi #Root directory of the repository that includes the setup.py script.
 pip install .
 ```
 
-Following successfuly installation, the functions required to run MiMsi should be directly registered as command-line accessible tools in your environment,for convenience. You can either call the scripts/modules explicitly or as command-line tools. Examples for both are provided below.
-
-### Required Libraries
-
-MiMSI is implemented in (Py)Torch and has been tested on Python 2.7, 3.5 and 3.6. We've included a requirements.txt file and setup.py script for use with pip.
-
-Just note that the following packages are required:
-* (Py)Torch
-* Pandas
-* Numpy
-* Sklearn
-* Pysam
-
-
-If you wish to install using the requirements.txt provided, run:
+Following successful installation, the functions required to run MiMSI should be directly registered as command-line accessible tools in your environment.
 
+You can run our test suite via:
 ```
-pip install -r requirements.txt
+make test
 ```
 
+
+
 ## Running a Full Analysis
 
 MiMSI is comprised of two main steps. The first is an NGS Vector creation stage, where the aligned reads are encoded into a form that can be interpreted by our model. The second stage is the actual evaluation, where we input the collection of microsatellite instances for a sample into the pre-trained model to determine a classification for the collection. For convenience, we've packaged both of these stages into a python script that executes them together. If you'd like to run the steps individually (perhaps to train the model from scratch) please see the section "Running Analysis Components Separately" below.
 
-More details on the MiMSI methods are available in out pre-print.
+If you're interested in learning more about the implementation of MiMSI please check out our [pre-print](https://www.biorxiv.org/content/10.1101/2020.09.16.299925v1)!
 
 
 ### Required files
@@ -63,14 +52,14 @@ MiMSI requires two main inputs - the tumor/normal pair of ```.bam``` files for t
 
 #### Tumor & Normal .bam files
 
-If you are only analyzing one sample, the tumor and normal bam files can be specified as command line args (see "Testing an Individual Sample" below). Single sample analysis is performed only to evaluate the MSI/MSS status of the sample. Therefore, the label for that sample will be defaulted to '-1' (see below for more details regarding labels). Additionally, if you would like to process multiple samples in one batch the tool can accept a tab-separated file listing each sample with headers as shown below:
+If you are only analyzing one sample, the tumor and normal bam files can be specified as command line args (see "Testing an Individual Sample"). If you would like to process multiple samples in one batch the tool can accept a tab-separated file listing each sample with the following headers:
 
 ```
 Tumor_ID Tumor_Bam Normal_Bam Normal_ID Label
 sample-id-1 /tumor/bam/file/location /normal/bam/file/location normal1 -1
 sample-id-2 /tumor/bam/file/location2 /normal/bam/file/location2 normal2 -1
 ```
-The columns can be in any order as long as the column headers match the headers shown above. Note that the character '_' is not allowed in Tumor_ID and Normal_ID values. The Normal_ID column/values are optional. If Normal_ID is given for a tumor/normal pair, the npy array and final results will include the Normal_ID. Otherwise, only Tumor_ID will be reported. The Tumor_Bam and Normal_Bam columns specify the full filesystem path for the tumor and normal ```.bam``` files, respectively. The Label column is also optional. Default value will be set to '-1'. See below for explanations of the labels. The vectors and final classification results will be named according to the Tumor_ID, and optionally, Normal_ID columns, so be sure to use something that is unique and easily recognizable to avoid errors.
+The columns can be in any order as long as the column headers match the headers shown above. Note that the character '_' is not allowed in Tumor_ID and Normal_ID values. The Normal_ID column/values are optional. If Normal_ID is given for a tumor/normal pair, the npy array and final results will include the Normal_ID. Otherwise, only Tumor_ID will be reported. The Tumor_Bam and Normal_Bam columns specify the full filesystem path for the tumor and normal ```.bam``` files, respectively. The label field is not necessary if you are evaluating cases, but it's included in the event that you would like to train & test a custom model on your own dataset (see "Training the Model from Scratch"). The vectors and final classification results will be named according to the Tumor_ID, and optionally, Normal_ID columns, so be sure to use something that is unique and easily recognizable to avoid errors.
 
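If you script around MiMSI, a case list in the format above can be sanity-checked before launching a run. A minimal sketch, assuming pandas and a hypothetical ```case_list.txt```; this helper is illustrative and not part of MiMSI itself:

```python
import pandas as pd

# Hypothetical pre-flight check for a MiMSI case list (tab-separated).
cases = pd.read_csv("case_list.txt", sep="\t")

required = {"Tumor_ID", "Tumor_Bam", "Normal_Bam"}
missing = required - set(cases.columns)
if missing:
    raise ValueError("case list is missing required columns: {}".format(missing))

# The '_' character is not allowed in Tumor_ID / Normal_ID values.
for col in ("Tumor_ID", "Normal_ID"):
    if col in cases.columns and cases[col].astype(str).str.contains("_").any():
        raise ValueError("underscores are not allowed in {} values".format(col))

print("{} samples look OK".format(len(cases)))
```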
 #### List of Microsatellite Regions
 
@@ -81,32 +70,17 @@ A list of microsatellite regions needs to be provided as a tab-separated text fi
 To run an individual sample,
 
 ```
-cd /path/to/mimsi
-python -m analyze --tumor-bam /path/to/tumor.bam --normal-bam /path/to/normal.bam --case-id my-unique-tumor-id --norm-case-id my-unique-normal-id --microsatellites-list /path/to/microsatellites_file --save-location /path/to/save/ --model ./model/mimsi_mskcc_impact_v0_2_0.model --save
-```
-
-Or as command-line tool:
-
-```
-analyze --tumor-bam /path/to/tumor.bam --normal-bam /path/to/normal.bam --case-id my-unique-tumor-id --norm-case-id my-unique-normal-id --microsatellites-list /path/to/microsatellites_file --save-location /path/to/save/ --model ./model/mimsi_mskcc_impact_v0_2_0.model --save
+analyze --tumor-bam /path/to/tumor.bam --normal-bam /path/to/normal.bam --case-id my-unique-tumor-id --norm-case-id my-unique-normal-id --microsatellites-list ./test/microsatellites_impact_only.list --save-location /path/to/save/ --model ./model/mi_msi_v0_4_0_200x.model --save
 ```
 
 The tumor-bam and normal-bam args specify the .bam files the pipeline will use when building the input vectors. These vectors will be saved to disk in the location indicated by the ```save-location``` arg. WARNING: Existing files in this directory in the formats *_locations.npy and *_data.npy will be deleted! The format for the filename of the built vectors is ```{case-id}_data_{label}.npy``` and ```{case-id}_locations_{label}.npy```. The ```data``` file contains the N x coverage x 40 x 3 vector for the sample, where N is the number of microsatellite loci that were successfully converted to vectors. The ```locations``` file is a list of all the loci used to build the ```data``` vector, with the same ordering. These locations are saved in the event that you'd like to investigate how different loci are processed. The label is defaulted to -1 for evaluation cases, and won't be used in any reported calculations. The label assignment is only relevant for training/testing the model from scratch (see below section on Training). The results of the classifier are printed to standard output, so you can capture the raw probability score by saving the output as shown in our example (single_case_analysis.out). If you want to do further processing on results, add the ```--save``` param to automatically save the classification score to disk. The evaluation script repeats 10x for each sample, that way you can create a confidence interval for each prediction.
 
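For downstream inspection, the saved vectors can be loaded directly with numpy. A minimal sketch, where the file names are hypothetical examples following the ```{case-id}_data_{label}.npy``` convention described above:

```python
import numpy as np

# Hypothetical file names following the {case-id}_data_{label}.npy /
# {case-id}_locations_{label}.npy convention for an unlabeled (-1) case.
data = np.load("my-unique-tumor-id_data_-1.npy")
loci = np.load("my-unique-tumor-id_locations_-1.npy", allow_pickle=True)

# data should be N x coverage x 40 x 3, where N is the number of
# microsatellite loci that were successfully converted to vectors.
print("data shape:", data.shape)
print("first locus used:", loci[0])
```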
-This pipeline can be run on both GPU and CPU setups. We've also provided an example lsf submission file - ```/utils/single-sample-full.lsf```. Just keep in mind that your institution's lsf setup may differ from our example.
 
 ### Running samples in bulk
 Running a batch of samples is extremely similar, just provide a case list file rather than an individual tumor/normal pair,
 
 ```
-cd /path/to/mimsi
-python -m analyze --case-list /path/to/case_list.txt --microsatellites-list /path/to/microsatellites_file --save-location /path/to/save/ --model ./model/mimsi_mskcc_impact_v0_2_0.model --save
-```
-
-Or as command-line tool
-
-```
-analyze --case-list /path/to/case_list.txt --microsatellites-list /path/to/microsatellites_file --save-location /path/to/save/ --model ./model/mimsi_mskcc_impact_v0_2_0.model --save
+analyze --case-list /path/to/case_list.txt --microsatellites-list ./test/microsatellites_impact_only.list --save-location /path/to/save/ --model ./model/mi_msi_v0_4_0_200x.model --save
 ```
 
 
@@ -119,48 +93,32 @@ The analysis pipeline automates these two steps, but if you'd like to run them i
 ### Vector Creation
 In order to utilize our NGS alignments as inputs to a deep model they need to be converted into a vector. The full process will be outlined in our paper if you'd like to dive into the details, but for folks that are just interested in creating the data this step can be run individually via the ```data/generate_vectors/create_data.py``` script.
 
-```
-cd /path/to/mimsi
-python -m data.generate_vectors.create_data --case-list /path/to/case_list.txt --cores 16 --microsatellites-list /path/to/microsatellites_list.txt --save-location /path/to/save/vectors --coverage 100 > generate_vector.out
-```
 
-Or as command-line tool
 ```
-create_data --case-list /path/to/case_list.txt --cores 16 --microsatellites-list /path/to/microsatellites_list.txt --save-location /path/to/save --coverage 100 > generate_vector.out
+create_data --case-list /path/to/case_list.txt --cores 16 --microsatellites-list ./test/microsatellites_impact_only.list --save-location /path/to/save --coverage 100 > generate_vector.out
 ```
 
 
-The ```--is-labeled``` flag is especially important if you plan on using the built vectors for training and testing. If you're only evaluating a sample a default label of -1 will be used, but if you want to train/test you'll need to provide that information to the script so that a relevant loss can be calculated. Labels can be provided as a fourth column in the case list text file, and should take on a value of -1 for MSS cases and 1 for MSI cases. Is also worth noting that this script was designed to be run in a parallel computing environment. The microsatellites list file is broken up into chunks, and separate threads handle processing each chunk. The microsatellite instance vectors for each chunk (if any) are then aggregated to create the full sample. You can modify the number of concurrent slots by setting the ```--cores``` arg. An example lsf submission script is available in ```utils/multi-sample-vector-creation.lsf```
+The ```--is-labeled``` flag is especially important if you plan on using the built vectors for training and testing. If you're only evaluating a sample a default label of -1 will be used, but if you want to train/test you'll need to provide that information to the script so that a relevant loss can be calculated. Labels can be provided as a fourth column in the case list text file, and should take on a value of -1 for MSS cases and 1 for MSI cases. This script was designed to be run in a parallel computing environment. The microsatellites list file is broken up into chunks, and separate threads handle processing each chunk. The microsatellite instance vectors for each chunk (if any) are then aggregated to create the full sample. You can modify the number of concurrent slots by setting the ```--cores``` arg.
 
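As a schematic picture of that chunking strategy (a stand-in sketch, not MiMSI's actual implementation), something like the following divides a microsatellites list across workers and aggregates the per-chunk results:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for the real per-chunk work: building an instance vector
    # for each microsatellite locus in the chunk.
    return ["vector for " + locus for locus in chunk]

if __name__ == "__main__":
    with open("microsatellites.list") as fh:
        loci = [line.strip() for line in fh if line.strip()]

    cores = 16  # analogous to the --cores arg
    size = max(1, len(loci) // cores)
    chunks = [loci[i:i + size] for i in range(0, len(loci), size)]

    with Pool(cores) as pool:
        per_chunk = pool.map(process_chunk, chunks)

    # Aggregate the per-chunk vectors into the full sample.
    vectors = [v for chunk in per_chunk for v in chunk]
    print("{} loci processed in {} chunks".format(len(vectors), len(chunks)))
```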
 ### Evaluation
 If you have a directory of built vectors you can run just the evaluation step using the ```main/evaluate_samples.py``` script.
 
 ```
-cd /path/to/mimsi
-python -m main.evaluate_sample --saved-model ./model/mimsi_mskcc_impact_v0_2_0.model--vector-location /path/to/generated/vectors --save --name "eval_sample" > output_log.txt
-```
-
-Or as command-line tool
-```
-evaluate_sample --saved-model ./model/mimsi_mskcc_impact_v0_2_0.model --vector-location /path/to/generated/vectors --save --name "eval_sample" > output_log.txt
+evaluate_sample --saved-model ./model/mi_msi_v0_4_0_200x.model --vector-location /path/to/generated/vectors --save --name "eval_sample" > output_log.txt
 ```
 
 Just as with the ```analyze.py``` script output will be directed to standard out, and you can save the results to disk via the ```--save``` flag.
 
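Since the evaluation repeats 10x per sample, the captured raw probability scores can be summarized into a rough confidence interval. A minimal sketch with hypothetical scores:

```python
import numpy as np

# Hypothetical values: the 10 raw probability scores captured from a
# single sample's repeated evaluations.
scores = np.array([0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.90, 0.91, 0.87, 0.94])

mean = scores.mean()
# Normal-approximation 95% interval across the repeated runs.
half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
print("MSI probability: {:.3f} +/- {:.3f}".format(mean, half_width))
```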
 ## Training the Model from Scratch
 We've included a pre-trained model file in the ```model/``` directory for evaluation, but if you'd like to try to build a new version from scratch with your own data we've also included our training and testing script in ```main/mi_msi_train_test.py```
 
-```
-cd /path/to/mimsi
-python -m main.mi_msi_train_test --name 'model_name' --train-location /path/to/train/vectors/ --test-location /path/to/train/vectors/ --save 1 > train_test.log
-```
 
-Or as command-line tool
 ```
 mi_msi_train_test --name 'model_name' --train-location /path/to/train/vectors/ --test-location /path/to/train/vectors/ --save 1 > train_test.log
 ```
 
-Just note that you will need labels for any vectors utilized in training and testing. This can be accomplished by setting the ```--labeled``` flag in the Vector Creation script (see section above) and including the labels as a final column in the vector generation case list. The weights of the best performing model will be saved as a Torch ".model" file based on the name specified in the command args. Additionally, key metrics (TPR, FPR, Thresholds, AUROC) and train/test loss information are saved to disk as a numpy array. These metrics are also printed to standard out if you'd prefer to capture the results that way. Training a dataset of ~300 cases with ~1000 instances per case takes approximately 5 hours on 2 Nvidia DGX1 GPUs, see ```utils/train_test.lsf``` for an example lsf submission script.
+Just note that you will need labels for any vectors utilized in training and testing. This can be accomplished by setting the ```--labeled``` flag in the Vector Creation script (see section above) and including the labels as a final column in the vector generation case list. The weights of the best performing model will be saved as a Torch ".model" file based on the name specified in the command args. Additionally, key metrics (TPR, FPR, Thresholds, AUROC) and train/test loss information are saved to disk as a numpy array. These metrics are also printed to standard out if you'd prefer to capture the results that way.
 
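If you'd like to recompute the AUROC from the saved TPR/FPR metrics, a minimal sketch, assuming hypothetical file names for the saved arrays (the actual layout of the metrics on disk may differ):

```python
import numpy as np
from sklearn.metrics import auc

# Hypothetical file names based on the --name arg used during training.
fpr = np.load("model_name_fpr.npy")
tpr = np.load("model_name_tpr.npy")

# AUROC is the area under the ROC curve traced out by (FPR, TPR).
print("AUROC: {:.3f}".format(auc(fpr, tpr)))
```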
 ## Questions, Comments, Collaborations?
 Please feel free to reach out, I'm available via email - [email protected]
