miRNA expression quantification data from the GDC (https://gdc.cancer.gov/about) is used to predict the site of origin of the disease.
The CNN model is trained on a Titan Xp GPU and on Amazon SageMaker. The model is built with Keras on a TensorFlow backend. The results are compared with a simple SVM and logistic regression.
1. Python 3.6
2. TensorFlow 1.5
3. Keras
4. scikit-learn
5. pandas
I. Download
Expression Quantification data:
miRNA sequence data
1. Go to the data portal https://portal.gdc.cancer.gov/repository. On the left side there are two tabs: Files and Cases.
2. Click Cases and select a disease type: Liver Hepatocellular Carcinoma
3. Click Files and select:
Data Category: Transcriptome Profiling
Data Type: miRNA Expression Quantification
Experimental Strategy: miRNA-Seq
II. Click the Manifest download button. This downloads the manifest file for use with the GDC Data Transfer Tool.
III. Data Transfer Tool download: https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
Download the version that matches your OS.
Command line to download and unzip the OSX version:
Download:
wget -c -t 0 https://gdc.cancer.gov/files/public/file/gdc-client_v1.3.0_OSX_x64.zip
Unzip:
unzip gdc-client_v1.3.0_OSX_x64.zip
Download with gdc-client:
./<path-to-gdc-client>/gdc-client download -m <path-to-manifest-file>
e.g. ~/Downloads/gdc-client download -m ~/Downloads/gdc_manifest.2018-08-23.txt
1. Run check.py to check whether all the files have been downloaded. To download any remaining ones, use:
./<path-to-gdc-client>/gdc-client download <id>
e.g. ./gdc-client download fa63ce14-b9b5-4041-9df7-3b86ba9ede16
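The completeness check in step 1 can be sketched as below. This is not the repository's check.py, only a minimal illustration under two assumptions: the manifest is a tab-separated file with `id` and `filename` columns, and gdc-client saved each file as `<download_dir>/<id>/<filename>`. The function name `find_missing_files` is hypothetical.

```python
import csv
import os


def find_missing_files(manifest_path, download_dir):
    """Return the IDs of manifest entries whose file is not on disk.

    Assumes a tab-separated manifest with `id` and `filename` columns and
    a download layout of <download_dir>/<id>/<filename>.
    """
    missing = []
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            expected = os.path.join(download_dir, row["id"], row["filename"])
            if not os.path.isfile(expected):
                missing.append(row["id"])
    return missing
```

Each ID returned by `find_missing_files(<path-to-manifest-file>, <download-directory>)` can then be passed to `gdc-client download <id>` as shown above.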
2. Use parse_file_case_id.py to extract the file and case IDs for further use.
3. Download the JSON format of the data from the same website. Use parser.py to parse the data into labels and data.
The data is converted into 10 NumPy arrays:
batch_1.npy
batch_2.npy
...
batch_10.npy
each of shape (1000 x 1822), i.e. 1000 samples, each with 1822 features.
The labels are the place of origin and are one-hot encoded.
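For readers unfamiliar with one-hot encoding, here is a minimal sketch with NumPy. It is not taken from parser.py; the function name `one_hot_encode` and the example labels are purely illustrative.

```python
import numpy as np


def one_hot_encode(labels):
    """Map string labels to one-hot rows; returns (encoded, class_names)."""
    class_names = sorted(set(labels))
    index_of = {name: i for i, name in enumerate(class_names)}
    encoded = np.zeros((len(labels), len(class_names)), dtype=np.float32)
    for row, label in enumerate(labels):
        encoded[row, index_of[label]] = 1.0
    return encoded, class_names


# Illustrative labels only, not taken from the dataset.
sites = ["Liver", "Kidney", "Liver", "Breast"]
encoded, class_names = one_hot_encode(sites)
print(class_names)
print(encoded)
```

Each row of `encoded` has a single 1 in the column of its class, and `class_names` records which column belongs to which site.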
These NumPy arrays are fed to the CNN model in batches using a generator.
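A generator of this kind might look like the sketch below. It is an assumption about the approach, not the repository's code: the name `batch_generator`, the separate label files, and the default batch size of 32 are all hypothetical.

```python
import numpy as np


def batch_generator(data_paths, label_paths, batch_size=32):
    """Yield (features, labels) mini-batches forever, one file at a time."""
    while True:  # Keras-style generators are expected to loop indefinitely
        for data_path, label_path in zip(data_paths, label_paths):
            # Only one .npy file is held in memory at any time.
            data = np.load(data_path)
            labels = np.load(label_path)
            for start in range(0, len(data), batch_size):
                end = start + batch_size
                yield data[start:end], labels[start:end]
```

With Keras 2.x, a generator like this can be passed to `model.fit_generator` together with `steps_per_epoch`. Depending on the CNN's input layer, the feature batch may need to be reshaped first.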
As the distribution of classes is uneven, we put a hard threshold of 100 on the number of samples per class.
We also excluded classes with fewer than 500 samples.
Feel free to change the threshold as per your model.
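The thresholding can be sketched as follows. This is a minimal illustration rather than the project's code; `filter_rare_classes` and its `min_count` argument are hypothetical names.

```python
from collections import Counter


def filter_rare_classes(samples, labels, min_count=500):
    """Keep only samples whose class has at least `min_count` members."""
    counts = Counter(labels)
    kept = [(s, l) for s, l in zip(samples, labels) if counts[l] >= min_count]
    kept_samples = [s for s, _ in kept]
    kept_labels = [l for _, l in kept]
    return kept_samples, kept_labels
```

Changing `min_count` (for example to 100) is how you would try a different threshold.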
Distribution before threshold:
Example of Sparse Distribution:
Distribution after threshold 100 (26 classes only):
Distribution after threshold 500 (5 classes only):
Results with CNN:
Result with SVM:
As you can see, neither the simple SVM classifier nor the CNN works well. As a next step, we need to analyze the data further. The basic problem is that we have only 3557 samples for these 5 classes, each with 1881 input parameters, which makes it very hard for the model to converge.
Selecting only a subset of these input parameters and possibly increasing the depth of the CNN may be the next steps.
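One simple way to select a subset of input parameters is to keep the columns with the highest variance. The project has not done this; the sketch below is only one possible option, and `top_variance_features` and `k` are hypothetical names.

```python
import numpy as np


def top_variance_features(data, k):
    """Return the column indices of the k highest-variance features, sorted."""
    variances = data.var(axis=0)
    top = np.argsort(variances)[-k:]
    return np.sort(top)
```

The returned indices can be used to slice each batch, e.g. `batch[:, indices]`, before training.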
- Siddharth Bhonge - Parser /Model - https://github.com/siddharthbhonge