miRNA Sequences to predict Primary Site/Origin of Cancer

miRNA expression quantification data from the Genomic Data Commons (https://gdc.cancer.gov/about) is used to predict the site of origin of the disease.

Getting Started

The CNN model is trained on a Titan Xp GPU and on Amazon SageMaker. The model is built with Keras on a TensorFlow backend. Its results are compared with a simple SVM and with logistic regression.

Prerequisites

1. Python 3.6
2. TensorFlow 1.5
3. Keras
4. scikit-learn
5. pandas

Installing

I. Download the expression quantification data (miRNA sequence data)
1. Go to the data portal https://portal.gdc.cancer.gov/repository. On the left side there are two tabs: Files and Cases.
2. Click Cases and select a disease type: Liver Hepatocellular Carcinoma
3. Click Files and select Data Category: Transcriptome Profiling, Data Type: miRNA Expression Quantification, Experimental Strategy: miRNA-Seq

II. Click the Manifest download button. This downloads the manifest file for use with the GDC Data Transfer Tool.

III. Download the Data Transfer Tool: https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
Download the version that matches your OS.
Command lines to download and unzip the OSX version:

Download:
wget -c -t 0 https://gdc.cancer.gov/files/public/file/gdc-client_v1.3.0_OSX_x64.zip

Unzip:
unzip gdc-client_v1.3.0_OSX_x64.zip

Download the data with gdc-client:
./<path-to-gdc-client>/gdc-client download -m <path-to-manifest-file>
e.g. ~/Downloads/gdc-client download -m ~/Downloads/gdc_manifest.2018-08-23.txt
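
The download step can also be scripted. The sketch below is not code from this repository. It assumes that the manifest is a tab-separated file with an `id` column and that gdc-client stores each downloaded file in a directory named after its ID; the function names are hypothetical. It lists the manifest entries that have no directory yet and builds a per-file download command for each.

```python
import csv
from pathlib import Path


def missing_file_ids(manifest_path, download_dir):
    """Return manifest file IDs that have no directory in download_dir."""
    download_dir = Path(download_dir)
    with open(manifest_path, newline="") as handle:
        # Assumption: the manifest is tab-separated with an "id" column.
        rows = csv.DictReader(handle, delimiter="\t")
        ids = [row["id"] for row in rows]
    # Assumption: gdc-client stores each file in a directory named after its ID.
    return [file_id for file_id in ids if not (download_dir / file_id).is_dir()]


def download_command(client_path, file_id):
    """Build the argument list for downloading a single file by ID."""
    return [str(client_path), "download", file_id]
```

Each command list can then be passed to `subprocess.run` to fetch the corresponding file.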

Running the tests

1. Run check.py to check whether all the files have been downloaded. To download the remaining ones, use:


./<path-to-gdc-client>/gdc-client download <id>
e.g. ./gdc-client download fa63ce14-b9b5-4041-9df7-3b86ba9ede16

2. Use parse_file_case_id.py to extract the file and case IDs for further use.

3. Download the data in JSON format from the same website. Use parser.py to parse it into labels and data.
The data is converted into 10 NumPy arrays:

batch_1.npy
batch_2.npy
...
batch_10.npy

each of shape (1000 x 1822), i.e. 1000 samples, each with 1822 features.

The labels are the place of origin and are one-hot encoded.
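
The one-hot encoding and the split into batch files can be sketched as follows. This is a minimal illustration, not the repository's parser.py; the function names, the default batch size and the output directory argument are assumptions made for the example.

```python
import numpy as np


def one_hot(labels):
    """Return (one-hot matrix, sorted class names) for a list of label strings."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    encoded = np.zeros((len(labels), len(classes)), dtype=np.float32)
    # Set a single 1.0 per row, in the column of that row's class
    encoded[np.arange(len(labels)), [index[name] for name in labels]] = 1.0
    return encoded, classes


def save_batches(features, out_dir, batch_size=1000):
    """Write features to batch_1.npy, batch_2.npy, ... and return the paths."""
    paths = []
    for number, start in enumerate(range(0, len(features), batch_size), start=1):
        path = f"{out_dir}/batch_{number}.npy"
        np.save(path, features[start:start + batch_size])
        paths.append(path)
    return paths
```

Sorting the class names keeps the column order of the one-hot matrix reproducible between runs.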

Deployment

These NumPy arrays are fed to the CNN model in batches using a generator.
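
A generator of this kind might look like the sketch below. It is an illustration under assumptions, not the repository's code: the labels are assumed to be saved in separate .npy files that parallel the feature files, the chunk size is arbitrary, and the trailing channel axis assumes a 1D convolutional input.

```python
import numpy as np


def batch_generator(feature_paths, label_paths, chunk_size=32):
    """Yield (features, labels) chunks forever, one saved array at a time."""
    while True:  # Keras-style generators are expected to loop indefinitely
        for feature_path, label_path in zip(feature_paths, label_paths):
            features = np.load(feature_path)
            labels = np.load(label_path)
            for start in range(0, len(features), chunk_size):
                stop = start + chunk_size
                # Add a trailing channel axis (assumed 1D convolution input)
                yield features[start:stop, :, np.newaxis], labels[start:stop]
```

Only one file is held in memory at a time. In Keras 2.x such a generator can be passed to `model.fit_generator` together with `steps_per_epoch`.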

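As the distribution of classes is uneven, we put a hard threshold on the number of samples per class, first 100 and then 500.
Classes with fewer samples than the threshold are excluded; the results below use the threshold of 500.
Feel free to change the threshold to suit your model.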
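
The thresholding step can be sketched in a few lines. This is a minimal illustration, not the code used in this repository; `filter_by_class_size` and its parameters are hypothetical names.

```python
from collections import Counter


def filter_by_class_size(samples, labels, min_samples=500):
    """Keep only samples whose class has at least min_samples members."""
    counts = Counter(labels)
    kept = [(s, l) for s, l in zip(samples, labels) if counts[l] >= min_samples]
    kept_samples = [s for s, _ in kept]
    kept_labels = [l for _, l in kept]
    return kept_samples, kept_labels
```

Calling it with `min_samples=100` instead of 500 gives the looser of the two thresholds described above.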

Distribution before threshold:

Example of Sparse Distribution:

Distribution after threshold 100 (26 classes only):

Distribution after threshold 500 (5 classes only):

CNN Architecture:

Results with CNN:

Result with SVM:

As the results show, neither a simple SVM classifier nor the CNN works well. As a next step we need to analyze the data further. The basic problem is that we have only 3557 samples for these 5 classes, each with 1881 input parameters, which makes it very hard for the model to converge.
Selecting only some of these features and possibly increasing the depth of the CNN may be the next steps.
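
One simple way to select a subset of the features, which this repository does not implement, is to keep the columns with the highest variance across samples. The sketch below assumes a 2D NumPy array of shape (samples, features); the function name and the choice of variance as the criterion are assumptions for illustration only.

```python
import numpy as np


def top_variance_features(features, k):
    """Return (reduced matrix, kept column indices) for the k highest-variance columns."""
    variances = features.var(axis=0)
    # Order columns by descending variance, take the first k, then restore column order
    keep = np.sort(np.argsort(variances)[::-1][:k])
    return features[:, keep], keep
```

The returned indices should be computed on the training data only and then reused to reduce the validation and test data.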

Authors

Siddharth Bhonge - Parser / Model - https://github.com/siddharthbhonge

Acknowledgments

Yue Shi https://github.com/yuesOctober/GDCproject/tree/yue
