We worked with an open CSV dataset consisting of RNA sequences labeled with several taxonomies. Using Python, we built an XGBoost model that classifies a sequence into 1 of 19 different taxonomies. We also used Markov chains to preprocess the data.

santiagoahl/RNA-taxonomy-prediction

RNA Taxonomy Classification

An XGBoost Multiclass classifier built in scikit-learn using Markov Chains.

scikit-learn Numpy joblib json

Key Features · How To Use · Credits · License

screenshot

Key Features

  • This machine learning model takes an RNA sequence and predicts which class it belongs to. The classes are taxonomies (viral families). The 19 available taxonomies are:

    • Orthomyxoviridae
    • Rhabdoviridae
    • Arteriviridae
    • Coronaviridae
    • Reoviridae
    • Caliciviridae
    • Phenuiviridae
    • Hantaviridae
    • Picornaviridae
    • Betaflexiviridae
    • Astroviridae
    • Closteroviridae
    • Flaviviridae
    • Potyviridae
    • Retroviridae
    • Togaviridae
    • Paramyxoviridae
    • Hepeviridae
    • Pneumoviridae
  • Before prediction, the model builds a Markov chain whose states are the 64 possible codons over the nucleotides A, C, G, T, and then computes 18 metrics on the associated adjacency (transition) matrix: 8 are matrix norms, and the remaining 10 are the moduli (complex norms) of the first 10 eigenvalues, sorted in ascending order. Namely:

    • Frobenius Norm
    • Nuclear Norm
    • Infty Norm
    • Neg Infty Norm
    • Neg L1 Norm
    • L1 Norm
    • Neg L2 Norm
    • L2 Norm
    • eig 1
    • eig 2
    • eig 3
    • eig 4
    • eig 5
    • eig 6
    • eig 7
    • eig 8
    • eig 9
    • eig 10
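
The feature extraction described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `transition_matrix` and `features` are hypothetical, and it assumes that the sequence is read as non-overlapping codons starting at position 0, that rows are normalized to probabilities, that the eight norms map onto the `ord` values accepted by `numpy.linalg.norm`, and that "first 10 eigenvalues" means the 10 smallest moduli after sorting.

```python
from itertools import product

import numpy as np

# All 64 codons over the alphabet A, C, G, T, in a fixed order.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {codon: i for i, codon in enumerate(CODONS)}


def transition_matrix(sequence: str) -> np.ndarray:
    """Row-normalized 64 x 64 codon-to-codon transition matrix.

    Assumptions (not confirmed by the README): codons are non-overlapping
    and start at position 0; U is treated as T; codons containing any other
    symbol are dropped, so their neighbors are treated as adjacent.
    """
    seq = sequence.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    codons = [c for c in codons if c in INDEX]
    counts = np.zeros((64, 64))
    for current, following in zip(codons, codons[1:]):
        counts[INDEX[current], INDEX[following]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions stay all-zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


def features(matrix: np.ndarray) -> np.ndarray:
    """Return the 18 metrics: 8 matrix norms followed by 10 eigenvalue moduli."""
    # Same order as the list above: Frobenius, nuclear, inf, -inf, -1, 1, -2, 2.
    orders = ("fro", "nuc", np.inf, -np.inf, -1, 1, -2, 2)
    norms = [np.linalg.norm(matrix, ord=o) for o in orders]
    # Assumption: sort the moduli in ascending order, then keep the first 10.
    moduli = np.sort(np.abs(np.linalg.eigvals(matrix)))
    return np.array(norms + list(moduli[:10]))
```

Calling `features(transition_matrix(sequence))` then yields one 18-element row per sequence, which is the shape of the new dataset the next paragraph refers to.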

With these new metrics we built a new dataset, which yields the following scatter plot: screenshot

  • We trained a Random Forest model on the new dataset. screenshot
  • We achieved an F1 score of 96.9% on the validation set.
  • The confusion matrix is the following:

screenshot

  • The learning curve is the following: screenshot
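
To make the relationship between the confusion matrix and the F1 score concrete, here is a small pure-Python sketch. The README does not say how the per-class scores were averaged, so macro averaging is an assumption here; the helper names and the toy labels are made up for illustration and are not results from this project.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # true k, predicted other
        fp = sum(matrix[r][k] for r in range(n)) - tp   # predicted k, true other
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Toy example with two of the 19 classes (illustrative labels only).
labels = ["Coronaviridae", "Flaviviridae"]
y_true = ["Coronaviridae", "Coronaviridae", "Flaviviridae", "Flaviviridae"]
y_pred = ["Coronaviridae", "Flaviviridae", "Flaviviridae", "Flaviviridae"]
cm = confusion_matrix(y_true, y_pred, labels)
score = macro_f1(cm)
```

With 19 classes the same functions apply unchanged; only `labels` grows.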

How To Use

To clone and run this application, follow these steps:

# Clone this repository
$ git clone https://github.com/santiagoahl/rna-taxonomy-prediction.git

# Go into the repository
$ cd rna-taxonomy-prediction

# Launch Jupyter Notebook
$ jupyter-notebook

# In the notebook, run the Libraries & Modules cell
# Then run the Model Import cell

Credits

This software uses the following packages:

License

MIT


Web Site santiagoal.super.site  ·  GitHub @santiagoahl  ·  Twitter @sahumadaloz
