This repository showcases the core functionalities of giotto-learn, a Python library for topological machine learning. The accompanying blog post can be found here.
This demo is based on the Predicting Molecular Properties competition on Kaggle, where the task is to predict the scalar coupling constant (loosely, the strength of the interaction) between pairs of atoms in molecules.
The easiest way to get started is to create a conda environment as follows:
conda create python=3.7 --name molecule -y
conda activate molecule
pip install -r requirements.txt
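The scoring function is described on Kaggle and is calculated as follows:

$$
\text{score} = \frac{1}{T} \sum_{t=1}^{T} \log\left( \frac{1}{n_t} \sum_{i=1}^{n_t} \left| y_i - \hat{y}_i \right| \right)
$$

where:
- $T$ is the number of coupling types
- $n_t$ is the number of observations of type $t$
- $y_i$ is the actual coupling value for sample $i$
- $\hat{y}_i$ is the predicted coupling value for sample $i$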
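
As a rough illustration, the score (the log of the mean absolute error within each coupling type, averaged over all types) could be computed along these lines. This is a minimal sketch and not the code used in this repository: the function name `group_log_mae` and the `floor` parameter are assumptions. The floor follows the usual convention of bounding the per-type error at 1e-9 so that perfect predictions do not produce `log(0)`.

```python
import numpy as np


def group_log_mae(y_true, y_pred, types, floor=1e-9):
    """Average over coupling types of log(MAE within that type).

    `floor` (an assumption of this sketch) bounds each per-type MAE
    from below so that log(0) never occurs for perfect predictions.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    types = np.asarray(types)

    log_maes = []
    for t in np.unique(types):
        mask = types == t  # observations belonging to coupling type t
        mae = np.mean(np.abs(y_true[mask] - y_pred[mask]))
        log_maes.append(np.log(max(mae, floor)))

    # Mean over the T coupling types
    return float(np.mean(log_maes))


# Illustrative data only: two coupling types with two samples each
example_types = ["1JHC", "1JHC", "2JHH", "2JHH"]
example_true = [1.0, 2.0, 3.0, 4.0]
example_pred = [1.5, 2.5, 3.0, 5.0]
example_score = group_log_mae(example_true, example_pred, example_types)
```

Lower scores are better. Because the logarithm is applied per type before averaging, a coupling type with small absolute errors contributes as much to the score as one with large errors.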
The figure below summarizes the results, comparing performance with and without topological features.
The following Kaggle notebooks were used for this project:
- For non-topological features: https://www.kaggle.com/robertburbidge/distance-features
- For plotting molecules (but adapted): https://www.kaggle.com/mykolazotko/3d-visualization-of-molecules-with-plotly
For an introduction to the application of topological data analysis to machine learning, see:
- An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists: https://arxiv.org/pdf/1710.04019.pdf
The idea of using topological data analysis for predictions on molecules is not new. The following papers explore it:
- Persistent-Homology-based Machine Learning and its Applications – A Survey: https://arxiv.org/abs/1811.00252 (esp. section 5)
- Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening: https://arxiv.org/pdf/1708.08135.pdf
The following papers inspired the feature creation:
- The Ring of Algebraic Functions on Persistence Bar Codes: https://arxiv.org/pdf/1304.0530.pdf
- A topological approach for protein classification: https://arxiv.org/pdf/1510.00953.pdf