Learning Drug Functions from Chemical Structures with Convolutional Neural Networks and Random Forests. Jesse G Meyer, Shengchao Liu, Ian J Miller, Joshua J Coon, Anthony Gitter. Journal of Chemical Information and Modeling. 2019, 59(10), 4438-4449.
The folder 'MFP_RF' contains the code for the molecular fingerprint + random forest models, and the folder 'IMG_CNN' contains the code for the image + convolutional neural network models.
All the data for the small dataset is available in the 'small_data_676_chems' folder, including:
- Notebook with data preparation 'aliper_small_data_prep.ipynb'
- SMILES strings for the molecules in each class, one .csv file per class: '[class]_smiles_rmsalt.csv'
- Pictures of each molecule in folders by class 'small_data_676_chems/pics'
- Example lists for the 3-, 5-, and 12-class problems, giving each molecule's picture path, class, and SMILES string. These files are used with Fast.ai for training: 3cls_aliper.csv, 5cls_aliper.csv, 12cls_aliper.csv.
- Files containing the validation set indexes, which refer to lines in the example list files described in the previous item: '3cls_aliper_10fold[1-9].csv', '5cls_aliper_10fold[1-9].csv', '12cls_aliper_10fold[1-9].csv'
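Before training on the fold files, it can be useful to check whether any line index appears in more than one validation fold. The sketch below is only an illustration: it assumes the folds are meant to be disjoint (this README does not say so), and the helper name `overlapping_folds` and the demo indexes are hypothetical, not part of the repository.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def overlapping_folds(folds: Dict[str, List[int]]) -> List[Tuple[str, str, Set[int]]]:
    """Return (name_a, name_b, shared_indexes) for every pair of folds that overlap."""
    overlaps = []
    for (name_a, ids_a), (name_b, ids_b) in combinations(folds.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps.append((name_a, name_b, shared))
    return overlaps


# Made-up indexes for illustration only; real values would be read
# from the fold .csv files (their exact layout is an assumption here).
demo_folds = {
    "fold1": [0, 5, 9],
    "fold2": [1, 6, 9],
    "fold3": [2, 7],
}
demo_overlaps = overlapping_folds(demo_folds)
```

An empty result means no index is shared between any two folds.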
All the data is available from the 'data' folder, including:
- Raw lists downloaded from PubChem (data/frompubchem/)
- SMILES strings by class (data/SMILES/)
- Pictures organized in folders by class (data/pics/)
Files containing the examples (path to the PNG and SMILES string) and their class annotations are in data/pics/:
- data/pics/12cls_rmsaltol.csv
- data/pics/5cls_rmsaltol.csv
- data/pics/3cls_rmsaltol.csv
The file containing molecular fingerprints is: data/fingerprints_12cls.csv
The line indexes of the validation examples for each of the 5 validation sets refer to the example/class list files above. They are in the same folder and are named by subtask:
- 12cls_val_ids[0-4].csv
- 5cls_val_ids[0-4].csv
- 3cls_val_ids[0-4].csv
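The way a validation index file selects lines from an example list can be sketched as follows. This is a minimal sketch using only the standard library, not the authors' training code. It assumes the indexes are 0-based positions among the data rows (header excluded), which the README does not confirm. The helper names `read_rows` and `split_by_validation_ids` and the demo rows are hypothetical.

```python
import csv
from typing import Iterable, List, Sequence, Tuple


def read_rows(path: str) -> List[List[str]]:
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def split_by_validation_ids(
    rows: Sequence[List[str]], val_ids: Iterable[int]
) -> Tuple[List[List[str]], List[List[str]]]:
    """Split data rows into (train, validation) by row position.

    Assumption: val_ids are 0-based positions into `rows`.
    """
    val_set = set(val_ids)
    train = [row for i, row in enumerate(rows) if i not in val_set]
    valid = [row for i, row in enumerate(rows) if i in val_set]
    return train, valid


# Tiny in-memory demonstration with made-up rows (not real data):
# each row is (picture path, class, SMILES string).
demo_rows = [
    ["pics/a.png", "classA", "CCO"],
    ["pics/b.png", "classB", "CCN"],
    ["pics/c.png", "classA", "CCC"],
    ["pics/d.png", "classB", "CCCl"],
]
demo_train, demo_valid = split_by_validation_ids(demo_rows, [1, 3])
```

With the real files, the rows would come from something like `read_rows("data/pics/5cls_rmsaltol.csv")` and the indexes from one of the `5cls_val_ids[0-4].csv` files. Whether a header row needs to be skipped, and whether the indexes are 0- or 1-based, should be checked against the files themselves.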
All multi-class data, including the method to perform multi-class splitting, is available under 'multiclass_data':
- all_chem_df.csv is the master list with all the info for the chemicals
- get_data.ipynb is used to get the chemicals from the original downloads with CIDs from PubChem. It also cleans the multiclass data by removing salts and repeated chemicals
- multiclass_5foldCV.ipynb contains the training loop where metrics are computed
- multiclass_data/pics contains the actual chemical images referred to in 'all_chem_df.csv'
- The Python environment for the multiclass data was different and uses fastai version 1. The file fastai_v1.yml contains the environment information needed to recreate it
- As with the other 2 CNN models, the validation indexes referring to lines in all_chem_df.csv are given in multilabel_iter5fold_[0-4].csv
- multiclass_data/figures gives the network analysis figures