v2.4.3 - bug fixes, unit tests, docs

amckenna41 · Nov 23, 2023 · 9ad2ca6
1 parent 8e9c0dc

Showing 10 changed files with 65 additions and 67 deletions.
diff --git a/README.md b/README.md
@@ -18,14 +18,19 @@
 <!-- [![DOI](https://zenodo.org/badge/344290370.svg)](https://zenodo.org/badge/latestdoi/344290370) -->
 <!-- [![Documentation Status](https://readthedocs.org/projects/ansicolortags/badge/?version=latest)](http://ansicolortags.readthedocs.io/?badge=latest) -->
 
+`pySAR` is a Python library for analysing Sequence Activity Relationships (SARs)/Sequence Function Relationships (SFRs) of protein sequences. 
+
+* The published research article is available [here][article].
+* A quick Colab notebook demo of `pySAR` is available [here][demo]. 
+* A **Medium** article that dives deeper into SARs and the `pySAR` software itself is available [here][medium].
+
 Table of Contents
 =================
   * [Introduction](#Introduction)
   * [Requirements](#requirements)
   * [Installation](#installation)
   * [Usage](#usage)
   * [Directories](#directories)
-  * [Tests](#tests)
   * [Issues](#Issues)
   * [Contact](#contact)
   * [License](#license)
@@ -34,7 +39,7 @@ Table of Contents
 
 Research Article
 ================
-The research article that accompanied this software is titled: "Machine Learning Based Predictive Model for the Analysis of Sequence Activity Relationships Using Protein Spectra and Protein Descriptors" and was published in the Journal of Biomedical Informatics and is available [here][article] [[1]](#references). There is also a quick <b>Colab notebook demo</b> of `pySAR` available [here][demo].
+The research article that accompanied this software is titled: "Machine Learning Based Predictive Model for the Analysis of Sequence Activity Relationships Using Protein Spectra and Protein Descriptors" and was published in the Journal of Biomedical Informatics and is available [here][article] [[1]](#references).
 
 How to cite
 ===========
@@ -46,10 +51,12 @@ Introduction
 
 After finding the optimal technique and feature set at which to numerically encode your dataset of sequences, `pySAR` can then be used to build a predictive regression ML model with the training data being that of the encoded protein sequences, and training labels being the in vitro experimentally pre-calculated activity values for each protein sequence. This model maps a set of protein sequences to the sought-after activity value, being able to accurately predict the activity/fitness value of new unseen sequences. The use-case for the software is within the field of Protein Engineering, Directed Evolution and or Drug Discovery, where a user has a set of in vitro experimentally determined activity/fitness values for a library of mutant protein sequences and wants to computationally predict the sought activity value for a selection of mutated unseen sequences, in the aim of finding the best sequence that minimises/maximises their activity value. <br>
 
-In the published [research][article], the sought activity/fitness characterisitc is the thermostability of proteins from a recombination library designed from parental cytochrome P450's. This thermostability is measured using the T50 metric (temperature at which 50% of a protein is irreversibly denatured after 10 mins of incubation, ranging from 39.2 to 64.4 degrees C), which we want to maximise [[1]](#references).
+In the published [research][article], the sought activity/fitness characteristic is the thermostability of proteins from a recombination library designed from parental cytochrome P450's. This thermostability is measured using the T50 metric (temperature at which 50% of a protein is irreversibly denatured after 10 mins of incubation, ranging from 39.2 to 64.4 degrees C), which we want to maximise [[1]](#references).
 
 Two additional <strong>custom-built</strong> softwares were created alongside `pySAR` - [`aaindex`][aaindex] and [`protpy`][protpy]. The `aaindex` software package is used for parsing the amino acid index which is a database of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids [[2]](#references). `protpy` is used for calculating a series of protein physiochemical, biochemical and structural protein descriptors. Both of these software packages are integrated into `pySAR` but can also be used individually for their respective purposes. 
 
+**A quick Colab notebook demo of `pySAR` is available [here][demo]. There is also a Medium article that dives deeper into SARs and the `pySAR` software itself, available [here][medium].** 
+
 Requirements
 ============
 * [Python][python] >= 3.8
@@ -521,19 +528,6 @@ Issues
 ======
 Any issues, errors or bugs can be raised via the [Issues](https://github.com/amckenna41/pySAR/issues) tab in the repository.
 
-Tests
-=====
-To run all tests, from the main `pySAR` repo folder run:
-```
-python3 -m unittest discover tests
-```
-
-To run tests for specific module, from the main `pySAR` repo folder run:
-```
-python -m unittest tests.MODULE_NAME -v
--v: verbose output flag
-```
-
 Contact
 =======
 If you have any questions or comments, please contact [email protected] or raise an issue on the [Issues][Issues] tab. <br><br>
@@ -579,4 +573,5 @@ DOI: 10.1021/acs.jcim.0c00073 <br><br>
 [demo]: https://colab.research.google.com/drive/1hxtnf8i4q13fB1_2TpJFimS5qfZi9RAo?usp=sharing
 [Issues]: https://github.com/amckenna41/pySAR/issues
 [license]: https://github.com/amckenna41/pySAR/blob/master/LICENSE
-[config]: https://github.com/amckenna41/pySAR/blob/master/CONFIG.md
+[config]: https://github.com/amckenna41/pySAR/blob/master/CONFIG.md
+[medium]: https://ajmckenna69.medium.com/pysar-a3de9f71733f
diff --git a/TODO.md b/TODO.md
@@ -269,8 +269,10 @@ To Do List:
 - [X] Add info about the colunns and dimensions of each descriptors in pre-calculated csv file - fix Issue.
 - [X] When calculating all descriptors (get_all_descriptors(export=True)), add some sort of print/tracking functionality.
 - [X] Double check all links in readme.
-- [ ] Add dimensions of each dataset to https://github.com/amckenna41/pySAR/tree/master/example_datasets.
+- [X] Add dimensions of each dataset to https://github.com/amckenna41/pySAR/tree/master/example_datasets.
 - [ ] Go over references in descriptors module - refer to protpy.
 - [X] Update distance matrices in configs - test once protpy published.
-- [ ] Add link to medium article.
-- [X] Update aaindex version on readme.
+- [X] Add link to medium article.
+- [X] Update aaindex version on readme.
+- [X] Add elapsed time for each case study - calculating protein descriptors on demo.
+- [ ] readthedocs(https://github.com/MartinThoma/propy3/tree/master).
diff --git a/example_datasets/README.md b/example_datasets/README.md
@@ -4,11 +4,10 @@ Datasets
 --------
 * `thermostability.txt` - dataset studied in the associated work which consists of a dataset to measure the thermostability of various mutants
 from a recombination library designed from parental cytochrome P450's, measured using the T50 metric (temperature at which 50% of a protein is
-irreversibly denatured after 10 mins of incubation, ranging from 39.2 to 64.4 degrees C), which represents the protein activity of this dataset [[1]](#references).
-* `absorption.txt` - dataset of 80 blue and red-shifted protein variants of the Gloeobacter Violaceus Rhodopsin (GR) protein that were mutated and substituted to tune its peak absorption wavelength. 1-5 mutations were generated in the course of tuning its absorption wavelength, for a total of 81 sequences, with the peak being captured as each sequence's activity ranging from values of 454 to 622 [[2]](#references).
-* `enantioselectivity.txt` - dataset consisting of 37 mutants and one WT (wild-type) sequence from the Aspergillus Niger organism and their calculated enantioselectivity. Enantioselectivity refers to the selectivity of a reaction towards one enantiomer and is expressed by the E-value with a range between 0 and 115 [[3]](#references).
-* `localization.txt` - dataset made up of 248 sequences made up of 2 seperate, 10-block recombination libraries that were designed from 3 parental ChR's (channelrhodopsin). Each chimeric ChR variant in these libraries consist of blocks of sequences from parental ChRs. Genes for these sequences were synthesized and expressed in human embryonic kidney (HEK) cells, and their membrane localization was measured as log_GFP ranging from values of -9.513 to 105 [[4]](#references).
-
+irreversibly denatured after 10 mins of incubation, ranging from 39.2 to 64.4 degrees C), which represents the protein activity of this dataset [[1]](#references). **Dataset dimensions: 261 x 5**.
+* `absorption.txt` - dataset of 80 blue and red-shifted protein variants of the Gloeobacter Violaceus Rhodopsin (GR) protein that were mutated and substituted to tune its peak absorption wavelength. 1-5 mutations were generated in the course of tuning its absorption wavelength, for a total of 81 sequences, with the peak being captured as each sequence's activity ranging from values of 454 to 622 [[2]](#references). **Dataset dimensions: 81 x 5**.
+* `enantioselectivity.txt` - dataset consisting of 37 mutants and one wild-type (WT) sequence from the Aspergillus Niger organism and their calculated enantioselectivity. Enantioselectivity refers to the selectivity of a reaction towards one enantiomer and is expressed by the E-value with a range between 0 and 115 [[3]](#references). **Dataset dimensions: 152 x 5**.
+* `localization.txt` - dataset made up of 248 sequences made up of 2 separate, 10-block recombination libraries that were designed from 3 parental channelrhodopsin (ChRs). Each chimeric ChR variant in these libraries consist of blocks of sequences from parental ChRs. Genes for these sequences were synthesized and expressed in human embryonic kidney (HEK) cells, and their membrane localization was measured as log_GFP ranging from values of -9.513 to 105 [[4]](#references). **Dataset dimensions: 254 x 5**.
 * `descriptors_absorption.csv` - pre-calculated protein descriptors using sequences from absorption test dataset. The dimensions for this csv are 81 x 9714 (81 protein sequences and 9714 features), when using default parameters as in the config file.
 * `descriptors_enantioselectivity.csv` - pre-calculated protein descriptors using sequences from enantioselectivity test dataset. The dimensions for this csv are 152 x 9714 (152 protein sequences and 9714 features), when using default parameters as in the config file.
 * `descriptors_localization.csv` - pre-calculated protein descriptors using sequences from localization test dataset. The dimensions for this csv are 254 x 9714 (254 protein sequences and 9714 features), when using default parameters as in the config file.

diff --git a/pySAR/__init__.py b/pySAR/__init__.py
@@ -1,6 +1,6 @@
 """ pySAR software metadata. """
 __name__ = 'pySAR'
-__version__ = "2.4.2"
+__version__ = "2.4.3"
 __description__ = 'A Python package used to analysis Sequence Activity Relationships (SARs) of protein sequences and their mutants using Machine Learning.'
 __author__ = 'AJ McKenna: https://github.com/amckenna41'
 __authorEmail__ = '[email protected]'

diff --git a/pySAR/descriptors.py b/pySAR/descriptors.py
@@ -104,6 +104,7 @@ def __init__(self, config_file="", protein_seqs=None, **kwargs):
 
         self.config_file = config_file
         self.protein_seqs = protein_seqs
+        self.kwargs = locals()['kwargs'] #get any keyword argument variables of class
         self.config_parameters = {}
 
         desc_config_filepath = ""
@@ -132,8 +133,8 @@ def __init__(self, config_file="", protein_seqs=None, **kwargs):
         self.desc_parameters = Map(self.config_parameters["descriptors"])
 
         #set dataset and descriptors csv filepath from kwargs, if applicable, or the config file values
-        self.dataset_filepath = kwargs.get('dataset_filepath') if 'dataset_filepath' in kwargs else self.dataset_parameters["dataset"]
-        self.descriptors_csv = kwargs.get('descriptors_csv') if 'descriptors_csv' in kwargs else self.desc_parameters.descriptors_csv
+        self.dataset_filepath = self.kwargs.get('dataset') if 'dataset' in self.kwargs else self.dataset_parameters["dataset"]
+        self.descriptors_csv = self.kwargs.get('descriptors_csv') if 'descriptors_csv' in self.kwargs else self.desc_parameters.descriptors_csv
 
         #import protein sequences from dataset if not directly specified in protein_seqs input param
         if not (isinstance(self.protein_seqs, pd.Series)):

diff --git a/pySAR/encoding.py b/pySAR/encoding.py
@@ -200,7 +200,7 @@ def aai_encoding(self, aai_indices=None, sort_by='R2', output_folder=""):
             #generate protein spectra from pyDSP class if use_dsp is true, pass in all DSP related parameters, use object as training data
             if (self.use_dsp):
                 pyDSP = PyDSP(self.config_file, protein_seqs=encoded_seqs, spectrum=self.spectrum, window_type=self.window_type, filter_type=self.filter_type)
-                pyDSP.encode_seqs()
+                pyDSP.encode_sequences()
                 X = pd.DataFrame(pyDSP.spectrum_encoding)
             else:
                 #aai index encoding set as training data
@@ -336,8 +336,8 @@ def descriptor_encoding(self, descriptors=[], desc_combo=1, sort_by='R2', output
         mae_ = []
         explained_var_ = []
 
-        #create instance of descriptors class using config file
-        desc = Descriptors(self.config_file)
+        #create instance of descriptors class using config file and any kwargs
+        desc = Descriptors(self.config_file, **self.kwargs)
 
         #if no descriptors passed into descriptors input param then use all descriptors by default,
         #get list of all descriptors according to desc_combo value
@@ -607,8 +607,8 @@ def aai_descriptor_encoding(self, aai_indices=[], descriptors=[], desc_combo=1,
             if not (index in aaindex1.record_codes()):
                 raise ValueError("AAI record {} not found in list of available record codes.".format(index))
 
-        #create instance of Descriptors class
-        desc = Descriptors(config_file=self.config_file)
+        #create instance of Descriptors class using config file and any kwargs
+        desc = Descriptors(config_file=self.config_file, **self.kwargs)
 
         #raise error if invalid parameter data types input
         if ((not isinstance(descriptors, list)) and (not isinstance(descriptors, str))):
@@ -688,7 +688,7 @@ def aai_descriptor_encoding(self, aai_indices=[], descriptors=[], desc_combo=1,
             #generate protein spectra from pyDSP class if use_dsp is true, pass in all DSP related parameters, use object as training data
             if (self.use_dsp):
                 pyDSP = PyDSP(self.config_file, protein_seqs=encoded_seqs, spectrum=self.spectrum, window_type=self.window_type, filter_type=self.filter_type)
-                pyDSP.encode_seqs()
+                pyDSP.encode_sequences()
                 X_aai = pd.DataFrame(pyDSP.spectrum_encoding)
             else:
                 X_aai = pd.DataFrame(encoded_seqs)
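
The `descriptors.py` and `encoding.py` hunks above make keyword arguments override config values and forward them from the encoding class to `Descriptors`. A minimal sketch of that pattern follows. The class bodies, the dict-based config and the file names are hypothetical stand-ins, not pySAR's actual implementation. Only the `dataset` and `descriptors_csv` keys are taken from the diff.

```python
class Descriptors:
    """Hypothetical stand-in for the Descriptors class (illustration only)."""

    def __init__(self, config=None, **kwargs):
        # Assumed config layout; the real class loads these values from a config file.
        config = config or {
            "dataset": "config_dataset.txt",
            "descriptors_csv": "config_descriptors.csv",
        }
        self.kwargs = kwargs
        # A keyword argument, when supplied, takes precedence over the config value.
        self.dataset_filepath = kwargs.get("dataset", config["dataset"])
        self.descriptors_csv = kwargs.get("descriptors_csv", config["descriptors_csv"])


class Encoding:
    """Hypothetical stand-in for the encoding class (illustration only)."""

    def __init__(self, config=None, **kwargs):
        self.config = config
        self.kwargs = kwargs

    def descriptor_encoding(self):
        # Forward the same overrides so both objects agree on which files to use.
        return Descriptors(self.config, **self.kwargs)


enc = Encoding(dataset="my_dataset.txt")
desc = enc.descriptor_encoding()
print(desc.dataset_filepath)  # overridden by the keyword argument
print(desc.descriptors_csv)   # falls back to the config value
```

In this sketch `kwargs.get(key, default)` behaves the same as the `kwargs.get(key) if key in kwargs else default` form in the diff. Without the forwarding step, a `dataset` override given to the encoding object would not reach the `Descriptors` instance, which would fall back to the config file.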