Commit

v2.4.3 - bug fixes, unit tests, docs
amckenna41 committed Nov 23, 2023
1 parent 8e9c0dc commit 9ad2ca6
Showing 10 changed files with 65 additions and 67 deletions.
29 changes: 12 additions & 17 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,14 +18,19 @@
<!-- [![DOI](https://zenodo.org/badge/344290370.svg)](https://zenodo.org/badge/latestdoi/344290370) -->
<!-- [![Documentation Status](https://readthedocs.org/projects/ansicolortags/badge/?version=latest)](http://ansicolortags.readthedocs.io/?badge=latest) -->

`pySAR` is a Python library for analysing Sequence Activity Relationships (SARs)/Sequence Function Relationships (SFRs) of protein sequences.

* The published research article is available [here][article].
* A quick Colab notebook demo of `pySAR` is available [here][demo].
* A **Medium** article that dives deeper into SARs and the `pySAR` software itself is available [here][medium].

Table of Contents
=================
* [Introduction](#Introduction)
* [Requirements](#requirements)
* [Installation](#installation)
* [Usage](#usage)
* [Directories](#directories)
* [Tests](#tests)
* [Issues](#Issues)
* [Contact](#contact)
* [License](#license)
Expand All @@ -34,7 +39,7 @@ Table of Contents

Research Article
================
The research article that accompanied this software is titled: "Machine Learning Based Predictive Model for the Analysis of Sequence Activity Relationships Using Protein Spectra and Protein Descriptors" and was published in the Journal of Biomedical Informatics and is available [here][article] [[1]](#references). There is also a quick <b>Colab notebook demo</b> of `pySAR` available [here][demo].
The research article that accompanied this software is titled: "Machine Learning Based Predictive Model for the Analysis of Sequence Activity Relationships Using Protein Spectra and Protein Descriptors" and was published in the Journal of Biomedical Informatics and is available [here][article] [[1]](#references).

How to cite
===========
Expand All @@ -46,10 +51,12 @@ Introduction

After finding the optimal technique and feature set at which to numerically encode your dataset of sequences, `pySAR` can then be used to build a predictive regression ML model with the training data being that of the encoded protein sequences, and training labels being the in vitro experimentally pre-calculated activity values for each protein sequence. This model maps a set of protein sequences to the sought-after activity value, being able to accurately predict the activity/fitness value of new unseen sequences. The use-case for the software is within the field of Protein Engineering, Directed Evolution and/or Drug Discovery, where a user has a set of in vitro experimentally determined activity/fitness values for a library of mutant protein sequences and wants to computationally predict the sought activity value for a selection of mutated unseen sequences, with the aim of finding the best sequence that minimises/maximises their activity value. <br>

In the published [research][article], the sought activity/fitness characterisitc is the thermostability of proteins from a recombination library designed from parental cytochrome P450's. This thermostability is measured using the T50 metric (temperature at which 50% of a protein is irreversibly denatured after 10 mins of incubation, ranging from 39.2 to 64.4 degrees C), which we want to maximise [[1]](#references).
In the published [research][article], the sought activity/fitness characteristic is the thermostability of proteins from a recombination library designed from parental cytochrome P450's. This thermostability is measured using the T50 metric (temperature at which 50% of a protein is irreversibly denatured after 10 mins of incubation, ranging from 39.2 to 64.4 degrees C), which we want to maximise [[1]](#references).

Two additional <strong>custom-built</strong> softwares were created alongside `pySAR` - [`aaindex`][aaindex] and [`protpy`][protpy]. The `aaindex` software package is used for parsing the amino acid index which is a database of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids [[2]](#references). `protpy` is used for calculating a series of protein physiochemical, biochemical and structural protein descriptors. Both of these software packages are integrated into `pySAR` but can also be used individually for their respective purposes.

**A quick Colab notebook demo of `pySAR` is available [here][demo]. There is also a Medium article that dives deeper into SARs and the `pySAR` software itself, available [here][medium].**

Requirements
============
* [Python][python] >= 3.8
Expand Down Expand Up @@ -521,19 +528,6 @@ Issues
======
Any issues, errors or bugs can be raised via the [Issues](https://github.com/amckenna41/pySAR/issues) tab in the repository.

Tests
=====
To run all tests, from the main `pySAR` repo folder run:
```
python3 -m unittest discover tests
```

To run tests for specific module, from the main `pySAR` repo folder run:
```
python -m unittest tests.MODULE_NAME -v
-v: verbose output flag
```

Contact
=======
If you have any questions or comments, please contact [email protected] or raise an issue on the [Issues][Issues] tab. <br><br>
Expand Down Expand Up @@ -579,4 +573,5 @@ DOI: 10.1021/acs.jcim.0c00073 <br><br>
[demo]: https://colab.research.google.com/drive/1hxtnf8i4q13fB1_2TpJFimS5qfZi9RAo?usp=sharing
[Issues]: https://github.com/amckenna41/pySAR/issues
[license]: https://github.com/amckenna41/pySAR/blob/master/LICENSE
[config]: https://github.com/amckenna41/pySAR/blob/master/CONFIG.md
[config]: https://github.com/amckenna41/pySAR/blob/master/CONFIG.md
[medium]: https://ajmckenna69.medium.com/pysar-a3de9f71733f
8 changes: 5 additions & 3 deletions TODO.md
Expand Up @@ -269,8 +269,10 @@ To Do List:
- [X] Add info about the columns and dimensions of each descriptors in pre-calculated csv file - fix Issue.
- [X] When calculating all descriptors (get_all_descriptors(export=True)), add some sort of print/tracking functionality.
- [X] Double check all links in readme.
- [ ] Add dimensions of each dataset to https://github.com/amckenna41/pySAR/tree/master/example_datasets.
- [X] Add dimensions of each dataset to https://github.com/amckenna41/pySAR/tree/master/example_datasets.
- [ ] Go over references in descriptors module - refer to protpy.
- [X] Update distance matrices in configs - test once protpy published.
- [ ] Add link to medium article.
- [X] Update aaindex version on readme.
- [X] Add link to medium article.
- [X] Update aaindex version on readme.
- [X] Add elapsed time for each case study - calculating protein descriptors on demo.
- [ ] readthedocs(https://github.com/MartinThoma/propy3/tree/master).
9 changes: 4 additions & 5 deletions example_datasets/README.md
Expand Up @@ -4,11 +4,10 @@ Datasets
--------
* `thermostability.txt` - dataset studied in the associated work which consists of a dataset to measure the thermostability of various mutants
from a recombination library designed from parental cytochrome P450's, measured using the T50 metric (temperature at which 50% of a protein is
irreversibly denatured after 10 mins of incubation, ranging from 39.2 to 64.4 degrees C), which represents the protein activity of this dataset [[1]](#references).
* `absorption.txt` - dataset of 80 blue and red-shifted protein variants of the Gloeobacter Violaceus Rhodopsin (GR) protein that were mutated and substituted to tune its peak absorption wavelength. 1-5 mutations were generated in the course of tuning its absorption wavelength, for a total of 81 sequences, with the peak being captured as each sequence's activity ranging from values of 454 to 622 [[2]](#references).
* `enantioselectivity.txt` - dataset consisting of 37 mutants and one WT (wild-type) sequence from the Aspergillus Niger organism and their calculated enantioselectivity. Enantioselectivity refers to the selectivity of a reaction towards one enantiomer and is expressed by the E-value with a range between 0 and 115 [[3]](#references).
* `localization.txt` - dataset made up of 248 sequences made up of 2 seperate, 10-block recombination libraries that were designed from 3 parental ChR's (channelrhodopsin). Each chimeric ChR variant in these libraries consist of blocks of sequences from parental ChRs. Genes for these sequences were synthesized and expressed in human embryonic kidney (HEK) cells, and their membrane localization was measured as log_GFP ranging from values of -9.513 to 105 [[4]](#references).

irreversibly denatured after 10 mins of incubation, ranging from 39.2 to 64.4 degrees C), which represents the protein activity of this dataset [[1]](#references). **Dataset dimensions: 261 x 5**.
* `absorption.txt` - dataset of 80 blue and red-shifted protein variants of the Gloeobacter Violaceus Rhodopsin (GR) protein that were mutated and substituted to tune its peak absorption wavelength. 1-5 mutations were generated in the course of tuning its absorption wavelength, for a total of 81 sequences, with the peak being captured as each sequence's activity ranging from values of 454 to 622 [[2]](#references). **Dataset dimensions: 81 x 5**.
* `enantioselectivity.txt` - dataset consisting of 37 mutants and one wild-type (WT) sequence from the Aspergillus Niger organism and their calculated enantioselectivity. Enantioselectivity refers to the selectivity of a reaction towards one enantiomer and is expressed by the E-value with a range between 0 and 115 [[3]](#references). **Dataset dimensions: 152 x 5**.
* `localization.txt` - dataset made up of 248 sequences made up of 2 separate, 10-block recombination libraries that were designed from 3 parental channelrhodopsin (ChRs). Each chimeric ChR variant in these libraries consist of blocks of sequences from parental ChRs. Genes for these sequences were synthesized and expressed in human embryonic kidney (HEK) cells, and their membrane localization was measured as log_GFP ranging from values of -9.513 to 105 [[4]](#references). **Dataset dimensions: 254 x 5**.
* `descriptors_absorption.csv` - pre-calculated protein descriptors using sequences from absorption test dataset. The dimensions for this csv are 81 x 9714 (81 protein sequences and 9714 features), when using default parameters as in the config file.
* `descriptors_enantioselectivity.csv` - pre-calculated protein descriptors using sequences from enantioselectivity test dataset. The dimensions for this csv are 152 x 9714 (152 protein sequences and 9714 features), when using default parameters as in the config file.
* `descriptors_localization.csv` - pre-calculated protein descriptors using sequences from localization test dataset. The dimensions for this csv are 254 x 9714 (254 protein sequences and 9714 features), when using default parameters as in the config file.
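
The dimensions quoted above (rows x columns) can be checked without loading a full dataframe. The following is a minimal sketch, not part of `pySAR`: the helper name `dataset_dimensions` is hypothetical, and it assumes the file is comma-delimited with a single header row that is not counted as a data row.

```python
import csv
import io


def dataset_dimensions(handle, delimiter=","):
    """Return (data_rows, columns) for a delimited text file.

    Hypothetical helper for illustration only. Assumes the first row is a
    header, so it is excluded from the row count.
    """
    rows = list(csv.reader(handle, delimiter=delimiter))
    if not rows:
        return (0, 0)
    # Column count is taken from the header row.
    return (len(rows) - 1, len(rows[0]))


# Small in-memory stand-in for a dataset file (made-up values).
sample = "sequence,T50,a,b,c\nMKV,55.0,1,2,3\nMKL,48.3,4,5,6\n"
rows, cols = dataset_dimensions(io.StringIO(sample))
print(f"{rows} x {cols}")
```

For a file on disk, the same function could be called with `open(path, newline="")` as the handle. Whether the descriptor CSVs carry an extra index column is not stated here, so the column count may differ by one from the figures above.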
2 changes: 1 addition & 1 deletion pySAR/__init__.py
@@ -1,6 +1,6 @@
""" pySAR software metadata. """
__name__ = 'pySAR'
__version__ = "2.4.2"
__version__ = "2.4.3"
__description__ = 'A Python package used to analysis Sequence Activity Relationships (SARs) of protein sequences and their mutants using Machine Learning.'
__author__ = 'AJ McKenna: https://github.com/amckenna41'
__authorEmail__ = '[email protected]'
5 changes: 3 additions & 2 deletions pySAR/descriptors.py
Expand Up @@ -104,6 +104,7 @@ def __init__(self, config_file="", protein_seqs=None, **kwargs):

self.config_file = config_file
self.protein_seqs = protein_seqs
self.kwargs = locals()['kwargs'] #get any keyword argument variables of class
self.config_parameters = {}

desc_config_filepath = ""
Expand Down Expand Up @@ -132,8 +133,8 @@ def __init__(self, config_file="", protein_seqs=None, **kwargs):
self.desc_parameters = Map(self.config_parameters["descriptors"])

#set dataset and descriptors csv filepath from kwargs, if applicable, or the config file values
self.dataset_filepath = kwargs.get('dataset_filepath') if 'dataset_filepath' in kwargs else self.dataset_parameters["dataset"]
self.descriptors_csv = kwargs.get('descriptors_csv') if 'descriptors_csv' in kwargs else self.desc_parameters.descriptors_csv
self.dataset_filepath = self.kwargs.get('dataset') if 'dataset' in self.kwargs else self.dataset_parameters["dataset"]
self.descriptors_csv = self.kwargs.get('descriptors_csv') if 'descriptors_csv' in self.kwargs else self.desc_parameters.descriptors_csv

#import protein sequences from dataset if not directly specified in protein_seqs input param
if not (isinstance(self.protein_seqs, pd.Series)):
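
The hunk above stores the keyword arguments on the instance and lets them take precedence over the config file values (note the key changes from `dataset_filepath` to `dataset`). A minimal sketch of that override pattern follows. The class name `DescriptorsSketch` and the plain-dict config layout are assumptions for illustration, not the actual `pySAR` implementation.

```python
class DescriptorsSketch:
    """Hypothetical stand-in for the Descriptors class (illustration only)."""

    def __init__(self, config_parameters, **kwargs):
        # Keep the keyword arguments so other methods can forward them.
        self.kwargs = kwargs

        # dict.get(key, default) behaves like the
        # "kwargs.get(key) if key in kwargs else default" form in the diff:
        # the kwarg wins when present, otherwise the config value is used.
        self.dataset_filepath = kwargs.get(
            "dataset", config_parameters["dataset"]["dataset"]
        )
        self.descriptors_csv = kwargs.get(
            "descriptors_csv", config_parameters["descriptors"]["descriptors_csv"]
        )


# Assumed config layout, mirroring the "dataset" and "descriptors" sections.
config = {
    "dataset": {"dataset": "thermostability.txt"},
    "descriptors": {"descriptors_csv": "descriptors_thermostability.csv"},
}

from_config = DescriptorsSketch(config)
overridden = DescriptorsSketch(config, dataset="absorption.txt")
print(from_config.dataset_filepath, overridden.dataset_filepath)
```

Storing `kwargs` on the instance is what allows a caller to pass the same overrides on with `**self.kwargs` when constructing another object.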
12 changes: 6 additions & 6 deletions pySAR/encoding.py
Expand Up @@ -200,7 +200,7 @@ def aai_encoding(self, aai_indices=None, sort_by='R2', output_folder=""):
#generate protein spectra from pyDSP class if use_dsp is true, pass in all DSP related parameters, use object as training data
if (self.use_dsp):
pyDSP = PyDSP(self.config_file, protein_seqs=encoded_seqs, spectrum=self.spectrum, window_type=self.window_type, filter_type=self.filter_type)
pyDSP.encode_seqs()
pyDSP.encode_sequences()
X = pd.DataFrame(pyDSP.spectrum_encoding)
else:
#aai index encoding set as training data
Expand Down Expand Up @@ -336,8 +336,8 @@ def descriptor_encoding(self, descriptors=[], desc_combo=1, sort_by='R2', output
mae_ = []
explained_var_ = []

#create instance of descriptors class using config file
desc = Descriptors(self.config_file)
#create instance of descriptors class using config file and any kwargs
desc = Descriptors(self.config_file, **self.kwargs)

#if no descriptors passed into descriptors input param then use all descriptors by default,
#get list of all descriptors according to desc_combo value
Expand Down Expand Up @@ -607,8 +607,8 @@ def aai_descriptor_encoding(self, aai_indices=[], descriptors=[], desc_combo=1,
if not (index in aaindex1.record_codes()):
raise ValueError("AAI record {} not found in list of available record codes.".format(index))

#create instance of Descriptors class
desc = Descriptors(config_file=self.config_file)
#create instance of Descriptors class using config file and any kwargs
desc = Descriptors(config_file=self.config_file, **self.kwargs)

#raise error if invalid parameter data types input
if ((not isinstance(descriptors, list)) and (not isinstance(descriptors, str))):
Expand Down Expand Up @@ -688,7 +688,7 @@ def aai_descriptor_encoding(self, aai_indices=[], descriptors=[], desc_combo=1,
#generate protein spectra from pyDSP class if use_dsp is true, pass in all DSP related parameters, use object as training data
if (self.use_dsp):
pyDSP = PyDSP(self.config_file, protein_seqs=encoded_seqs, spectrum=self.spectrum, window_type=self.window_type, filter_type=self.filter_type)
pyDSP.encode_seqs()
pyDSP.encode_sequences()
X_aai = pd.DataFrame(pyDSP.spectrum_encoding)
else:
X_aai = pd.DataFrame(encoded_seqs)