v2.0.0
Change Log
From v1.3.1 to v2.0.0
Fixes
- more robust error handling of invalid molecules in `MoleculeTable`
- Not all scorers in `supported_scoring` were actually working in the multi-class case. Scorer support is now divided into single-class and multi-class support (moved to `metrics.py`, see also New Features).
- Instead of all SMILES, only invalid SMILES are now printed to the log when they are removed.
- problems with PaDEL descriptors and fingerprints on Linux were fixed
- fixed serialization issues with `DataFrameDescriptorSet` and with saving and loading of MSA for PCM descriptor calculations
- the Papyrus adapter was fixed so that the quality and data set filtering options work properly (previously, only high quality Papyrus++ data was fetched no matter the options)
- previously, in some cases cross-validation splits might not have been shuffled during hyperparameter optimization and evaluation on cross-validation folds, which might have resulted in suboptimal cross-validation performance and bad choices of hyperparameters; a fix was made in b029e78
- `score_func` can now be set in `QSPRModel`.
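The cross-validation fix listed above concerns whether samples are shuffled before being assigned to folds. As a rough, library-independent illustration (this is not QSPRpred's code, and `kfold_indices` is a hypothetical helper written only for this sketch), the example below shows how unshuffled folds on class-sorted data end up with a single class per test fold, while a seeded shuffle draws each fold from the whole data set:

```python
import random


def kfold_indices(n_samples, n_folds, shuffle=False, seed=None):
    """Return a list of (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    if shuffle:
        # A seeded generator keeps the shuffled split reproducible.
        random.Random(seed).shuffle(indices)
    # Distribute the samples over the folds as evenly as possible.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


# Labels sorted by class, as can happen when data is exported class by class.
labels = [0] * 6 + [1] * 6

unshuffled = kfold_indices(len(labels), n_folds=2)
shuffled = kfold_indices(len(labels), n_folds=2, shuffle=True, seed=42)

for name, folds in (("unshuffled", unshuffled), ("shuffled", shuffled)):
    for _, test in folds:
        print(name, "test labels:", sorted(labels[i] for i in test))
```

Without shuffling, every test fold here contains only one class, so a model is always evaluated on a class it never saw during training. This is the kind of effect that can distort both validation scores and hyperparameter choices.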
Changes
- Hyperparameter optimization was moved out of `QSPRModel`: `QSPRModel.bayesOptimization` and `QSPRModel.gridSearch` are replaced by the separate classes `OptunaOptimization` and `GridSearchOptimization` in the new module `qsprpred.models.param_optimzation`, with a base class `HyperParameterOptimization` in `qsprpred.models.interfaces`.
- ⚠️ Important!⚠️ The `QSPRModel` attribute `model` is now called `estimator`, which is always an instance of `alg`, while `alg` may no longer be an instance but only a Type.
- Converting input data for `qsprpred.models.neural_network.Base` to dataloaders is now executed in the `fit` and `predict` functions instead of in the `qsprpred.deep.models.QSPRDNN` class.
- `MoleculeTable` now uses a custom index. When a `MoleculeTable` is created, a new column (`QSPRID`) is added (overwritten if already present), which is then used as the index of the underlying data frame.
  - It is possible to override this with a custom index by passing `index_cols` to the `MoleculeTable` constructor. These columns will then be used as the index, or as a multi-index if more than one column is passed.
  - Due to this change, `scaffoldsplit` now uses these IDs instead of unreliable SMILES strings (see documentation for the new API).
- If there are invalid molecules in `MoleculeTable`, `addDescriptors` now fails by default. You can disable this by passing `fail_on_invalid=False` to the method.
- To support multitask modelling, the representation of the target in the `QSPRdataset` has changed to a list of `TargetProperty`s (see New Features). These can be automatically initialized from dictionaries in the `QSPRdataset` init.
- A `fill_value` argument was also added to the `predict_CLI` script to allow for filling missing values in the prediction data set as well.
- ⚠️ Important!⚠️ `setup.py` and `setup.cfg` were substituted with `pyproject.toml` and `MANIFEST.in`. A lighter version of the package is now the default installation option!
  - Installation options for the optional dependencies are described in README.md
  - CI scripts were modified to test the package on the full version. See changes in `.gitlab-ci.yml`.
  - Features using the extra dependencies were moved to the `qsprpred.extra` and `qsprpred.deep` subpackages. The structure of the subpackages is the same as that of the main package, so you just need to remember to use `qsprpred.extra` or `qsprpred.deep` instead of just `qsprpred` in your imports if you were using these features from the main package before.
- The way descriptors are stored in `MoleculeTable` was changed. They now reside in their own `DescriptorTable` instances that are linked to the original `MoleculeTable`.
  - This change was made to allow several types of descriptors to be calculated and used efficiently (facilitated by the `DescriptorsCalculators` interface).
  - Unfortunately, this change is not backwards compatible, so previously pickled `MoleculeTable` instances will not work with this version. There were also changes to how models handle multiple descriptor types, which also makes them incompatible with previous versions. However, this can be fixed by modifying the old JSON files as illustrated in commits 7d3f863 and 6564f02.
- `LowVarianceFilter` now includes the boundary in the filtered features, e.g. if the threshold is 0.1, features that have a variance of exactly 0.1 will also be removed.
- Added the ExtendedValenceSignature molecular descriptor based on Jean-Loup Faulon's work.
- removed the default parameter setting (`max_iter` of 10000) for scikit-learn SVC and SVR.
- added `matthews_corrcoef` to the supported metrics for binary classification.
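The inclusive boundary described for `LowVarianceFilter` above can be pictured with a small standalone sketch. This is not the package's implementation: `low_variance_columns` is a hypothetical helper, and the use of population variance is an assumption made for the example. A threshold of 0.25 is used instead of 0.1 so that the boundary case is exact in floating point.

```python
from statistics import pvariance


def low_variance_columns(columns, threshold):
    """Return the names of columns whose population variance is <= threshold.

    The comparison is inclusive, so a column sitting exactly on the
    threshold is reported for removal as well.
    """
    return [
        name for name, values in columns.items()
        if pvariance(values) <= threshold
    ]


features = {
    "constant": [1.0, 1.0, 1.0, 1.0],  # variance 0.0
    "boundary": [0.0, 0.0, 1.0, 1.0],  # variance 0.25, exactly on the threshold
    "variable": [0.0, 1.0, 2.0, 3.0],  # variance 1.25
}

removed = low_variance_columns(features, threshold=0.25)
kept = [name for name in features if name not in removed]
print("removed:", removed)
print("kept:", kept)
```

With a strict comparison (`<`), the `boundary` column would have been kept. With the inclusive comparison it is removed together with the constant column.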
New Features
- New feature split `ManualSplit` for splitting data by a user-defined column
- The index of the `MoleculeTable` can now be used to relate cross-validation and test outputs to the original molecules. Therefore, the index is now also saved in the model training outputs.
- the `Papyrus.getData()` method now accepts the `activity_types` parameter to select a list of activity types to get.
- Added the `checkMols` method to `MoleculeTable` to use for indication of invalid molecules in the data.
- Support for Sklearn Multitask modelling
- New abstract base class `Metric`, which allows for creating custom scorers.
- Subclass `SklearnMetric` of the `Metric` class. This class wraps the sklearn metrics to allow for checking the compatibility of each Sklearn scoring function with the `QSPRSklearn` model type.
- New class `TargetProperty`. To allow for multitask modelling, a `QSPRdataset` has to have the option of multiple target properties. To support this, a target property is now defined separately from the dataset as a `TargetProperty` instance, which holds the information on name, `TargetTask` (see also Changes) and threshold of the property.
- Support for protein descriptors and PCM modeling was added.
  - The `PCMDataSet` class was introduced that extends `QSPRDataset` and adds the `addProteinDescriptors` method, which can be used to calculate protein descriptors by linking information from the table with sequencing data.
- Support for precalculated descriptors was added with the `addCustomDescriptors` method of `MoleculeTable`.
  - It allows for adding precalculated descriptors to the `MoleculeTable` by linking the information from the table with external precalculated descriptors.
- The tutorial was improved with more detailed sections on data preparation and PCM modelling added.
- We agreed on and adopted a style guide for contributions to the package. This is described and exemplified in `docs/style_guide.py`. This is also supported by several development tools that were configured to check and automatically format the code. Instructions are included in the example file as well.
- Style guide implemented. As a consequence, some classes, methods, and attributes were renamed to comply with the style guide. Some changes were:
  - Fingerprint functions from RDKit are now also implemented. For checking the available fingerprints in qsprpred, the user can now access `AVAIL_FPS` through `from qsprpred.data.utils.descriptor_utils.fingerprints import AVAIL_FPS`.
  - `Fingerprint` abstract base class now moved to `qsprpred.data.utils.descriptor_utils.interfaces`.
  - Instance attributes are now written in camelCase, and method arguments are snake_case. As an example of this, the old parameter `descsets` from `MoleculeDescriptorsCalculator` is now renamed as `desc_sets`, stored as the attribute `self.descSets`. Functions are still written in snake_case.
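To give a feel for what an abstract scorer base class with custom subclasses can look like, here is a minimal sketch using only the standard library. It does not reproduce the package's `Metric` interface: the names `SketchMetric` and `BinaryMCC` and the `__call__(y_true, y_pred)` signature are assumptions chosen for illustration. The concrete scorer computes the Matthews correlation coefficient for binary labels encoded as 0 and 1.

```python
from abc import ABC, abstractmethod
from math import sqrt


class SketchMetric(ABC):
    """Hypothetical stand-in for an abstract scorer interface."""

    @abstractmethod
    def __call__(self, y_true, y_pred):
        """Return a score for the predictions given the true labels."""


class BinaryMCC(SketchMetric):
    """Matthews correlation coefficient for labels encoded as 0 and 1."""

    def __call__(self, y_true, y_pred):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        # Return 0.0 when the coefficient is undefined (zero denominator).
        return (tp * tn - fp * fn) / denominator if denominator else 0.0


scorer = BinaryMCC()
score = scorer([1, 1, 0, 0], [1, 0, 0, 0])
print(round(score, 4))
```

Because `SketchMetric` declares `__call__` as abstract, instantiating it directly raises a `TypeError`, which forces every custom scorer to define how it is computed.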