Feature/scikit fingerprints #364

Hrovatin · 2024-09-04T04:40:09Z

See #359

CLAassistant · 2024-09-04T04:40:14Z

All committers have signed the CLA.

baybe/parameters/enum.py

Scienfitz

first round of comments

CONTRIBUTORS.md

baybe/_optional/info.py

baybe/parameters/substance.py

Scienfitz · 2024-09-05T06:31:26Z

baybe/parameters/substance.py

+ comp_df = chemistry.smiles_to_fingerprint_features(
+ vals,
+ fingerprint_encoder=AVAILABLE_SKFP_FP[self.encoding.name](),
+ prefix=pref,


I think we should add optional parameters here that might affect the fingerprint,
for instance bit length and radius.

These could become attributes of the SubstanceParameter class.

They should simply be ignored if a fingerprint is selected that does not use this info.

added kwargs, up to user to specify them

looks reasonable to me

did you add a test? Should perhaps add a parameterized test with 2 different kwargs x 3 fingerprints or similar

Added to test_fingerprints.py, which tests a few different params passed for fp computation. All FP names are tested in the hypothesis parameters strategy substance_parameters.
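The "ignore options a fingerprint does not use" idea raised above could be sketched as follows. This is only an illustration: FakeECFP, FakeMACCS and build_encoder are hypothetical stand-ins, not scikit-fingerprints or BayBE code, and the thread itself settled on passing kwargs through as specified by the user.

```python
import inspect


class FakeECFP:
    """Hypothetical stand-in for a fingerprint class with size and radius options."""

    def __init__(self, fp_size: int = 2048, radius: int = 2):
        self.fp_size = fp_size
        self.radius = radius


class FakeMACCS:
    """Hypothetical stand-in for a fingerprint class without tunable options."""

    def __init__(self):
        pass


def build_encoder(encoder_cls, **kwargs):
    """Instantiate encoder_cls, silently dropping kwargs it does not accept."""
    accepted = inspect.signature(encoder_cls).parameters
    usable = {key: value for key, value in kwargs.items() if key in accepted}
    return encoder_cls(**usable)


# The same kwargs work for both classes; the second one simply ignores them.
ecfp = build_encoder(FakeECFP, fp_size=1024, radius=4)
maccs = build_encoder(FakeMACCS, fp_size=1024, radius=4)
```

The trade-off is that silently dropping options can hide typos in kwarg names, which is one argument for leaving the choice to the user instead.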

baybe/parameters/substance.py

examples/Backtesting/full_lookup.py

examples/Basics/campaign.py

mypy.ini

tests/conftest.py

pyproject.toml

Scienfitz · 2024-09-05T07:21:04Z

For the future: I think there's no need for the fork. I added you as a member, and I think working directly on branches here has some advantages, such as others being able to check out your branch seamlessly.

CONTRIBUTORS.md

Co-authored-by: Martin Fitzner <[email protected]>

Hrovatin · 2024-09-06T06:28:59Z

tests/simulate_telemetry.py

@@ -55,9 +55,11 @@
 }

 parameters = [


Add tests for all fingerprints as some are not tested yet

@Scienfitz do you think we need to adapt this, or would what I added to test_fingerprints.py be enough, i.e. just testing that all of them can be run from SMILES to embedding?

I'm a bit confused about the comment being on this file, simulate_telemetry.

As for fingerprint parameterization: once you have written a test, you can simply parameterize it via pytest. For instance, I can imagine a test looping over i) all available fingerprints, i.e. parameterized via SubstanceEncoding so that it automatically always runs over all valid choices with default kwargs, PLUS ii) some selected fingerprints like ECFP run with special kwargs to also test different nbits and radii.

i) SubstanceEncoding values are sampled in hypothesis_strategies/parameters.py
ii) Added a few tests with different fp kwargs in tests/test_fingerprints.py, done on the ECFP example

…atin/baybe into feature/scikit_fingerprints

Hrovatin · 2024-09-06T12:39:09Z

@Scienfitz I made a first round of changes covering all of the above.
What is still missing is figuring out where NotEnoughPointsLeftError originates from.

EDIT:
The NotEnoughPointsLeftError is also resolved now, was associated with test param naming

AdrianSosic

Hi @Hrovatin, thanks for taking care of this item. I haven't had the time yet to go fully over it (and probably won't before I leave), but here are at least a few points from my first glance.

examples/Backtesting/full_lookup.py

baybe/_optional/info.py

AdrianSosic · 2024-09-06T14:42:26Z

baybe/_optional/chem.py

- from rdkit import Chem, RDLogger
- from rdkit.Chem.rdMolDescriptors import GetMorganFingerprintAsBitVect
+ from rdkit import Chem
+ from skfp import fingerprints as skfp_fingerprints


I think the renaming should rather happen in the importing file, not here

why is this even renamed?

I added it just to be clear where they are coming from, can remove renaming

CONTRIBUTORS.md

baybe/parameters/substance.py

AdrianSosic · 2024-09-06T14:51:22Z

baybe/parameters/substance.py

+ converter=lambda x: (
+ # Passed enum
+ x
+ if isinstance(x, SubstanceEncoding)
+ # Passed enum name
+ else (
+ SubstanceEncoding[x]
+ if x in SubstanceEncoding.__members__
+ # Passed enum value
+ else SubstanceEncoding(x)
+ )
+ ),


I haven't yet looked at the logic in detail, but this is by far the longest converter I've seen 😄 There must be a more elegant / readable way.

I think the others are shorter because for those enums the names equal the values, but this enum currently has different names and values, which requires these checks. But feel free to suggest how to adapt it.

Perhaps try letting ChatGPT compress this to the max, but overall I don't think this is terribly long (the situation is simply more complex than with the converters used elsewhere with enums).
other than that feel free to resolve

not relevant any more as enum and hence this field was changed
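For reference, the nested lambda discussed above can be written as a small named function. This is a minimal sketch on a hypothetical two-member enum (the ECFPFingerprint member and the to_encoding name are made up for illustration), not the converter that ended up in the code base.

```python
from enum import Enum


class Encoding(Enum):
    """Hypothetical enum whose member names differ from their values."""

    AtomPairFingerprint = "ATOMPAIR"
    ECFPFingerprint = "ECFP"  # illustrative member, not taken from the PR


def to_encoding(x):
    """Accept an enum member, a member name, or a member value."""
    if isinstance(x, Encoding):
        return x  # passed enum
    if x in Encoding.__members__:
        return Encoding[x]  # passed enum name
    return Encoding(x)  # passed enum value; raises ValueError if unknown


by_member = to_encoding(Encoding.AtomPairFingerprint)
by_name = to_encoding("AtomPairFingerprint")
by_value = to_encoding("ATOMPAIR")
```

All three calls resolve to the same member, which is the behavior the three branches of the lambda were meant to provide.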

AdrianSosic · 2024-09-06T14:52:57Z

baybe/parameters/enum.py

 class SubstanceEncoding(ParameterEncoding):
 """Available encodings for substance parameters."""

- MORDRED = "MORDRED"
- """Encoding based on Mordred chemical descriptors."""
+ AtomPairFingerprint = "ATOMPAIR"


Usually, we use UPPER_CASE writing for enum values (see other encodings, for example). Also, not sure if the "Fingerprint" at the end is really necessary? If possible, I'd rather sync the enum field name with its value

I wanted the enum names to match the class names used by scikit-fingerprints and the enum values to match the parameter strings currently in use, so that SubstanceParameter can convert them directly. I can probably switch the two (but need to test whether that breaks something else).

If this correspondence is required, it's fine for me.

Now the SubstanceEncoding enum follows this rule, but there is a new enum with a mapping to skfp classes (added for the new helper function that converts fp names to cls+params).

Co-authored-by: AdrianSosic <[email protected]>

Scienfitz

I would encourage you to write one or two more tests to really check the new functionality, like kwargs, deprecation etc.
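A deprecation check of the kind requested here might look roughly like the sketch below. Everything in it is assumed for illustration: the alias table, the resolve_encoding helper, the choice of ECFP as the replacement name and the default kwargs are hypothetical and do not reflect the actual BayBE implementation.

```python
import warnings
from typing import Optional

# Hypothetical alias table: deprecated name -> (replacement name, default kwargs).
# The replacement and the default values are assumptions made for this sketch.
_DEPRECATED = {"MORGAN_FP": ("ECFP", {"fp_size": 1024, "radius": 4})}


def resolve_encoding(name: str, kwargs: Optional[dict] = None):
    """Map a possibly deprecated encoding name to its replacement and kwargs."""
    kwargs = dict(kwargs or {})
    if name in _DEPRECATED:
        new_name, defaults = _DEPRECATED[name]
        warnings.warn(
            f"'{name}' is deprecated, use '{new_name}' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # User-supplied kwargs take precedence over the alias defaults.
        kwargs = {**defaults, **kwargs}
        name = new_name
    return name, kwargs


# Record warnings so the deprecation can be asserted on.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved_name, resolved_kwargs = resolve_encoding("MORGAN_FP", {"radius": 2})
```

In a pytest suite the same check would typically use pytest.warns(DeprecationWarning) around the call.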

baybe/_optional/info.py

baybe/parameters/enum.py

CHANGELOG.md

Scienfitz · 2024-09-09T18:02:36Z

baybe/parameters/enum.py

 class SubstanceEncoding(ParameterEncoding):
 """Available encodings for substance parameters."""

- MORDRED = "MORDRED"
- """Encoding based on Mordred chemical descriptors."""
+ AtomPairFingerprint = "ATOMPAIR"


If this correspondence is required, it's fine for me.

baybe/parameters/substance.py

tests/hypothesis_strategies/parameters.py

baybe/utils/chemistry.py

Scienfitz · 2024-09-10T11:52:42Z

@Hrovatin It would be great if, as a final sanity check, you could run the examples/Backtesting/full_lookup.py example and post the resulting picture. I've just done so on your fork and there are two observations:

i) results improve, which is fantastic
ii) there's also something super weird with the rdkit curve

can you confirm?

…atin/baybe into feature/scikit_fingerprints

Co-authored-by: Martin Fitzner <[email protected]>

…atin/baybe into feature/scikit_fingerprints

CHANGELOG.md

baybe/_optional/info.py

baybe/parameters/enum.py

docs/userguide/parameters.md

tests/test_fingerprints.py

baybe/parameters/enum.py

Scienfitz · 2024-09-26T17:34:01Z

@Hrovatin once the open comments are tended to, we can mark this PR as ready for review.

As a last step before doing so: please i) make a copy of this branch as a backup and then ii) merge main into this to pick up all the other changes that came into the repo since you started the fork. From the looks of it, there don't seem to be many conflicts, so that merge should be ok. Due to the amount of changes I think rebasing is not an option, but if you make backup branches you could also try that.

Co-authored-by: Martin Fitzner <[email protected]>

Hrovatin · 2024-10-04T07:02:16Z

@Scienfitz for RDKit there is some ruggedness where it flattens off.

I checked the encoding statistics, and RDKit does not seem to differ from ECFP in their distribution.

N of features:

Here are mean and variance for each feature:

…/scikit_fingerprints

Hrovatin · 2024-10-04T07:50:40Z

@Scienfitz I made the merge. Please check comments that are still open and lmk if I should change sth or close them if ok with you

Hrovatin · 2024-10-04T08:32:14Z

With tox -e fulltest-py312 I get the following error. Any ideas?

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <linear_operator.operators.matmul_linear_operator.MatmulLinearOperator object at 0x374c8f770>, eigenvectors = True, return_evals_as_lazy = False

    def _symeig(
        self: Float[LinearOperator, "*batch N N"],
        eigenvectors: bool = False,
        return_evals_as_lazy: Optional[bool] = False,
    ) -> Tuple[Float[Tensor, "*batch M"], Optional[Float[LinearOperator, "*batch N M"]]]:
        r"""
        Method that allows implementing special-cased symeig computation. Should not be called directly
        """
        from linear_operator.operators.dense_linear_operator import DenseLinearOperator
    
        if settings.verbose_linalg.on():
            settings.verbose_linalg.logger.debug(f"Running symeig on a matrix of size {self.shape}.")
    
        # potentially perform decomposition in double precision for numerical stability
        dtype = self.dtype
>       evals, evecs = torch.linalg.eigh(self.to_dense().to(dtype=settings._linalg_dtype_symeig.value()))
E       torch._C._LinAlgError: linalg.eigh: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 2).

.tox/fulltest-py312/lib/python3.12/site-packages/linear_operator/operators/_linear_operator.py:892: _LinAlgError

The above exception was the direct cause of the following exception:

campaign = Campaign(searchspace=SearchSpace(discrete=SubspaceDiscrete(parameters=(CategoricalParameter(name='Categorical_1', _val... A            OK         2.0   22.748903        1    1.0, _cached_recommendation=Empty DataFrame
Columns: []
Index: [])
n_iterations = 3, batch_size = 3

    @pytest.mark.slow
    @pytest.mark.parametrize(
        "kernel", valid_kernels, ids=[c.__class__ for c in valid_kernels]
    )
    @pytest.mark.parametrize("n_iterations", [3], ids=["i3"])
    def test_kernels(campaign, n_iterations, batch_size):
>       run_iterations(campaign, n_iterations, batch_size)

tests/test_iterations.py:243: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

rs = <RetryCallState 13644971472: attempt #5; slept for 0.0; last result: failed (_LinAlgError linalg.eigh: (Batch element ... failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 2).)>

    def exc_check(rs: "RetryCallState") -> None:
        fut = t.cast(Future, rs.outcome)
        retry_exc = self.retry_error_cls(fut)
        if self.reraise:
            raise retry_exc.reraise()
>       raise retry_exc from fut.exception()
E       tenacity.RetryError: RetryError[<Future at 0x374c696d0 state=finished raised _LinAlgError>]

Scienfitz · 2024-10-04T08:40:55Z

@Hrovatin
The example picture is off: the curve for RDKIT is wrong, and it seems the batch size was influenced... Something very strange is going on here. I noticed this in one of my very earliest tests too. Before looking at the ruggedness etc., the curve needs to be fixed.

You can ignore the _LinAlgError, it is one of the exceptions. Some of our parameter combinations sometimes seem to cause numerical instabilities that fail even after several repeats with different random seeds. This happens randomly; you can ignore it for the purpose of this PR (it is often fixed by re-running the failed job anyway).

A second error you can ignore in the CI is that serializations sometimes seem to fail, without a good indication as to why. This does not happen locally in clean environments and is likely an artifact of packages, the agent etc. If it happens in your PR, ignore it.
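The retry behavior visible in the traceback (five attempts, then a RetryError chained to the last _LinAlgError) can be illustrated with a plain-Python stand-in. This is a sketch only: run_with_retries and flaky_fit are hypothetical names, ArithmeticError stands in for the numerical failure, and the actual test suite relies on tenacity rather than a hand-written loop.

```python
import random


def run_with_retries(fn, max_attempts=5, base_seed=0):
    """Call fn(rng) with a fresh seed per attempt; raise after the last failure."""
    last_exc = None
    for attempt in range(max_attempts):
        rng = random.Random(base_seed + attempt)
        try:
            return fn(rng)
        except ArithmeticError as exc:  # stand-in for a numerical failure
            last_exc = exc
    # Chain the final error to the last underlying failure, as in the traceback.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_exc


calls = []


def flaky_fit(rng):
    """Hypothetical model fit that fails on its first two calls."""
    calls.append(rng.random())
    if len(calls) < 3:
        raise ArithmeticError("matrix is ill-conditioned")
    return "fitted"


result = run_with_retries(flaky_fit)
```

If every attempt fails, the caller sees the outer error with the numerical failure attached as its cause, which matches the "direct cause of the following exception" structure in the log.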

Hrovatin · 2024-10-04T11:34:13Z

> @Hrovatin the example picture is off, the curve for RDKIT is wrong, seems the batch size was influenced... something very strange is going on here. I've noticed this in one of my very earliest tests too, before looking at the ruggedness etc. the curve needs to be fixed

When just running SubstanceParameter(name=chem_type, data=data, encoding=encoding).comp_df with RDKit and the molecules from the example the shape of samples in the output df is correct. Any other ideas where this could be coming from?

Hrovatin added 2 commits September 3, 2024 16:57

replace fingeprints with scikit-fingerprints package

c3804c1

add myself to contributors

d6f56d2

Hrovatin added 2 commits September 4, 2024 07:46

fix mypy ingnore imports

d986512

attempt to fix enum, not resolved

ec293b2

AdrianSosic reviewed Sep 4, 2024

baybe/parameters/enum.py (outdated, resolved)

AdrianSosic added the enhancement and new feature labels Sep 4, 2024

Scienfitz reviewed Sep 5, 2024

Scienfitz assigned Hrovatin Sep 5, 2024

AVHopp reviewed Sep 5, 2024

CONTRIBUTORS.md (resolved)

Update CONTRIBUTORS.md

630d896

Co-authored-by: Martin Fitzner <[email protected]>

Hrovatin commented Sep 6, 2024

Hrovatin added 2 commits September 6, 2024 14:36

review 1

bb19b0e

Merge branch 'feature/scikit_fingerprints' of https://github.com/Hrov…

971efad

…atin/baybe into feature/scikit_fingerprints

Hrovatin added 4 commits September 6, 2024 14:57

fix test param naming that caused NotEnoughPointsLeftError

d5809a7

update changelog

f290853

Add parameters test for FP encoding aliases

c0d16a3

update imports

a645647

AdrianSosic reviewed Sep 6, 2024

Update CONTRIBUTORS.md

e1d4c0f

Co-authored-by: AdrianSosic <[email protected]>

Scienfitz requested changes Sep 9, 2024

Hrovatin added 5 commits September 17, 2024 11:05

comments and typos

8161cb9

Merge branch 'feature/scikit_fingerprints' of https://github.com/Hrov…

523dd19

…atin/baybe into feature/scikit_fingerprints

add fingeprint generation test

a8a00fe

adapt header on package availability

218c501

change field default from dict obj to factory

6c3c489

Hrovatin and others added 12 commits September 17, 2024 14:57

deprecate morgan fp

82de6d1

shorten changelog

068f08c

test deprecated FP name and that it warns about deprecation

a469caa

add a few popular fingeprint examples to user guide

71375b3

add fingerprint kwargs example

ce57ef5

Update baybe/utils/chemistry.py

378b551

Co-authored-by: Martin Fitzner <[email protected]>

Update baybe/utils/chemistry.py

619b87e

Co-authored-by: Martin Fitzner <[email protected]>

Update baybe/utils/chemistry.py

0f0a627

Co-authored-by: Martin Fitzner <[email protected]>

Merge branch 'feature/scikit_fingerprints' of https://github.com/Hrov…

9cd467b

…atin/baybe into feature/scikit_fingerprints

move kwargs handlig to top

7f1b3c7

rename smiles to mol as may be str or mol obj

b47209e

test for fp embedding size

7c21c4f

Scienfitz reviewed Sep 22, 2024

Hrovatin and others added 10 commits September 30, 2024 11:08

Update CHANGELOG.md

797616e

Co-authored-by: Martin Fitzner <[email protected]>

Update CHANGELOG.md

80eb43a

Co-authored-by: Martin Fitzner <[email protected]>

Update baybe/parameters/enum.py

3ac8598

Co-authored-by: Martin Fitzner <[email protected]>

Update docs/userguide/parameters.md

b0759c4

Co-authored-by: Martin Fitzner <[email protected]>

Update docs/userguide/parameters.md

3f34435

Co-authored-by: Martin Fitzner <[email protected]>

add default kwargs to morgan fp

d8903cc

add Morgan_FP deprecation

bf29920

docs conformer kwargs

b2960c6

remove n_jobs from example on single mol

f5bbea6

test fingerprint computation function

59ede88

Merge branch 'main' of https://github.com/emdgroup/baybe into feature…

015f185

…/scikit_fingerprints
