Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/scikit fingerprints #364

Draft
wants to merge 40 commits into
base: main
Choose a base branch
from

Conversation

Hrovatin
Copy link
Collaborator

@Hrovatin Hrovatin commented Sep 4, 2024

See #359

@CLAassistant
Copy link

CLAassistant commented Sep 4, 2024

CLA assistant check
All committers have signed the CLA.

@AdrianSosic AdrianSosic added enhancement Expand / change existing functionality new feature New functionality labels Sep 4, 2024
Copy link
Collaborator

@Scienfitz Scienfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first round of comments

CONTRIBUTORS.md Outdated Show resolved Hide resolved
baybe/_optional/info.py Outdated Show resolved Hide resolved
baybe/parameters/substance.py Outdated Show resolved Hide resolved
comp_df = chemistry.smiles_to_fingerprint_features(
vals,
fingerprint_encoder=AVAILABLE_SKFP_FP[self.encoding.name](),
prefix=pref,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add optional paramters here that might affect the fingerprint
for isntance bit length and radius

these could become attributes of the substanceparameter class

they should simply be ignored if a fingperirnt is selected that does not use this info

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added kwargs, up to user to specify them

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable to me

did you add a test? Should perhaps add a parameterized test with 2 different kwargs x 3 fingerprints or similar

Copy link
Collaborator Author

@Hrovatin Hrovatin Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added to test_fingeprints.py - test a few different params passed for fp computation. All FP names are tested in hypothesis parameters strategy substance_parameters

baybe/parameters/substance.py Outdated Show resolved Hide resolved
examples/Backtesting/full_lookup.py Outdated Show resolved Hide resolved
examples/Basics/campaign.py Outdated Show resolved Hide resolved
mypy.ini Outdated Show resolved Hide resolved
tests/conftest.py Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
@Scienfitz
Copy link
Collaborator

For the future: I think theres no need for the fork, I added you as member, I think working directly on branches here has some advantages such that others can check out your branch seamlessly

CONTRIBUTORS.md Show resolved Hide resolved
Co-authored-by: Martin Fitzner <[email protected]>
@@ -55,9 +55,11 @@
}

parameters = [
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests for all fingerprints as some are not tested yet

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Scienfitz do you think we need to adapt this or would be what I added to test_fingerprints.py enough - just testing that all can be run from smiles to embedding?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit confused about the commen being on this fiel simulate_telemetry

as for fingerprint parameterization: once you wrote a test you can simply parameterize it via pytest. For instance I can imagine a tets looping over i) all fingperrints available, ie parameterized via SubstanceEncoding so it automatically always runs over all valdi choices - with default kwargs etc PLUS ii) some selected fps like ECFP run with special kwargs to also test different nbits and radii

Copy link
Collaborator Author

@Hrovatin Hrovatin Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i) SubstanceEncoding values are samples in hypothesis_strategies/parameters.py
ii) Added a few tests with different fp kwargs in tests/test_fingeprints.pt - done on ECFP example

@Hrovatin
Copy link
Collaborator Author

Hrovatin commented Sep 6, 2024

@Scienfitz I made first round of changes covering all of the above
What is still missing is figuring ot where NotEnoughPointsLeftError originates from

EDIT:
The NotEnoughPointsLeftError is also resolved now, was associated with test param naming

Copy link
Collaborator

@AdrianSosic AdrianSosic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Hrovatin, thanks for taking care of this item. Haven't had the time yet to go fully over it (and probably won't before I leave). But here at least a few points from my first glance

examples/Backtesting/full_lookup.py Outdated Show resolved Hide resolved
baybe/_optional/info.py Outdated Show resolved Hide resolved
from rdkit import Chem, RDLogger
from rdkit.Chem.rdMolDescriptors import GetMorganFingerprintAsBitVect
from rdkit import Chem
from skfp import fingerprints as skfp_fingerprints
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the renaming should rather happen in the importing file, not here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this even renamed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it just to be clear where they are coming from, can remove renaming

CONTRIBUTORS.md Outdated Show resolved Hide resolved
baybe/parameters/substance.py Outdated Show resolved Hide resolved
Comment on lines 61 to 72
converter=lambda x: (
# Passed enum
x
if isinstance(x, SubstanceEncoding)
# Passed enum name
else (
SubstanceEncoding[x]
if x in SubstanceEncoding.__members__
# Passed enum value
else SubstanceEncoding(x)
)
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't yet quite looked in detail at the logic, but this by far the longest converter I've seen 😄 There must be a more elegant / readable way

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the others are shorter for enums as names equal values, but this enum has currently different names and values, which requires these checks. But feel free to suggest how to adapt

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps have a try to let chatgpt compress this to the max, but overall I dont think this is terribly long (the situation is simply more complex as with converters used elsewhere with enums)
other than that feel free to resolve

Copy link
Collaborator Author

@Hrovatin Hrovatin Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not relevant any more as enum and hence this field was changed

class SubstanceEncoding(ParameterEncoding):
"""Available encodings for substance parameters."""

MORDRED = "MORDRED"
"""Encoding based on Mordred chemical descriptors."""
AtomPairFingerprint = "ATOMPAIR"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, we use UPPER_CASE writing for enum values (see other encodings, for example). Also, not sure if the "Fingerprint" at the end is really necessary? If possible, I'd rather sync the enum field name with its value

Copy link
Collaborator Author

@Hrovatin Hrovatin Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to map parameters (currently) enum values to class names used for scikit-fingerprint (enum names) so that they can be directly converted by SubstanceParameter. I can probably switch the two (but need to test if that breaks sth else).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this correspondence is required its fine for me

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the SubstanceEncoding enum follows this rule, but there is a new enum with mapping to skfp classes (added for the new helper function that converts fp names to cls+params

Co-authored-by: AdrianSosic <[email protected]>
Copy link
Collaborator

@Scienfitz Scienfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would encourage you to perhaps write one or two more tests to really check the new functionality like kwargs, deprecation etc

baybe/_optional/info.py Show resolved Hide resolved
baybe/parameters/enum.py Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
class SubstanceEncoding(ParameterEncoding):
"""Available encodings for substance parameters."""

MORDRED = "MORDRED"
"""Encoding based on Mordred chemical descriptors."""
AtomPairFingerprint = "ATOMPAIR"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this correspondence is required its fine for me

baybe/parameters/substance.py Outdated Show resolved Hide resolved
tests/hypothesis_strategies/parameters.py Outdated Show resolved Hide resolved
baybe/utils/chemistry.py Outdated Show resolved Hide resolved
baybe/utils/chemistry.py Outdated Show resolved Hide resolved
baybe/utils/chemistry.py Outdated Show resolved Hide resolved
baybe/utils/chemistry.py Outdated Show resolved Hide resolved
@Scienfitz
Copy link
Collaborator

@Hrovatin It would be great if, as a final sanity check, you could run the examples/backtesting/full_lookup.py example and post the resulting picture. Ive just done so on your fork and there are two observations

i) results improve which is fantastic
ii) theres also somethign super weird with the rdkit curve

can you confirm?

CHANGELOG.md Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
baybe/_optional/info.py Show resolved Hide resolved
baybe/parameters/enum.py Outdated Show resolved Hide resolved
docs/userguide/parameters.md Outdated Show resolved Hide resolved
docs/userguide/parameters.md Show resolved Hide resolved
docs/userguide/parameters.md Show resolved Hide resolved
tests/test_fingerprints.py Outdated Show resolved Hide resolved
baybe/parameters/enum.py Outdated Show resolved Hide resolved
@Scienfitz
Copy link
Collaborator

@Hrovatin once the open comments are tended to we can mark this PR as ready for review

As last step before doing so: please i) make a copy of this branch as backup and then ii) merge main into this to update all the other changes that came into the repo since you started the fork. form the looks of it it doesnt seem many conflcits so that merge should be ok. Due to the amount of changes I thing rebasing is not an option but if you make backup branches you could also try that

@Hrovatin
Copy link
Collaborator Author

Hrovatin commented Oct 4, 2024

@Scienfitz for RDKIt there is some ruggedness where it flattens off
image

I checked the encoding statistics and RDKit does not seem to differ from ECFP in their distn.

N of features:

image

Here are mean and variance for each feature:
image
image

@Hrovatin
Copy link
Collaborator Author

Hrovatin commented Oct 4, 2024

@Scienfitz I made the merge. Please check comments that are still open and lmk if I should change sth or close them if ok with you

@Hrovatin
Copy link
Collaborator Author

Hrovatin commented Oct 4, 2024

With tox -e fulltest-py312 I get - any ideas?

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <linear_operator.operators.matmul_linear_operator.MatmulLinearOperator object at 0x374c8f770>, eigenvectors = True, return_evals_as_lazy = False

    def _symeig(
        self: Float[LinearOperator, "*batch N N"],
        eigenvectors: bool = False,
        return_evals_as_lazy: Optional[bool] = False,
    ) -> Tuple[Float[Tensor, "*batch M"], Optional[Float[LinearOperator, "*batch N M"]]]:
        r"""
        Method that allows implementing special-cased symeig computation. Should not be called directly
        """
        from linear_operator.operators.dense_linear_operator import DenseLinearOperator
    
        if settings.verbose_linalg.on():
            settings.verbose_linalg.logger.debug(f"Running symeig on a matrix of size {self.shape}.")
    
        # potentially perform decomposition in double precision for numerical stability
        dtype = self.dtype
>       evals, evecs = torch.linalg.eigh(self.to_dense().to(dtype=settings._linalg_dtype_symeig.value()))
E       torch._C._LinAlgError: linalg.eigh: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 2).

.tox/fulltest-py312/lib/python3.12/site-packages/linear_operator/operators/_linear_operator.py:892: _LinAlgError

The above exception was the direct cause of the following exception:

campaign = Campaign(searchspace=SearchSpace(discrete=SubspaceDiscrete(parameters=(CategoricalParameter(name='Categorical_1', _val... A            OK         2.0   22.748903        1    1.0, _cached_recommendation=Empty DataFrame
Columns: []
Index: [])
n_iterations = 3, batch_size = 3

    @pytest.mark.slow
    @pytest.mark.parametrize(
        "kernel", valid_kernels, ids=[c.__class__ for c in valid_kernels]
    )
    @pytest.mark.parametrize("n_iterations", [3], ids=["i3"])
    def test_kernels(campaign, n_iterations, batch_size):
>       run_iterations(campaign, n_iterations, batch_size)

tests/test_iterations.py:243: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

rs = <RetryCallState 13644971472: attempt #5; slept for 0.0; last result: failed (_LinAlgError linalg.eigh: (Batch element ... failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 2).)>

    def exc_check(rs: "RetryCallState") -> None:
        fut = t.cast(Future, rs.outcome)
        retry_exc = self.retry_error_cls(fut)
        if self.reraise:
            raise retry_exc.reraise()
>       raise retry_exc from fut.exception()
E       tenacity.RetryError: RetryError[<Future at 0x374c696d0 state=finished raised _LinAlgError>]

@Scienfitz
Copy link
Collaborator

Scienfitz commented Oct 4, 2024

@Hrovatin
the example picture is off, the curve for RDKIT is wrong, seem the batch size was influenced... something very strange here is going on. Ive noticed this in one of my very earliest tests too, before looking at the ruggedness etc the curve needs to be fixed

You can ignore the _LinAlgError , it is one of the exceptions. Some of our parameter combinations sometimes seem to cause numerical instabilities that even after several repeats with different random seeds will fail. This happens randomly, you can ignore it for the ourpose of this PR (it is often fixed by re-running the failed job anyway)

A second error you can ignore in the CI is that soemtimes serializations seem to fail, without a good indication as to why. This does not hapen lcoally oin clean environments and likely is an artifact of packages, the agent etc, if it happens in your PR ignore it

@Hrovatin
Copy link
Collaborator Author

Hrovatin commented Oct 4, 2024

@Hrovatin the example picture is off, the curve for RDKIT is wrong, seem the batch size was influenced... something very strange here is going on. Ive noticed this in one of my very earliest tests too, before looking at the ruggedness etc the curve needs to be fixed

When just running SubstanceParameter(name=chem_type, data=data, encoding=encoding).comp_df with RDKit and the molecules from the example the shape of samples in the output df is correct. Any other ideas where this could be coming from?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Expand / change existing functionality new feature New functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants