-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/scikit fingerprints #364
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first round of comments
comp_df = chemistry.smiles_to_fingerprint_features( | ||
vals, | ||
fingerprint_encoder=AVAILABLE_SKFP_FP[self.encoding.name](), | ||
prefix=pref, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add optional paramters here that might affect the fingerprint
for isntance bit length and radius
these could become attributes of the substanceparameter class
they should simply be ignored if a fingperirnt is selected that does not use this info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added kwargs, up to user to specify them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks reasonable to me
did you add a test? Should perhaps add a parameterized test with 2 different kwargs x 3 fingerprints or similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added to test_fingeprints.py
- test a few different params passed for fp computation. All FP names are tested in hypothesis parameters strategy substance_parameters
For the future: I think theres no need for the fork, I added you as member, I think working directly on branches here has some advantages such that others can check out your branch seamlessly |
Co-authored-by: Martin Fitzner <[email protected]>
tests/simulate_telemetry.py
Outdated
@@ -55,9 +55,11 @@ | |||
} | |||
|
|||
parameters = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests for all fingerprints as some are not tested yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Scienfitz do you think we need to adapt this or would be what I added to test_fingerprints.py
enough - just testing that all can be run from smiles to embedding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit confused about the commen being on this fiel simulate_telemetry
as for fingerprint parameterization: once you wrote a test you can simply parameterize it via pytest. For instance I can imagine a tets looping over i) all fingperrints available, ie parameterized via SubstanceEncoding
so it automatically always runs over all valdi choices - with default kwargs etc PLUS ii) some selected fps like ECFP run with special kwargs to also test different nbits and radii
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i) SubstanceEncoding values are samples in hypothesis_strategies/parameters.py
ii) Added a few tests with different fp kwargs in tests/test_fingeprints.pt
- done on ECFP example
…atin/baybe into feature/scikit_fingerprints
@Scienfitz I made first round of changes covering all of the above EDIT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Hrovatin, thanks for taking care of this item. Haven't had the time yet to go fully over it (and probably won't before I leave). But here at least a few points from my first glance
from rdkit import Chem, RDLogger | ||
from rdkit.Chem.rdMolDescriptors import GetMorganFingerprintAsBitVect | ||
from rdkit import Chem | ||
from skfp import fingerprints as skfp_fingerprints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the renaming should rather happen in the importing file, not here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this even renamed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it just to be clear where they are coming from, can remove renaming
baybe/parameters/substance.py
Outdated
converter=lambda x: ( | ||
# Passed enum | ||
x | ||
if isinstance(x, SubstanceEncoding) | ||
# Passed enum name | ||
else ( | ||
SubstanceEncoding[x] | ||
if x in SubstanceEncoding.__members__ | ||
# Passed enum value | ||
else SubstanceEncoding(x) | ||
) | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't yet quite looked in detail at the logic, but this by far the longest converter I've seen 😄 There must be a more elegant / readable way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the others are shorter for enums as names equal values, but this enum has currently different names and values, which requires these checks. But feel free to suggest how to adapt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps have a try to let chatgpt compress this to the max, but overall I dont think this is terribly long (the situation is simply more complex as with converters used elsewhere with enums)
other than that feel free to resolve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not relevant any more as enum and hence this field was changed
baybe/parameters/enum.py
Outdated
class SubstanceEncoding(ParameterEncoding): | ||
"""Available encodings for substance parameters.""" | ||
|
||
MORDRED = "MORDRED" | ||
"""Encoding based on Mordred chemical descriptors.""" | ||
AtomPairFingerprint = "ATOMPAIR" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, we use UPPER_CASE writing for enum values (see other encodings, for example). Also, not sure if the "Fingerprint" at the end is really necessary? If possible, I'd rather sync the enum field name with its value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to map parameters (currently) enum values to class names used for scikit-fingerprint
(enum names) so that they can be directly converted by SubstanceParameter. I can probably switch the two (but need to test if that breaks sth else).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this correspondence is required its fine for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the SubstanceEncoding enum follows this rule, but there is a new enum with mapping to skfp classes (added for the new helper function that converts fp names to cls+params
Co-authored-by: AdrianSosic <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would encourage you to perhaps write one or two more tests to really check the new functionality like kwargs, deprecation etc
baybe/parameters/enum.py
Outdated
class SubstanceEncoding(ParameterEncoding): | ||
"""Available encodings for substance parameters.""" | ||
|
||
MORDRED = "MORDRED" | ||
"""Encoding based on Mordred chemical descriptors.""" | ||
AtomPairFingerprint = "ATOMPAIR" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this correspondence is required its fine for me
@Hrovatin It would be great if, as a final sanity check, you could run the i) results improve which is fantastic can you confirm? |
Co-authored-by: Martin Fitzner <[email protected]>
Co-authored-by: Martin Fitzner <[email protected]>
Co-authored-by: Martin Fitzner <[email protected]>
…atin/baybe into feature/scikit_fingerprints
@Hrovatin once the open comments are tended to we can mark this PR as ready for review As last step before doing so: please i) make a copy of this branch as backup and then ii) merge main into this to update all the other changes that came into the repo since you started the fork. form the looks of it it doesnt seem many conflcits so that merge should be ok. Due to the amount of changes I thing rebasing is not an option but if you make backup branches you could also try that |
Co-authored-by: Martin Fitzner <[email protected]>
Co-authored-by: Martin Fitzner <[email protected]>
Co-authored-by: Martin Fitzner <[email protected]>
Co-authored-by: Martin Fitzner <[email protected]>
Co-authored-by: Martin Fitzner <[email protected]>
@Scienfitz for RDKIt there is some ruggedness where it flattens off I checked the encoding statistics and RDKit does not seem to differ from ECFP in their distn. N of features: |
…/scikit_fingerprints
@Scienfitz I made the merge. Please check comments that are still open and lmk if I should change sth or close them if ok with you |
With
|
@Hrovatin You can ignore the A second error you can ignore in the CI is that soemtimes serializations seem to fail, without a good indication as to why. This does not hapen lcoally oin clean environments and likely is an artifact of packages, the agent etc, if it happens in your PR ignore it |
When just running |
See #359