
Introduce a base class for classes with JSON-interchange functions #292

Merged · 28 commits · Jan 12, 2025

Conversation


@rwxayheee rwxayheee commented Dec 18, 2024

This PR introduces a base class for the many existing classes that need JSON-interchange functionality. The common, high-level functions (from_json, to_json) are inherited from the base class, whereas the specific encoding/decoding logic is defined within each subclass.
The major objective is to reduce code duplication, improve error handling, and make future maintenance easier. This may or may not be over-engineering of simple things that already work, but I thought it would be really useful when I opened #279. I hope it can also help with the consistency issue in #288.

  • Introduce BaseJSONParsable at 5ec1bf7
    This working checkpoint shows the basic layout of class BaseJSONParsable. Note that RDKitMoleculeSetup is still a subclass of MoleculeSetup. For subclasses with additional inheritance relationships like this, the decoder/encoder of the parent may be used in the child class.

  • ResidueChemTemplates: Harmonize from_json and from (data) json at 1cd7a77
    This working checkpoint shows how the parameter data JSON string and the standard serialized JSON of ResidueChemTemplates may be parsed in the same way.

  • MoleculePreparation: harmonize from_config and from (deserialized json object) dict at 252bc95
    This is for #288 (fix from_json methods, add from_json_file and from_dict methods).

  • Revisit optional_keys
    Future changes like #254 (add cycle_break attribute to molsetup Bond) should be possible through changes to __init__ and the JSON-interchange region of the subclass.

@rwxayheee rwxayheee changed the title Introduce Base Class for JSON Parsing Introduce a base class for classes with JSON-interchange functions Dec 19, 2024
@rwxayheee rwxayheee marked this pull request as ready for review December 19, 2024 12:19
@rwxayheee

rwxayheee commented Dec 19, 2024

This is done, but open to any changes it may need. Here's a very brief summary:

Implemented code structure in this PR

  • A subclass of BaseJSONParsable needs the following assets that contain subclass-specific details:
    expected_json_keys: Optional[frozenset[str]] = None

    @classmethod
    def json_encoder(cls, obj):
        raise NotImplementedError("Subclasses must implement json_encoder.")
    
    @classmethod
    def _decode_object(cls, obj):
        raise NotImplementedError("Subclasses must implement _decode_object.")
  • BaseJSONParsable provides the other high-level functions that are sharable:
    @classmethod
    def json_decoder(cls, obj: dict[str, Any], check_keys: bool = True):
        # Avoid using json_decoder as object_hook for nested objects
        if not isinstance(obj, dict):
            return obj
        if check_keys and not cls.expected_json_keys.issubset(obj.keys()):
            return obj

        # Delegate specific decoding logic to a subclass-defined method
        return cls._decode_object(obj)

    @classmethod
    def from_json(cls, json_string: str):
        try: 
            # Log mismatched keys if non-critical 
            obj_dict = json.loads(json_string)
            actual_keys = obj_dict.keys()
            if actual_keys != cls.expected_json_keys:
                missing_keys = cls.expected_json_keys - actual_keys
                extra_keys = actual_keys - cls.expected_json_keys
                logging.warning(
                    f"Key mismatch for {cls.__name__}. "
                    f"Missing keys: {missing_keys}, Extra keys: {extra_keys}."
                )

            try: 
                obj = json.loads(json_string, object_hook=cls.json_decoder)
            
            # Error occurred within cls.json_decoder
            except ValueError as decoder_error: 
                raise RuntimeError(
                    f"An error occurred when creating {cls.__name__} from JSON."
                    f"Error: {decoder_error}"
                )

            # Validate the decoded object type
            if not isinstance(obj, cls):
                raise ValueError(
                    f"Decoded object is not an instance of {cls.__name__}. "
                    f"Got: {type(obj)}"
                )
            return obj

        # Error occurred within json.loads
        except Exception as parser_error: 
            raise RuntimeError(
                    f"Unable to load JSON string for {cls.__name__}. "
                    f"Error: {parser_error}"
            )

    @classmethod
    def from_dict(cls, obj: dict) -> "BaseJSONParsable":
        return cls.json_decoder(obj, check_keys=False)
    
    def to_json(self):
        return json.dumps(self, default=self.__class__.json_encoder)

Function _decode_object focuses on the decoding logic, while from_json sits at a higher level and is mostly for error handling. At the intermediate level, json_decoder serves as the object_hook and defers nested objects that can't be decoded by the current _decode_object logic. (A minimal sketch follows at the end of this comment.)

  • Other possible implementations
    from_json_file and from_config are a bit tricky: because the parameter or option files could be prepared in other ways (e.g. manually), they may not be strictly formatted or have all the required keys. Therefore, I didn't put these methods in the base class.
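
To make this concrete, here is a minimal, self-contained sketch that condenses the base class above and plugs in a hypothetical Point subclass (illustration only, not actual Meeko code):

import json
from typing import Optional


class BaseJSONParsable:
    expected_json_keys: Optional[frozenset] = None

    @classmethod
    def json_encoder(cls, obj):
        raise NotImplementedError("Subclasses must implement json_encoder.")

    @classmethod
    def _decode_object(cls, obj):
        raise NotImplementedError("Subclasses must implement _decode_object.")

    @classmethod
    def json_decoder(cls, obj, check_keys: bool = True):
        # defer non-dicts and dicts missing expected keys, so nested objects
        # fall through to their own decoders
        if not isinstance(obj, dict):
            return obj
        if check_keys and not cls.expected_json_keys.issubset(obj.keys()):
            return obj
        return cls._decode_object(obj)

    @classmethod
    def from_json(cls, json_string: str):
        obj = json.loads(json_string, object_hook=cls.json_decoder)
        if not isinstance(obj, cls):
            raise ValueError(f"Decoded object is not an instance of {cls.__name__}")
        return obj

    @classmethod
    def from_dict(cls, obj: dict):
        return cls.json_decoder(obj, check_keys=False)

    def to_json(self):
        return json.dumps(self, default=self.__class__.json_encoder)


class Point(BaseJSONParsable):
    expected_json_keys = frozenset({"x", "y"})

    def __init__(self, x: float, y: float):
        self.x = x
        self.y = y

    @classmethod
    def json_encoder(cls, obj: "Point") -> dict:
        return {"x": obj.x, "y": obj.y}

    @classmethod
    def _decode_object(cls, obj: dict) -> "Point":
        return cls(obj["x"], obj["y"])


point = Point(1.0, 2.0)
restored = Point.from_json(point.to_json())  # round-trips through to_json/from_json
assert isinstance(restored, Point) and restored.x == 1.0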

@rwxayheee

Usage note at fdc08f7 (same in 252bc95)

1. Export to JSON: Without explicitly naming the encoder

This usage has changed, to help ensure the correct encoding logic is always used.

Before

with open(filename, "w") as f:
    json.dump(polymer, f, cls=PolymerEncoder)

After

with open(filename, "w") as f:
    f.write(polymer.to_json())

2. Create from JSON: With specific decoder from subclass

This is relatively unchanged from the original implementation (which also requires the name of the expected class). The name of the function is now shorter.

Before

decoded_polymer = json.loads(
    json_str, object_hook=polymer.polymer_json_decoder
)

After

decoded_polymer = Polymer.from_json(json_str)

3. Override the "Correct" decoder/encoder

It seems possible to override the corresponding decoder/encoder:

starting_molsetup = populated_rdkit_molsetup
# full JSON dump (correct) 
json_str = starting_molsetup.to_json() 
# partial JSON dump, as the RDKitMoleculeSetup is an extended MoleculeSetup
json_str2 = json.dumps(starting_molsetup, default=MoleculeSetup.json_encoder, skipkeys=True) 

# creates RDKitMoleculeSetup (correct)
decoded_molsetup = RDKitMoleculeSetup.from_json(json_str)
# creates a MoleculeSetup
decoded_molsetup2 = json.loads(json_str, object_hook = MoleculeSetup.json_decoder)
# creates a MoleculeSetup
decoded_molsetup3 = MoleculeSetup.from_dict(json.loads(json_str))

4. Lenient Parsing (from_dict, from_json_file, from_config)

from_dict is exactly json_decoder but allows excess keys.

from_json_file and from_config are currently only implemented for certain classes whose input may come from a manually written option or parameter file.
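
As a small illustration of the key gate that separates the two (a standalone function mirroring the quoted json_decoder logic, not the actual Meeko API):

# Hypothetical stand-in for a subclass's expected keys (ResidueTemplate-style)
expected_json_keys = frozenset({"mol", "link_labels", "atom_name"})

def would_decode(obj: dict, check_keys: bool = True) -> bool:
    # Mirrors the gate in json_decoder: with check_keys=True a dict is only
    # handed to _decode_object when it covers all expected keys; from_dict
    # (check_keys=False) skips the gate entirely.
    return not check_keys or expected_json_keys.issubset(obj.keys())

print(would_decode({"mol": 1, "link_labels": {}, "atom_name": [], "extra": 0}))  # True: excess keys pass
print(would_decode({"mol": 1}))                                                  # False: deferred by json_decoder
print(would_decode({"mol": 1}, check_keys=False))                                # True: from_dict is lenient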

5. Changes of certain keys

To re-use the decoder/encoder, there are some minor changes in the spelling of expected keys. They were changed to be consistent with the existing JSON files in the meeko/data directory.

class ResidueTemplate:
    expected_json_keys = {"mol", "link_labels", "atom_name"}  # was atom_names

class ResiduePadder:
    expected_json_keys = {
        "rxn_smarts",
        "adjacent_res_smarts",  # was adjacent_smarts
        "auto_blunt",
    }

@diogomart please let me know if you like this. It does not introduce new features, but it has been a good experience for me to review the JSON-bound properties; I understand the data structure better now.

@rwxayheee rwxayheee requested a review from diogomart December 19, 2024 23:05
Comment on lines +114 to +117
if not isinstance(obj, dict):
    return obj
if check_keys and not cls.expected_json_keys.issubset(obj.keys()):
    return obj
Contributor

I know that the original code had this already, but we should try to raise an error instead of returning the input obj. Returning the input obj makes errors very hard to find as it just fails silently. Unless there is a purpose in returning the input obj.


@rwxayheee rwxayheee Dec 27, 2024


My bad, I said two things wrong in the comment above (I removed it already). Correction: (1) The check of the returned object must be placed elsewhere if json_decoder needs to be used as object_hook; otherwise, nested objects will be decoded with the same object_hook. (2) The check is deferred to the higher-level function from_json in the base class, not the class-specific _decode_object.


@rwxayheee rwxayheee Dec 27, 2024


The code structure is explained here:
#292 (comment)

Function _decode_object focuses on the decoding logic, while from_json sits at a higher level and is mostly for error handling. At the intermediate level, json_decoder serves as the object_hook and defers nested objects that can't be decoded by the current _decode_object logic.

Contributor

Makes sense - thanks for clarifying.

meeko/polymer.py Outdated
Comment on lines 2616 to 2642
def json_encoder(cls, obj: "ResidueTemplate") -> Optional[dict[str, Any]]:
    output_dict = {
        "mol": rdMolInterchange.MolToJSON(obj.mol),
        "link_labels": obj.link_labels,
        "atom_name": obj.atom_names,
    }
    return output_dict

# Keys to check for deserialized JSON
expected_json_keys = {"mol", "link_labels", "atom_name"}

@classmethod
def _decode_object(cls, obj: dict[str, Any]):

    # Converting ResidueTemplate init values that need conversion
    deserialized_mol = rdkit_mol_from_json(obj.get("mol"))
    # do not write canonical smiles to preserve original atom order
    if deserialized_mol:
        mol_smiles = rdkit.Chem.MolToSmiles(deserialized_mol, canonical=False)
    # if dry json (data) is supplied
    else:
        mol_smiles = obj.get("smiles")

    link_labels = convert_to_int_keyed_dict(obj.get("link_labels"))

    # Construct a ResidueTemplate object
    residue_template = cls(mol_smiles, None, obj.get("atom_name"))

@diogomart diogomart Dec 27, 2024


Renaming atom_names to atom_name makes it fail to read polymer JSON written by meeko v0.6.1. I think this was to reuse the from_dict function from the base class to load the default templates, which is great, but we must be able to read polymer JSON from v0.6.1. I'll push a (failing) test for this.

I'm also concerned about the obj.get(key) vs obj[key]. I think it's better to have a missing key error, which is easy to debug, than some trouble with an unexpected None with a more complex error traceback.


@rwxayheee rwxayheee Dec 27, 2024


The obj.get() calls here are forgiving toward (manually prepared) templates that are missing properties. Could we allow missing keys ("link_labels" and/or "atom_name") in template files?


@rwxayheee rwxayheee Dec 27, 2024


The keys are checked in from_json, and missing or excess keys are logged from there. If a mismatch causes a failure of object creation, a RuntimeError is thrown after attempting to create the obj; otherwise it is just logged. I use obj.get() for some keys (maybe not all, that's bad; I should have made these more visible and consistent) in the decoding logic to allow lenient parsing. In other words, the error handling is in from_json, and the class-specific decoding function _decode_object is generally forgiving.

Contributor Author

Do you prefer an early return (no attempt to create obj at all) for missing keys? Currently the mismatched keys are logged if non-critical

Contributor

As discussed, and as you added in the comment below, we just need to keep backwards compatibility.

@rwxayheee

rwxayheee commented Dec 27, 2024

Hi @diogomart
Thanks for the review. I replied in comments. But here are two quick things I wanted to check with you:

When we're creating obj from JSON: Do you prefer an early return (no attempt to create obj at all) for missing keys? Currently the mismatched keys are logged if non-critical.

The use of obj.get() might be a bit inconsistent. I will double-check. But with the current code structure I'm personally ok with always using obj.get(), because I'm forgiving of mismatched keys unless they cause a problem. Is this still ok or not?
I want to change them all to obj.get(), but maybe I'm lacking the knowledge of what could go wrong in this situation.

@rwxayheee

rwxayheee commented Dec 27, 2024

At present, I use obj.get() in very few cases (8 occurrences). There are two main situations:

(1) The keys are optional for lenient parsing.

(2) The keys correspond to a computed attribute, not an initializer attribute, so the keyword is optional for the __init__ function. When I started marking computed vs. initializer attributes, I wanted to allow creating objects from JSON at different stages of their lifecycle (processed or unprocessed). Therefore, I want to allow more obj.get(). A hypothetical sketch of this situation follows below.
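
To illustrate situation (2), here is a small hypothetical sketch (ToyAtom is illustrative only, not a Meeko class):

class ToyAtom:
    def __init__(self, index: int, element: str):
        self.index = index      # initializer attributes: required keys, accessed with obj[...]
        self.element = element
        self.charge = None      # computed attribute, filled in later in the lifecycle

    @classmethod
    def _decode_object(cls, obj: dict) -> "ToyAtom":
        atom = cls(obj["index"], obj["element"])  # missing required key raises a clear KeyError
        atom.charge = obj.get("charge")           # optional: absent in "unprocessed" JSON dumps
        return atom


processed = ToyAtom._decode_object({"index": 0, "element": "C", "charge": -0.12})
unprocessed = ToyAtom._decode_object({"index": 0, "element": "C"})  # still decodes; charge stays None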


@classmethod
def json_encoder(cls, obj):
    raise NotImplementedError("Subclasses must implement json_encoder.")

@rwxayheee rwxayheee Dec 30, 2024


@abstractmethod (which happens to be a GitHub username..) is better here. If we are doing a revision I will change it.

Contributor

sounds good!

@rwxayheee

rwxayheee commented Jan 7, 2025

  • Support jsons with old syntax (attribute names)
  • Review use of obj.get()
  • Implements abstract method instead of not implemented error

in the corresponding _decode_object functions
include name of the class in the logging
currently only used for auto_blunt, or when necessary for parsing templates
@rwxayheee

rwxayheee commented Jan 8, 2025

Changes since 681a795

  • Enabled support for jsons with old keys (attribute names)
    old key: atom_names → new key: atom_name
    This change affected the decoding logic of the Monomer and ResidueTemplate classes.

    old key: adjacent_smarts → new key: adjacent_res_smarts
    This change affected the ResiduePadder class.

    To address this, I introduced a new class function, access_with_deprecated_key. It is inheritable from the base class. Additionally, it must be used together with lenient parsing, i.e. <class name>.json_decoder(check_keys=False). (A hypothetical sketch of such a helper follows after this list.)

  • Eliminated use of obj.get()
    This is now only used when necessary (for parsing of alternate formats).

  • Implemented abstract method instead of not implemented error
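
For reference, a hypothetical sketch of what such a helper could look like (the actual signature and behavior in the PR may differ):

import logging
from typing import Any


def access_with_deprecated_key(obj: dict, key: str, deprecated_key: str) -> Any:
    # Prefer the new key, but fall back to the old spelling so JSON written by
    # older versions (e.g. "atom_names" instead of "atom_name") still loads.
    if key in obj:
        return obj[key]
    if deprecated_key in obj:
        logging.warning(f"Using deprecated key {deprecated_key!r} in place of {key!r}.")
        return obj[deprecated_key]
    raise KeyError(key)


# e.g. inside a ResidueTemplate-style _decode_object, given an old-style dict:
old_style = {"mol": "...", "link_labels": {}, "atom_names": ["N", "CA", "C", "O"]}
atom_name = access_with_deprecated_key(old_style, "atom_name", "atom_names")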

Sorry there were two changes I forgot to add in 8273d5f.

@diogomart This is ready, please let me know if it works! Thanks for your time and kind advice in advance

@diogomart

diogomart commented Jan 9, 2025

I think the try/except statements in from_json() obfuscate error messages. Consider the following example

from meeko.molsetup import Atom

a = Atom(42)
b = Atom.from_json(a.to_json())

If I remove all try/except and just execute the code, I can see that the error occurred here:

  File "/home/diogom/Work/code/Meeko/meeko/molsetup.py", line 249, in _decode_object
    pdbinfo = PDBAtomInfo(*obj["pdbinfo"])

but the try/except wraps the error traceback and it's harder to find the problem.

Is there an advantage in the try/except?

@rwxayheee

Hi @diogomart

Can you show which try/except wraps the traceback? It's possible to get the traceback while still using try/except. There's a nested try/except to distinguish different types of errors (see comments in code). But I'm not sure if this is the best approach to handle this.

Can you suggest an alternate approach?

@rwxayheee

We discussed a similar issue before. Personally I don't have the habit of printing the traceback (so I forgot to take care of this, though I should), but I recall that we agreed on doing something like this for the PolymerCreationError:

Meeko/meeko/polymer.py

Lines 484 to 495 in 07a38f8

class PolymerCreationError(RuntimeError):
    def __init__(self, error: str, recommendations: str = None):
        super().__init__(error)  # main error message to pass to RuntimeError
        self.error = error
        self.recommendations = recommendations
        exc_type, exc_value, exc_traceback = exc_info()
        if exc_value is not None:
            self.traceback = ''.join(traceback.format_exception(exc_type, exc_value, exc_traceback))
        else:
            self.traceback = None

Would you like the same handling, or do you prefer just not using try/except? Tracing an error raised from the decoding/encoding of a nested object might need a little more work, but I could try.

@diogomart

I would just remove the try/except in from_json() in jsonutils.py; then it looks like this (about half the size):

    @classmethod
    def from_json(cls, json_string: str):
        # Log mismatched keys if non-critical 
        obj_dict = json.loads(json_string)
        actual_keys = obj_dict.keys()
        if actual_keys != cls.expected_json_keys:
            missing_keys = cls.expected_json_keys - actual_keys
            extra_keys = actual_keys - cls.expected_json_keys
            logging.warning(
                f"Key mismatch for {cls.__name__}. "
                f"Missing keys: {missing_keys}, Extra keys: {extra_keys}."
            )
        obj = json.loads(json_string, object_hook=cls.json_decoder)

@diogomart

Yeah we could re-raise the error, but I think that's just more boilerplate code for the same kind of traceback.

For the polymer construction we are catching more errors to provide more elaborate messages to the user, and to report all failures (to match templates) at once, so the user can act on all of them instead of having to run mk_prepare_receptor.py over and over again. But that is not a feature we need here.

@rwxayheee

rwxayheee commented Jan 9, 2025

We can just remove the try/except, but then it will not be able to distinguish different types of errors.

But that is not a feature we need here.

(I might have written a request in an open issue for this, but couldn't find it at the moment.. maybe I'm the only one who wants this, so I implemented it)

Try/except is always over-handling but I think it helps to know the general type of problem before exposing the traceback. I'm ok with no try/except as it requires minimal changes. But I can also do more work to print the traceback, if that's helpful.
We can talk more when we meet!

@rwxayheee

rwxayheee commented Jan 9, 2025

I think all json-bound errors should be raised at from_json (because errors from lower levels or nested objects are deferred there). But let me think about it a bit more; the current state of this PR is also not the best way to handle this.

@diogomart

Yeah, if you want the messages from the try/except, keep the current try/except code, but add the detailed traceback by re-raising the error (or something similar) so we can more easily debug things in the future. In the example I posted above, the line we really need to know about is

  File "/home/diogom/Work/code/Meeko/meeko/molsetup.py", line 249, in _decode_object
    pdbinfo = PDBAtomInfo(*obj["pdbinfo"])

we must still expose this detailed traceback.

@diogomart

The simplest way might be to re-raise, by adding, for example, from decoder_error:

            except ValueError as decoder_error:
                raise RuntimeError(
                    f"An error occurred when creating {cls.__name__} from JSON."
                    f"Error: {decoder_error}"
                ) from decoder_error

@diogomart

@rwxayheee Never mind... the traceback is already printed in the PR as is; I missed it while testing and got confused. Please ignore these last few messages, sorry.

@diogomart

When trying to decode with a missing attribute we get an error as expected:

from meeko.molsetup import Bond
import json

bond = Bond(42, 43, rotatable=True)
json_bond = bond.to_json()
dict_bond = json.loads(json_bond)
dict_bond.pop("rotatable")
try:
    decoded_bond = Bond.from_json(json.dumps(dict_bond))
    print("Successfully decoded Bond without rotatable attribute")
except Exception as err:
    print("Failed at decoding Bond without rotatable attribute")
    print(err)

We popped the "rotatable" attribute out of the JSON, so we get an error. All good here.

However, if we do the same within a molecule setup, then we get no error:

from meeko import RDKitMoleculeSetup
from meeko import MoleculePreparation
from scrubber import Scrub
from rdkit import Chem
import json


mk_prep = MoleculePreparation()
scrub = Scrub()

butanol = Chem.MolFromSmiles("OCCC")
butanol = scrub(butanol)[0]
molsetup = mk_prep(butanol)[0]
for bond_id, bond in molsetup.bond_info.items():
    print(f"{bond_id=} {type(bond).__name__}")

data = json.loads(molsetup.to_json())
for bond_id, bond in data["bond_info"].items(): 
    bond.pop("rotatable")
decoded_molsetup = RDKitMoleculeSetup.from_json(json.dumps(data))

print("Bonds in decoded molsetup are dict, not type Bond")
for bond_id, bond in decoded_molsetup.bond_info.items():
    print(f"{bond_id=} {type(bond).__name__}")

This prints

bond_id=(0, 1) Bond
bond_id=(1, 2) Bond
bond_id=(2, 3) Bond
bond_id=(0, 4) Bond
bond_id=(1, 5) Bond
bond_id=(1, 6) Bond
bond_id=(2, 7) Bond
bond_id=(2, 8) Bond
bond_id=(3, 9) Bond
bond_id=(3, 10) Bond
bond_id=(3, 11) Bond
Bonds in decoded molsetup are dict, not type Bond
bond_id=(0, 1) dict
bond_id=(1, 2) dict
bond_id=(2, 3) dict
bond_id=(0, 4) dict
bond_id=(1, 5) dict
bond_id=(1, 6) dict
bond_id=(2, 7) dict
bond_id=(2, 8) dict
bond_id=(3, 9) dict
bond_id=(3, 10) dict
bond_id=(3, 11) dict

We should still be getting an error.

@rwxayheee

Hi @diogomart
Thanks for the review. Sounds like we're able to see the error raised from a nested object. I recall that we should have the traceback from the point where the error was raised. There's room to improve if other people need more detailed errors; for now I will not modify the current PR, but please let me know if there's anything else I can do.

One last note on this is that this PR doesn't include #254. After we merge this, I can resubmit another PR in place of #254.

@diogomart

diogomart commented Jan 10, 2025

I already merged this PR into #254, no worries about that one. But could you take a look at why removing an attribute from a JSON string and passing it to the from_json constructor of the respective class raises an error (for example, removing the rotatable attribute of a bond and passing the JSON string to Bond.from_json()), but we don't get an error if we remove the rotatable attribute from the bonds of a molecule setup's JSON and pass it to RDKitMoleculeSetup.from_json()?
edit: see #292 (comment) above

@rwxayheee

@diogomart I get what you mean now, sorry I misunderstood the comment.
There's a quick way to fix it (replacing json_decoder with from_dict or from_json with options), but allow me to reconsider the check_keys option. I'm not very sure I implemented it the way I want.

@rwxayheee

Hi @diogomart
761e829 replaced all uses of json_decoder with from_dict for decoding nested objects. This way, any problem with decoding a nested object is raised directly from where it occurs. In the example you provided, the missing key rotatable triggers an error from here:

  File "/Users/amyhe/Desktop/0_forks/Meeko/meeko/molsetup.py", line 306, in _decode_object
    rotatable = obj["rotatable"]
KeyError: 'rotatable'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/amyhe/Downloads/test_json_pr.py", line 21, in <module>
    decoded_molsetup = RDKitMoleculeSetup.from_json(json.dumps(data))
  File "/Users/amyhe/Desktop/0_forks/Meeko/meeko/utils/jsonutils.py", line 168, in from_json
    raise RuntimeError(
RuntimeError: Unable to load JSON string for RDKitMoleculeSetup. Error: 'rotatable'

This relies on directly accessing a key from the dict, not obj.get. As you suggested before, obj.get should be avoided as much as possible.

This is the minimal fix I can think of (without implementing more sophisticated error-passing functions). I will stay with the current usage of check_keys, as it prevents unexpected decoding of nested objects.
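
As a rough, self-contained illustration of this change (hypothetical ToyBond/ToySetup classes, not the Meeko ones), decoding the nested dicts with the nested class's from_dict lets the missing-key error surface:

class ToyBond:
    def __init__(self, rotatable: bool):
        self.rotatable = rotatable

    @classmethod
    def from_dict(cls, obj: dict) -> "ToyBond":
        return cls(rotatable=obj["rotatable"])  # KeyError if "rotatable" is missing


class ToySetup:
    def __init__(self, bond_info: dict):
        self.bond_info = bond_info

    @classmethod
    def _decode_object(cls, obj: dict) -> "ToySetup":
        # delegate each nested bond to ToyBond.from_dict instead of an object_hook
        # that would quietly return the raw dict on a key mismatch
        bond_info = {key: ToyBond.from_dict(d) for key, d in obj["bond_info"].items()}
        return cls(bond_info)


good = ToySetup._decode_object({"bond_info": {"(0, 1)": {"rotatable": True}}})
assert isinstance(good.bond_info["(0, 1)"], ToyBond)

try:
    ToySetup._decode_object({"bond_info": {"(0, 1)": {}}})
except KeyError as err:
    print("missing key surfaces as an error:", err)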

@rwxayheee

rwxayheee commented Jan 10, 2025

Just one more note: I feel it would be helpful to have some function for each class to validate whether the nested objects are of the correct type. The type-validation function is not strictly JSON-bound and could be useful in a more general way.
Currently it's possible in Python to assign anything (not the expected type) to a nested object, not just from JSON.

Happy to work on it in a new issue or PR!
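
For what it's worth, a rough sketch of the kind of helper I have in mind (a hypothetical validate_nested_types function; not part of this PR):

def validate_nested_types(obj, expected: dict) -> None:
    # Check that each listed attribute holds only instances of the expected type,
    # whether the container is a dict (its values are checked) or a list.
    for attr, expected_type in expected.items():
        container = getattr(obj, attr)
        values = container.values() if isinstance(container, dict) else container
        for value in values:
            if not isinstance(value, expected_type):
                raise TypeError(
                    f"{type(obj).__name__}.{attr} expects {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )


# e.g. validate_nested_types(decoded_molsetup, {"bond_info": Bond}) after decoding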

@diogomart

The real problem might be.... Python itself with dynamic typing.

@rwxayheee

Thanks for the approval! Merging to develop

@rwxayheee rwxayheee merged commit 7f6e27a into forlilab:develop Jan 12, 2025
1 check passed