-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce a base class for classes with JSON-interchange functions #292
Conversation
This is done, but open to any changes it may need. Here's a very brief summary: Implemented code structure in this PR
expected_json_keys: Optional[frozenset[str]] = None
@classmethod
def json_encoder(cls, obj):
raise NotImplementedError("Subclasses must implement json_encoder.")
@classmethod
def _decode_object(cls, obj):
raise NotImplementedError("Subclasses must implement _decode_object.")
@classmethod
def json_decoder(cls, obj: dict[str, Any], check_keys: bool = True):
# Avoid using json_decoder as object_hook for nested objects
if not isinstance(obj, dict):
return obj
if check_keys and not cls.expected_json_keys.issubset(obj.keys()):
return obj
# Delegate specific decoding logic to a subclass-defined method
return cls._decode_object(obj)
@classmethod
def from_json(cls, json_string: str):
try:
# Log mismatched keys if non-critical
obj_dict = json.loads(json_string)
actual_keys = obj_dict.keys()
if actual_keys != cls.expected_json_keys:
missing_keys = cls.expected_json_keys - actual_keys
extra_keys = actual_keys - cls.expected_json_keys
logging.warning(
f"Key mismatch for {cls.__name__}. "
f"Missing keys: {missing_keys}, Extra keys: {extra_keys}."
)
try:
obj = json.loads(json_string, object_hook=cls.json_decoder)
# Error occurred within cls.json_decoder
except ValueError as decoder_error:
raise RuntimeError(
f"An error occurred when creating {cls.__name__} from JSON."
f"Error: {decoder_error}"
)
# Validate the decoded object type
if not isinstance(obj, cls):
raise ValueError(
f"Decoded object is not an instance of {cls.__name__}. "
f"Got: {type(obj)}"
)
return obj
# Error occurred within json.loads
except Exception as parser_error:
raise RuntimeError(
f"Unable to load JSON string for {cls.__name__}. "
f"Error: {parser_error}"
)
@classmethod
def from_dict(cls, obj: dict) -> "BaseJSONParsable":
return cls.json_decoder(obj, check_keys=False)
def to_json(self):
return json.dumps(self, default=self.__class__.json_encoder) Function
|
Usage note at fdc08f7 (same in 252bc95) 1. Export to JSON: Without explicitly naming the encoderThis is a changed usage, aiming to help ensure the correct encoding logic is always used. Before with open(filename, "w") as f:
json.dump(polymer, f, cls=PolymerEncoder) After with open(filename, "w") as f:
f.write(polymer.to_json()) 2. Create from JSON: With specific decoder from subclassThis is relatively unchanged from the original implementation (which also requires the name of expected class). The name of the function is now shorter. Before decoded_polymer = json.loads(
json_str, object_hook=polymer.polymer_json_decoder
) After decoded_polymer = Polymer.from_json(json_str) 3. Override the "Correct" decoder/encoderIt seems possible to override the corresponding decoder/encoder: starting_molsetup = populated_rdkit_molsetup
# full JSON dump (correct)
json_str = starting_molsetup.to_json()
# partial JSON dump, as the RDKitMoleculeSetup is an extended MoleculeSetup
json_str2 = json.dumps(starting_molsetup, default=MoleculeSetup.json_encoder, skipkeys=True)
# creates RDKitMoleculeSetup (correct)
decoded_molsetup = RDKitMoleculeSetup.from_json(json_str)
# creates a MoleculeSetup
decoded_molsetup2 = json.loads(json_str, object_hook = MoleculeSetup.json_decoder)
# creates a MoleculeSetup
decoded_molsetup3 = MoleculeSetup.from_dict(json.loads(json_str)) 4. Lenient Parsing (from_dict, from_json_file, from_config)
5. Changes of certain keysTo re-use the decoder/encoder, there's some minor change in the spelling of expected keys. They were changed to be consistent with the existing JSON files in the meeko/data directory. class ResidueTemplate:
expected_json_keys = {"mol", "link_labels", "atom_name"} # was atom_names class ResiduePadder:
expected_json_keys = {
"rxn_smarts",
"adjacent_res_smarts", # was adjacent_smarts
"auto_blunt",
} @diogomart please let me know if you like this. It does not introduce new features but this has been a good experience for me to review the JSON-bound properties. I understand more about the data structure now. |
if not isinstance(obj, dict): | ||
return obj | ||
if check_keys and not cls.expected_json_keys.issubset(obj.keys()): | ||
return obj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that the original code had this already, but we should try to raise an error instead of returning the input obj
. Returning the input obj
makes errors very hard to find as it just fails silently. Unless there is a purpose in returning the input obj
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I said two things wrong in the comment above (I removed it already). Correction: (1) The check of returned object must be placed else where, if json_decoder
needs to be used as object_hook
. Otherwise, nested objects will be decoded with the same object_hook. (2) the check is deferred to the higher level function from_json
in base class, not class-specific _decode_object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code structure is explained here:
#292 (comment)
Function
_decode_object
focuses on the decoding logic, whilefrom_json
is on a higher level and is more for error handling. On the intermediate level, there'sjson_decoder
as object_hook and it defers nested objects that can't be decoded by the current _decode_object logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense - thanks for clarifying.
meeko/polymer.py
Outdated
def json_encoder(cls, obj: "ResidueTemplate") -> Optional[dict[str, Any]]: | ||
output_dict = { | ||
"mol": rdMolInterchange.MolToJSON(obj.mol), | ||
"link_labels": obj.link_labels, | ||
"atom_name": obj.atom_names, | ||
} | ||
return output_dict | ||
|
||
# Keys to check for deserialized JSON | ||
expected_json_keys = {"mol", "link_labels", "atom_name"} | ||
|
||
@classmethod | ||
def _decode_object(cls, obj: dict[str, Any]): | ||
|
||
# Converting ResidueTemplate init values that need conversion | ||
deserialized_mol = rdkit_mol_from_json(obj.get("mol")) | ||
# do not write canonical smiles to preserve original atom order | ||
if deserialized_mol: | ||
mol_smiles = rdkit.Chem.MolToSmiles(deserialized_mol, canonical=False) | ||
# if dry json (data) is supplied | ||
else: | ||
mol_smiles = obj.get("smiles") | ||
|
||
link_labels = convert_to_int_keyed_dict(obj.get("link_labels")) | ||
|
||
# Construct a ResidueTemplate object | ||
residue_template = cls(mol_smiles, None, obj.get("atom_name")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming atom_names
to atom_name
makes it fail to read polymer JSON written by meeko v0.6.1. I thing this was to reuse the from_dict
function from the base class to load the default templates, which is great, but we must be able to read polymer JSON from v0.6.1. I'll push a (failing) test for this.
I'm also concerned about the obj.get(key)
vs obj[key]
. I think it's better to have a missing key error, which is easy to debug, than some trouble with an unexpected None with a more complex error traceback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obj.get()
here are being forgiving for (manually prepared) templates that are missing properties. Could we allow missing keys ("link_labels" and/or "atom_name") in template files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The keys are checked in from_json
. The missing or excess keys will be logged from there. If it causes a failure of object creation it will throw an RuntimeError, after attempting to create the obj. Otherwise it will just be logged. I use obj.get()
for some keys (? maybe not all, that's bad - I should have made these more visible and consistent) in the decoding logic to allow lenient parsing. In other words, the error handling is in from_json
. The class-specific decoding function _decode_object
is generally forgiving
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you prefer an early return (no attempt to create obj at all) for missing keys? Currently the mismatched keys are logged if non-critical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, as as you added in comment below, just need to keep backwards compatibility.
Hi @diogomart When we're creating obj from JSON: Do you prefer an early return (no attempt to create obj at all) for missing keys? Currently the mismatched keys are logged if non-critical. The use of |
At present, I use (1) The keys are optional for lenient parsing. (2) The keys correspond to a computed attribute, not an initializer attribute. Thus, the keyword is optional for the |
meeko/utils/jsonutils.py
Outdated
|
||
@classmethod | ||
def json_encoder(cls, obj): | ||
raise NotImplementedError("Subclasses must implement json_encoder.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abstractmethod
(happens to be a github username..) is better here. If we are doing a revision I will change it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good!
|
in the corresponding _decode_object functions
include name of the class in the logging
currently only use for auto_bount, or when necessary for parsing templates
via access_with_deprecated_key
Changes since 681a795
old key: To address this, I introduced a new class function,
Sorry there were two changes I forgot to add in 8273d5f. @diogomart This is ready, please let me know if it works! Thanks for your time and kind advice in advance |
I think the try/except statements in from meeko.molsetup import Atom
a = Atom(42)
b = Atom.from_json(a.to_json()) If I remove all try/except and just execute the code, I get to know that the error occured here:
but the try/except wrap the error traceback and it's harder to find the problem. Is there an advantage in the try/except? |
Hi @diogomart Can you show which try/except wraps the traceback? It's possible to get the traceback while still using try/except. There's a nested try/except to distinguish different types of errors (see comments in code). But i'm not sure if this is the best approach to handle this. Can you suggest an alternate approach? |
We discussed a similar issue before, personally I don't have the habit of printing traceback (so i forgot to take care of this, I should though..), but I recall that we agreed on doing something like this, for the PolymerCreationError: Lines 484 to 495 in 07a38f8
Would you like the same handling, or do you prefer just not using try/except? To trace error raised from the decoding/encoding of a nested object it might need a little more work, but I could try |
I would just remove the try/except in from_json() in jsonutils.py, then it looks like this (about half the size) @classmethod
def from_json(cls, json_string: str):
# Log mismatched keys if non-critical
obj_dict = json.loads(json_string)
actual_keys = obj_dict.keys()
if actual_keys != cls.expected_json_keys:
missing_keys = cls.expected_json_keys - actual_keys
extra_keys = actual_keys - cls.expected_json_keys
logging.warning(
f"Key mismatch for {cls.__name__}. "
f"Missing keys: {missing_keys}, Extra keys: {extra_keys}."
)
obj = json.loads(json_string, object_hook=cls.json_decoder) |
Yeah we could re-raise the error, but I think that's just more boilerplate code for the same kind of traceback. For the polymer construction we are catching more errors to provide more elaborate messages to the user. And to report all failures (to match templates) at once, then the user can act on all of them instead of having to run mk_prepare_receptor.py over and over again. But that is not a feature we need here. |
We can just remove try/except, but it will not be able to distinguish different types of error
(I might have written a request in an open issue for this, but couldn't find it at the moment.. maybe I'm the only one who wants this, so I implemented it) Try/except is always over-handling but I think it helps to know the general type of problem before exposing the traceback. I'm ok with no try/except as it requires minimal changes. But I can also do more work to print the traceback, if that's helpful. |
I think all json-bound errors should be raised at from_json (because errors from lower level or nested object are deferred here). But let me think about it a bit more. The current state of this PR is also not the best way to handle this |
Yeah, if you want the messages from the try/except keep the current try/except code, but add the detailed traceback by re-raising the error (or something similar) so we can more easily debug things in the future. In the example I posted above, the line we really need to know about is
we must still expose this detailed traceback. |
The simplest way might be to re-reaise, by adding, for example, except ValueError as decoder_error:
raise RuntimeError(
f"An error occurred when creating {cls.__name__} from JSON."
f"Error: {decoder_error}"
) from decoder_error |
@rwxayheee Nevermind... the traceback is already printed in the PR as is, I missed while testing and got confused. Please ignore this last few messages, sorry. |
When trying to decode with a missing attribute we get an error as expected: from meeko.molsetup import Bond
import json
bond = Bond(42, 43, rotatable=True)
json_bond = bond.to_json()
dict_bond = json.loads(json_bond)
dict_bond.pop("rotatable")
try:
decoded_bond = Bond.from_json(json.dumps(dict_bond))
print("Successfully decoded Bond without rotatable attribute")
except Exception as err:
print("Failed at decoding Bond without rotatable attribute")
print(err) We popped the "rotatable" attribute out of the JSON, so we get an error. All good here. However, if we do the same within a molecule setup, then we get no error: from meeko import RDKitMoleculeSetup
from meeko import MoleculePreparation
from scrubber import Scrub
from rdkit import Chem
import json
mk_prep = MoleculePreparation()
scrub = Scrub()
butanol = Chem.MolFromSmiles("OCCC")
butanol = scrub(butanol)[0]
molsetup = mk_prep(butanol)[0]
for bond_id, bond in molsetup.bond_info.items():
print(f"{bond_id=} {type(bond).__name__}")
data = json.loads(molsetup.to_json())
for bond_id, bond in data["bond_info"].items():
bond.pop("rotatable")
decoded_molsetup = RDKitMoleculeSetup.from_json(json.dumps(data))
print("Bonds in decoded molsetup are dict, not type Bond")
for bond_id, bond in decoded_molsetup.bond_info.items():
print(f"{bond_id=} {type(bond).__name__}") This prints
We should still be getting an error.
|
Hi @diogomart One last note on this is that this PR doesn't include #254. After we merge this, I can resubmit another PR in place of #254. |
I already merged this PR into 254, no worries about that one. But could you take a look at why removing an attribute from a JSON string and passing it to the from_json constructor of the respective class raises an error, for example removing the |
@diogomart I get what you mean now, sorry I misunderstood the comment. |
Hi @diogomart
It relies on directly accessing a keyword from dict, not This is the minimal fix (without implementing more sophisticated error passing functions) I can think of. I will stay with the current usage of |
Just one more note - I feel it would be helpful to have some function for each class to validate if the nested objects are the correct type. The type-validation function is not strictly JSON-bound, could be useful in a more general way. Happy to work on it in a new issue or PR! |
The real problem might be.... Python itself with dynamic typing. |
Thanks for the approval! Merging to develop |
This PR introduces a base class for many other existing classes that need the JSON-interchange functionalities. The common, highe-level functions (from_json, to_json) are inherited from the base class. Whereas the specific encoding-decoding logics are defined within the individual subclass scope.
The major objective is to reduce code duplication, improve error handling, and make future maintenance easier. This may or may not be an over-engineering of simple things that are already working, but I thought it would be really useful when I opened #279. I hope this can also help with the consistency issue in #288.
Introduce
BaseJSONParsable
at 5ec1bf7This working checkpoint shows the basic layout of class
BaseJSONParsable
. Note thatRDKitMoleculeSetup
is still a subclass ofMoleculeSetup
. For subclasses with additional inheritance relationships like this, the decoder/encoder of the parent may be used in the child class.ResidueChemTemplates: Harmonize from_json and from (data) json at 1cd7a77
This working checkpoint shows how the parameter data JSON string and the standard serialized JSON of ResidueChemTemplates may be parsed in the same way.
MoleculePreparation: harmonize from_config and from (deserialized json object) dict at 252bc95
For fix from_json methods, add from_json_file and from_dict methods #288 (comment)
Revisit optional_keys
Future changes like add cycle_break attribute to molsetup Bond #254 should be possible by changes made to init and the JSON-interchange region in the subclass.