Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialization for System #673

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft

Serialization for System #673

wants to merge 4 commits into from

Conversation

frostedoyster
Copy link
Contributor

@frostedoyster frostedoyster commented Jul 4, 2024

Implements serialization for metatensor.torch.atomistic.System

Contributor (creator of pull-request) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

Reviewer checklist

  • CHANGELOG updated with public API or any other important changes?

馃摎 Download documentation preview for this pull-request

@frostedoyster
Copy link
Contributor Author

@Luthaf is this what you were thinking of?
If not, does torch provide something like Tensor to json and back?

@frostedoyster frostedoyster requested a review from Luthaf July 4, 2024 09:30
@Luthaf
Copy link
Contributor

Luthaf commented Jul 4, 2024

I would rather not use JSON for this, or at the very least only use JSON for the container and use a binary serialization format for the data (i.e. the torch Tensor and the TensorBlock inside the system).

We should be able to access whatever format torch is using internally to save Tensor, and use serialization to a buffer to save metatensor data.

For the container format, JSON is fine, but not ideal. The overall thing might look like

{
    "positions": "VERY LONG BINARY STRING ...6251690352651719",
    "cell": "VERY LONG BINARY STRING ...6251690352651719",
    "types": "VERY LONG BINARY STRING ...6251690352651719",
    "neighbors": [
        {"options": "JSON_OF_OPTIONS", "values": "VERY LONG BINARY STRING ...6251690352651719"},
        {"options": "JSON_OF_OPTIONS", "values": "VERY LONG BINARY STRING ...6251690352651719"}
    ],
    "data": {
        "name-1": "VERY LONG BINARY STRING ...6251690352651719",
        "name-2": "VERY LONG BINARY STRING ...6251690352651719",
    }
}

The main problem then is that encoding a binary string to JSON might require some tranformations of the string to escape some of them.

An alternative would be to go for ubjson, which should be supported by our JSON library. We might need to be careful to make sure the serialization uses a proper binary format to store our binary data (they seem to use list of integers in this case). The main drawback here is that we would now have another, different serialization format.

@frostedoyster
Copy link
Contributor Author

Hmm ok, in that case I think it's better if you do it because it looks complicated to me.

Another possibility is to merge something like this for now and change it later. The main use of this would be loading from disk for a large dataset and, although this approach is potentially slow, I'm hoping that it will still be faster than the model that is being trained (data loading can happen asynchronously and in parallel, see num_workers and related for the torch Dataloader)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants