Serialization for `System` #673

frostedoyster · 2024-07-04T09:27:38Z

Implements serialization for `metatensor.torch.atomistic.System`.

Contributor (creator of pull-request) checklist

Tests updated (for new features and bugfixes)?
Documentation updated (for new features)?
Issue referenced (for PRs that solve an issue)?

Reviewer checklist

CHANGELOG updated with public API or any other important changes?


frostedoyster · 2024-07-04T09:29:35Z

@Luthaf is this what you were thinking of?
If not, does torch provide a way to convert a Tensor to JSON and back?

Luthaf · 2024-07-04T09:53:04Z

I would rather not use JSON for this, or at the very least only use JSON for the container and use a binary serialization format for the data (i.e. the torch Tensor and the TensorBlock inside the system).

We should be able to access whatever format torch uses internally to save a Tensor, and serialize the metatensor data to an in-memory buffer.
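A minimal sketch of the torch half of that idea, assuming `torch.save` and `torch.load` on an in-memory `io.BytesIO` buffer are an acceptable way to reach torch's own format. The helper names `tensor_to_bytes` and `tensor_from_bytes` are hypothetical and are not part of this PR:

```python
import io

import torch


def tensor_to_bytes(tensor: torch.Tensor) -> bytes:
    """Serialize a tensor to bytes using torch's own on-disk format."""
    buffer = io.BytesIO()
    torch.save(tensor, buffer)
    return buffer.getvalue()


def tensor_from_bytes(data: bytes) -> torch.Tensor:
    """Rebuild a tensor from bytes produced by tensor_to_bytes."""
    return torch.load(io.BytesIO(data))


# Illustrative positions for a two-atom system (made-up values)
positions = torch.tensor(
    [[0.0, 0.0, 0.0], [0.0, 0.0, 1.2]], dtype=torch.float64
)
payload = tensor_to_bytes(positions)
restored = tensor_from_bytes(payload)
```

The resulting `payload` is an opaque byte string, which is what would have to be stored inside whatever container format is chosen.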

For the container format, JSON is fine, but not ideal. The overall layout might look like:

{
    "positions": "VERY LONG BINARY STRING ...6251690352651719",
    "cell": "VERY LONG BINARY STRING ...6251690352651719",
    "types": "VERY LONG BINARY STRING ...6251690352651719",
    "neighbors": [
        {"options": "JSON_OF_OPTIONS", "values": "VERY LONG BINARY STRING ...6251690352651719"},
        {"options": "JSON_OF_OPTIONS", "values": "VERY LONG BINARY STRING ...6251690352651719"}
    ],
    "data": {
        "name-1": "VERY LONG BINARY STRING ...6251690352651719",
        "name-2": "VERY LONG BINARY STRING ...6251690352651719"
    }
}

The main problem then is that storing a binary string in JSON might require transforming the string to escape some of its bytes.
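One common way to handle this, shown here only as a sketch, is to base64-encode the binary payload before placing it in the JSON container. Base64 is an assumption for illustration and is not something this PR settled on; the `encode_binary` and `decode_binary` helpers are hypothetical, and the packed floats stand in for a real serialized tensor:

```python
import base64
import json
import struct


def encode_binary(data: bytes) -> str:
    """Turn arbitrary bytes into an ASCII string that is safe inside JSON."""
    return base64.b64encode(data).decode("ascii")


def decode_binary(text: str) -> bytes:
    """Invert encode_binary."""
    return base64.b64decode(text.encode("ascii"))


# Stand-in for a serialized tensor: three little-endian float64 values
raw_positions = struct.pack("<3d", 0.0, 0.0, 1.2)

# JSON container holding the binary payload as base64 text
container = {"positions": encode_binary(raw_positions)}
document = json.dumps(container)

# Round trip: parse the JSON and recover the original bytes
recovered = decode_binary(json.loads(document)["positions"])
```

The cost of this approach is size: base64 emits four characters for every three bytes of input, so the encoded data grows by about one third.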

An alternative would be to go for UBJSON, which should be supported by our JSON library. We would need to make sure the serialization uses a proper binary format to store our binary data (the library seems to use a list of integers in this case). The main drawback here is that we would now have another, different serialization format.

frostedoyster · 2024-07-04T10:19:47Z

Hmm ok, in that case I think it's better if you do it because it looks complicated to me.

Another possibility is to merge something like this for now and change it later. The main use of this would be loading a large dataset from disk. Although this approach is potentially slow, I'm hoping that it will still be faster than the model being trained, since data loading can happen asynchronously and in parallel (see `num_workers` and related options for the torch `DataLoader`).

Serialization for System (without neighborlist)

f41f4e3

frostedoyster requested a review from Luthaf July 4, 2024 09:30

Include neighbor lists

2b688a6

frostedoyster force-pushed the system-serialize branch from 9bc3009 to 2b688a6 Compare July 4, 2024 12:26

frostedoyster added 2 commits July 9, 2024 19:26

Merge branch 'master' into system-serialize

57d123c

Merge branch 'master' into system-serialize

876812e
