Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve serialization memory footprint #176

Open
rmanoka opened this issue Jan 17, 2022 · 8 comments
Open

Improve serialization memory footprint #176

rmanoka opened this issue Jan 17, 2022 · 8 comments

Comments

@rmanoka
Copy link
Contributor

rmanoka commented Jan 17, 2022

Currently, we convert Geojson to a Map<String, serde_json::Value> to serialize. This seems to be quite wasteful with memory (and probably also compute-time). For instance, when serializing a large set of geoms, the serialization alone takes up more than 10x the memory required to store the final representation.

I suggest we move out of this approach, in favour of a direct serialization via derive macros / custom impl.. From a quick skim, it looks like only Value requires a custom impl, and the rest can be done via serde_derive (thanks to the flexibility available therein).

Once we move to this, we could also get rid of the JsonObject and all related TryFrom / From impls. Note that serde_json already provides {to,from}_value that can convert any serializable type to/from JsonValue. Is there any additional value (no pun intended) in keeping the impls.?

One minor concern: the docs suggest that serde is optional, but to the best of my understanding, it is a very crucial part of the crate. Am I missing something?

(Issue summarized from discord discussion.)

@urschrei
Copy link
Member

I can't picture how this looks, but this sounds like a solid approach and I'm happy to help once I know what to do (and pending the resolution of the serde question, also mentioned in #172

@rmanoka
Copy link
Contributor Author

rmanoka commented Jan 19, 2022

@urschrei The idea is to use the serde_derive macros and adjust the #[serde()] attributes on the data-structures to align the output with the specs.

So, for instance, the main enum GeoJson would now use the internally-tagged repr.:

#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum GeoJson {
    Geometry(Geometry),
    Feature(Feature),
    FeatureCollection(FeatureCollection),
}

The FeatureCollection would look something like (haven't checked foreign members and bbox yet):

#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct FeatureCollection {
    /// Bounding Box
    ///
    /// [GeoJSON Format Specification § 5](https://tools.ietf.org/html/rfc7946#section-5)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub bbox: Option<Bbox>,
    pub features: Vec<Feature>,
    /// Foreign Members
    ///
    /// [GeoJSON Format Specification § 6](https://tools.ietf.org/html/rfc7946#section-6)
    #[serde(skip_serializing_if = "Option::is_none", flatten)]
    pub foreign_members: Option<JsonObject>,
}

These two already fix the memory issues (which is mainly caused by large FeatureCollections)

@rmanoka
Copy link
Contributor Author

rmanoka commented Jan 21, 2022

So, error handling of serde_json (and may be in general in serde) is a bit inflexible than I hoped. Custom errors are reported in impl Serialize, Deserialize only via conversion to a string (see here). This means, we lose the ability to keep our Error enum for semantic errors (like Feature is missing "type", etc.); we can only pass the error message as a string.

I had hoped it would allow Box<dyn Something> that can be down-casted back by the user to our error enum. This means, that we have to choose between our error enum, and a memory-efficient {de-,}serializer. :(

The serde_json issue on memory usage also points to a memory-efficient Value alternative: ijson. We could just impl. conversions from and into this type so users can opt for this as an alternative. The crate does use some pretty crazy unsafe-code to make the type exactly pointer sized. It also does mean that we have to repeat all our conversion code twice unfortunately.

@bjornharrtell
Copy link
Contributor

@rmanoka are you sure about that? I interpret https://serde.rs/error-handling.html as a "data format" can define it's own Error.

@rmanoka
Copy link
Contributor Author

rmanoka commented Mar 20, 2022

The data format can indeed define their own error type; here data format we intend to use is serde_json (via the general serde traits). However, what we want is to somehow use our existing error struct (i.e. geojson::Error), through the serde framework; this doesn't seem possible without conversion into strings and back.

@bjornharrtell
Copy link
Contributor

bjornharrtell commented Mar 22, 2022

@rmanoka I think I understand, so I guess the more "optimal" way would be to implement a more serde native serde_geojson. But that is probably another project completely. However, it might be an excuse that more rudimentary error handling could be acceptable with the current approach.

@urschrei
Copy link
Member

However, it might be an excuse that more rudimentary error handling could be acceptable with the current approach.

I'm inclined to agree – in the case of GeoJSON handling, an order-of-magnitude increase is worth the cost of less informative error messages IMO.

@michaelkirk
Copy link
Member

The existing API's serializing a FeatureCollection directly is still expensive serde_json::to_string(FeatureCollection), but there are now alternatives like FeatureWriter and FeatureReader which should work for many use cases, while consuming much less memory.

see #199, #205, and #206

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants