Improve serialization memory footprint #176

rmanoka · 2022-01-17T17:48:46Z

Currently, we convert Geojson to a Map<String, serde_json::Value> to serialize. This seems to be quite wasteful with memory (and probably also compute-time). For instance, when serializing a large set of geoms, the serialization alone takes up more than 10x the memory required to store the final representation.

I suggest we move out of this approach, in favour of a direct serialization via derive macros / custom impl.. From a quick skim, it looks like only Value requires a custom impl, and the rest can be done via serde_derive (thanks to the flexibility available therein).

Once we move to this, we could also get rid of the JsonObject and all related TryFrom / From impls. Note that serde_json already provides {to,from}_value that can convert any serializable type to/from JsonValue. Is there any additional value (no pun intended) in keeping the impls.?

One minor concern: the docs suggest that serde is optional, but to the best of my understanding, it is a very crucial part of the crate. Am I missing something?

(Issue summarized from discord discussion.)

The text was updated successfully, but these errors were encountered:

urschrei · 2022-01-18T15:36:07Z

I can't picture how this looks, but this sounds like a solid approach and I'm happy to help once I know what to do (and pending the resolution of the serde question, also mentioned in #172

rmanoka · 2022-01-19T07:43:17Z

@urschrei The idea is to use the serde_derive macros and adjust the #[serde()] attributes on the data-structures to align the output with the specs.

So, for instance, the main enum GeoJson would now use the internally-tagged repr.:

#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum GeoJson {
    Geometry(Geometry),
    Feature(Feature),
    FeatureCollection(FeatureCollection),
}

The FeatureCollection would look something like (haven't checked foreign members and bbox yet):

#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct FeatureCollection {
    /// Bounding Box
    ///
    /// [GeoJSON Format Specification § 5](https://tools.ietf.org/html/rfc7946#section-5)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub bbox: Option<Bbox>,
    pub features: Vec<Feature>,
    /// Foreign Members
    ///
    /// [GeoJSON Format Specification § 6](https://tools.ietf.org/html/rfc7946#section-6)
    #[serde(skip_serializing_if = "Option::is_none", flatten)]
    pub foreign_members: Option<JsonObject>,
}

These two already fix the memory issues (which is mainly caused by large FeatureCollections)

rmanoka · 2022-01-21T17:53:40Z

So, error handling of serde_json (and may be in general in serde) is a bit inflexible than I hoped. Custom errors are reported in impl Serialize, Deserialize only via conversion to a string (see here). This means, we lose the ability to keep our Error enum for semantic errors (like Feature is missing "type", etc.); we can only pass the error message as a string.

I had hoped it would allow Box<dyn Something> that can be down-casted back by the user to our error enum. This means, that we have to choose between our error enum, and a memory-efficient {de-,}serializer. :(

The serde_json issue on memory usage also points to a memory-efficient Value alternative: ijson. We could just impl. conversions from and into this type so users can opt for this as an alternative. The crate does use some pretty crazy unsafe-code to make the type exactly pointer sized. It also does mean that we have to repeat all our conversion code twice unfortunately.

bjornharrtell · 2022-03-19T22:13:27Z

@rmanoka are you sure about that? I interpret https://serde.rs/error-handling.html as a "data format" can define it's own Error.

rmanoka · 2022-03-20T13:06:35Z

The data format can indeed define their own error type; here data format we intend to use is serde_json (via the general serde traits). However, what we want is to somehow use our existing error struct (i.e. geojson::Error), through the serde framework; this doesn't seem possible without conversion into strings and back.

bjornharrtell · 2022-03-22T17:35:34Z

@rmanoka I think I understand, so I guess the more "optimal" way would be to implement a more serde native serde_geojson. But that is probably another project completely. However, it might be an excuse that more rudimentary error handling could be acceptable with the current approach.

urschrei · 2022-03-22T17:48:12Z

However, it might be an excuse that more rudimentary error handling could be acceptable with the current approach.

I'm inclined to agree – in the case of GeoJSON handling, an order-of-magnitude increase is worth the cost of less informative error messages IMO.

michaelkirk · 2022-09-02T22:05:06Z

The existing API's serializing a FeatureCollection directly is still expensive serde_json::to_string(FeatureCollection), but there are now alternatives like FeatureWriter and FeatureReader which should work for many use cases, while consuming much less memory.

see #199, #205, and #206

urschrei mentioned this issue Jul 4, 2022

Reduce size of Error enum #197

Open

urschrei mentioned this issue Aug 20, 2022

InvalidGeometryConversion should have "expected" vs. "found" types. #201

Closed

rmanoka mentioned this issue Aug 25, 2022

GeoJSON to/from custom struct using serde #199

Merged

2 tasks

urschrei mentioned this issue Aug 12, 2024

Release 1.0 #244

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve serialization memory footprint #176

Improve serialization memory footprint #176

rmanoka commented Jan 17, 2022

urschrei commented Jan 18, 2022

rmanoka commented Jan 19, 2022

rmanoka commented Jan 21, 2022

bjornharrtell commented Mar 19, 2022

rmanoka commented Mar 20, 2022

bjornharrtell commented Mar 22, 2022 •

edited

Loading

urschrei commented Mar 22, 2022

michaelkirk commented Sep 2, 2022

Improve serialization memory footprint #176

Improve serialization memory footprint #176

Comments

rmanoka commented Jan 17, 2022

urschrei commented Jan 18, 2022

rmanoka commented Jan 19, 2022

rmanoka commented Jan 21, 2022

bjornharrtell commented Mar 19, 2022

rmanoka commented Mar 20, 2022

bjornharrtell commented Mar 22, 2022 • edited Loading

urschrei commented Mar 22, 2022

michaelkirk commented Sep 2, 2022

bjornharrtell commented Mar 22, 2022 •

edited

Loading