Add support for Overturemap parquet files #849

Closed
bchapuis opened this issue Apr 16, 2024 · 10 comments

Comments

@bchapuis
Member

https://github.com/OvertureMaps/data

@bchapuis
Member Author

bchapuis commented May 1, 2024

@sebr72 as discussed, I'm not really satisfied with my current experiment in the overturemap branch. The GeoParquet format contains semi-structured data, which requires some changes in the DataTable abstraction. It also requires a deep understanding of the GeoParquet format.

One avenue (probably the best) could be to use the parser available in Sedona (the project is written in Scala):
https://sedona.apache.org/latest-snapshot/tutorial/sql/#__tabbed_9_2

Another avenue could be to build upon my throw-away overturemaps branch, but I'm not sure about the effort needed to have something robust.

In both cases, adding Parquet or Sedona will pull in a lot of new dependencies (Hadoop, Spark).

@bchapuis
Member Author

bchapuis commented May 1, 2024

@sebr72 There may also be a third option, which is to rely on Parquet support in PostgreSQL. I have no experience with this extension.
https://github.com/adjust/parquet_fdw

@sebr72
Contributor

sebr72 commented May 15, 2024

@bchapuis I had a look at Sedona and would highlight the following:

  1. It is a large project that mainly relies on Spark or Flink (large projects themselves).
  2. The Java examples are built around Flink, which is a lot faster than Spark, but Flink is not directly linked to GeoParquet.
  3. The GeoParquet implementation is in Scala and geared around Spark.
  4. Combining Spark, Sedona, and Scala with Baremaps would make for an "expensive" integration just to get GeoParquet.

Next, I am going to have a look at:
https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet

@bchapuis
Member Author

Yes, I think the suggestion of @Drabble to look into Drill is a good idea. We can probably either use it or draw inspiration from it for our own implementation.

@bchapuis
Member Author

bchapuis commented May 30, 2024

@sebr72 @Drabble I will merge the current PR and organize the git history to have three separate commits with our individual contributions. For the following tasks, I suggest we make individual PRs and split the work more clearly.

  • Clean up the Sonar problems where it makes sense (@sebr72).
  • Add some unit tests to the GeoParquetReader (@sebr72).
  • Define a better organization for the packages and classes of the geoparquet module (data, hadoop, etc.).
  • Improve the allocation of objects in the GeoParquet reader (list for each field, wrapper for each value, etc.).
  • The high-level abstraction (DataTable, DataColumn, etc.) needs support for nested data structures such as groups in GeoParquet and JSON in other data formats (Improve naming in data frame abstraction #857, Add support for nested types, geoparquet groups, and postgres jsonb in data table #860).
  • Improve the abstraction of the GeoParquetGroup (the current version uses internal classes, and the getters/setters were quickly defined to have an end-to-end example, etc.), after doing a pass on the DataTable abstraction.
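
To illustrate the nested-data point in the list above, here is a minimal Python sketch (not Baremaps code) of one possible interim approach: serializing nested values to JSON text so that a flat table can still carry them, for example into a jsonb column. The function name `to_flat_row` and the sample field names are hypothetical and only serve the illustration.

```python
import json


def to_flat_row(row):
    """Return a copy of `row` where nested values become JSON text.

    Assumption: a row is a plain dict and nested groups appear as
    dicts or lists. Scalars are passed through unchanged.
    """
    flat = {}
    for name, value in row.items():
        if isinstance(value, (dict, list)):
            # sort_keys keeps the output stable across runs
            flat[name] = json.dumps(value, sort_keys=True)
        else:
            flat[name] = value
    return flat


# Hypothetical row with one nested group
example = to_flat_row({"id": "a1", "names": {"primary": "Lausanne"}})
```

This loses the type information of the nested group, which is one reason first-class nested types in the table abstraction may still be preferable.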

@bchapuis
Member Author

@sebr72 @Drabble I merged the changes and we can now continue with individual PRs.

@Drabble
Contributor

Drabble commented Jun 1, 2024

@bchapuis Great job on the pull request! I will look at your new one for nested groups.

I would be really interested in making an example to go from Overture data on S3 to serving MVT to a Maputnik frontend.

I think this would mean:

1. Fix the code to be able to use an S3 URL directly, e.g. s3a://overturemaps-us-west-2/release/2024-05-16-beta.0/theme=admins/type=/
2. Use the GeoParquetDataTable to write Overture data into PostgreSQL, using a ProjectionTransformer to go from EPSG:4326 to EPSG:3857
3. Create a geospatial index for the geometry column
4. Create a materialised view to group the columns into a TAGS jsonb field, possibly with simplified geometries for different zoom levels
5. Make a simple style.json and tileset.json to serve the data

What do you think?
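
As a rough illustration of the reprojection in step 2, the sketch below applies the standard spherical Web Mercator formulas to go from EPSG:4326 (longitude/latitude in degrees) to EPSG:3857 (meters). It is a standalone Python sketch, not the Baremaps ProjectionTransformer; the function and constant names are hypothetical.

```python
import math

# Sphere radius used by EPSG:3857 (equal to the WGS 84 semi-major axis), in meters
EARTH_RADIUS_M = 6378137.0


def to_web_mercator(lon_deg, lat_deg):
    """Project a lon/lat pair in degrees (EPSG:4326) to x/y in meters (EPSG:3857)."""
    if not -90.0 < lat_deg < 90.0:
        # The projection is undefined at the poles
        raise ValueError("latitude must be strictly between -90 and 90 degrees")
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


x, y = to_web_mercator(6.63, 46.52)  # a point near Lausanne
```

In practice the transformation would apply to every coordinate of each geometry, and tiles are usually limited to latitudes within roughly ±85.05 degrees.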

@bchapuis
Member Author

bchapuis commented Jun 1, 2024

Yes, the plan sounds good and can probably be addressed with multiple PRs. Maybe we can skip step 4 or use views instead of materialized views. As the daylight distribution will soon be deprecated and replaced by Overture Maps, one idea could be to copy the daylight directory and use it as a basis.
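
To make the view versus materialized view choice concrete, here is a small Python sketch that builds the PostgreSQL statements for the index and the tags view. The table name, column names, and helper names are all hypothetical, and identifiers are interpolated without escaping, so this is only suitable as an illustration.

```python
def index_sql(table, geometry_column="geom"):
    """Build a GiST index statement for the geometry column (PostGIS assumed)."""
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_{geometry_column}_idx "
        f"ON {table} USING GIST ({geometry_column});"
    )


def tags_view_sql(table, tag_columns, materialized=False,
                  id_column="id", geometry_column="geom"):
    """Build a view that groups the given columns into a `tags` jsonb field.

    With materialized=False the view is computed at query time; with
    materialized=True the result is stored and must be refreshed.
    """
    pairs = ", ".join(f"'{column}', {column}" for column in tag_columns)
    kind = "MATERIALIZED VIEW" if materialized else "VIEW"
    return (
        f"CREATE {kind} {table}_view AS "
        f"SELECT {id_column}, jsonb_build_object({pairs}) AS tags, {geometry_column} "
        f"FROM {table};"
    )


# Hypothetical table and columns
statements = [
    index_sql("overture_admins"),
    tags_view_sql("overture_admins", ["name", "admin_level"]),
]
```

A plain view keeps the import simple and always reflects the table, while a materialized view trades storage and a refresh step for faster tile queries.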

@Drabble
Contributor

Drabble commented Jun 14, 2024

We have basic support for Overture Maps now. Should we consider this issue closed and raise more issues for further improvements to the Overture Maps library?

@fgravin
Contributor

fgravin commented Jun 14, 2024

Congratz guys 🎉

4 participants