Allow passing a metadata folder to IcebergTableProviderFactory #840

Open
gruuya opened this issue Dec 24, 2024 · 2 comments

@gruuya
Contributor

gruuya commented Dec 24, 2024

Currently, when using DataFusion's TableProviderFactory mechanism, one needs to specify the exact full path to the metadata file as the location, e.g.

create external table inventory
stored as iceberg
location 's3://iceberg/public/inventory/metadata/00001-97ea515a-2d2f-465d-8c74-8daec5ab0023.metadata.json';

I think it would be nice if IcebergTableProviderFactory also supported being pointed to a metadata directory:

create external table inventory
stored as iceberg
location 's3://iceberg/public/inventory/metadata';

This would imply listing the directory, picking the latest metadata file (e.g. by the version V in file names like <V>-<random-uuid>.metadata.json, and maybe the legacy v<V>.metadata.json), and using it to build the table, since that is likely the overwhelmingly common use case. It would improve the flexibility and ergonomics of the integration, e.g. by making quick prototyping much simpler.
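
To make the "pick the latest metadata file" step concrete, here is a rough sketch of how the version could be extracted from the two file name patterns above and used to choose a file from a directory listing. The function names `parse_metadata_version` and `latest_metadata_file` are hypothetical and not part of iceberg-rust, and the listing itself (e.g. via an object store client) is assumed to have happened already. It only looks at file names and does not validate file contents.

```rust
/// Hypothetical helper: parses a non-empty, all-digit string into a version.
fn parse_digits(s: &str) -> Option<u64> {
    if s.is_empty() || !s.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    s.parse().ok()
}

/// Hypothetical helper: extracts the version V from a metadata file name or
/// full path, accepting `<V>-<random-uuid>.metadata.json` and the legacy
/// `v<V>.metadata.json`. Returns `None` for anything else.
fn parse_metadata_version(path: &str) -> Option<u64> {
    // Only the last path component matters, so full locations work too.
    let file_name = path.rsplit('/').next()?;
    let stem = file_name.strip_suffix(".metadata.json")?;

    // Legacy scheme: v<V>.metadata.json
    if let Some(version) = stem.strip_prefix('v') {
        return parse_digits(version);
    }

    // Current scheme: <V>-<random-uuid>.metadata.json
    let (version, uuid) = stem.split_once('-')?;
    if uuid.is_empty() {
        return None;
    }
    parse_digits(version)
}

/// Hypothetical helper: returns the entry with the highest parsed version,
/// ignoring entries that are not metadata files.
fn latest_metadata_file<'a, I>(paths: I) -> Option<&'a str>
where
    I: IntoIterator<Item = &'a str>,
{
    paths
        .into_iter()
        .filter_map(|path| parse_metadata_version(path).map(|version| (version, path)))
        .max_by_key(|(version, _)| *version)
        .map(|(_, path)| path)
}
```

Given a listing of the metadata directory, `latest_metadata_file` would return the path to hand to the existing code path that already accepts a full metadata file location. If two files parse to the same version, this sketch simply keeps the last one seen; a real implementation would need to decide how to handle that.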

@Xuanwo
Member

Xuanwo commented Dec 24, 2024

Hi, I have a strong feeling that we need to add catalog support for datafusion. Any ideas?

@gruuya
Contributor Author

gruuya commented Dec 24, 2024

> we need to add catalog support for datafusion

Hmm, I think that is adequately supported already, no?

impl CatalogProvider for IcebergCatalogProvider {

impl SchemaProvider for IcebergSchemaProvider {

To be clear, I think the DF CREATE EXTERNAL TABLE construct (formalized via TableProviderFactory) is at most loosely coupled to a catalog. In fact, the typical use case is just registering a pre-existing table in an ad-hoc manner for some read-only queries. To me this is analogous to FDWs (foreign data wrappers) in Postgres.

While it is possible to wire up the write path for those, it would require implementing TableProvider::insert_into in

impl TableProvider for IcebergTableProvider {
(but I also think this is orthogonal to the ask here).

Contrast that with a regular CREATE TABLE construct, which would correspond to full coupling with a catalog and target tables native to the given system. The full life cycle of the table would then need to be tracked, but that is also beyond the scope of this issue.

At least that is how we perceive and use them; I would be curious to hear other interpretations.
