py-io-impl config propagation #1589

Open

bigluck opened this issue Jan 29, 2025 · 2 comments

bigluck (Contributor) commented Jan 29, 2025

Please describe the bug 🐞

Hey team! 👋 Hope you're doing well!
I've been working with PyIceberg and ran into an interesting situation regarding FileIO implementation configurations.

TL;DR: it seems like PyIceberg does not allow you to locally override some of the configurations returned by the remote Iceberg REST API, and I would like to understand whether this is intended and what the correct way to override these configurations is.

Context

  • The REST API server (/config endpoint) is configured to use PyArrow as the default FileIO implementation
  • The table endpoint returns a configuration specifying a different FileIO implementation (fsspec)
  • I want to force PyArrow locally since I don't have fsspec/s3fs installed (and prefer using PyArrow)

Starting from PR 9868, the Nessie Iceberg endpoint always returns config.py-io-impl=pyiceberg.io.fsspec.FsspecFileIO when querying a table endpoint. While there might be a separate issue in Nessie's implementation (as my server is configured to override the py-io-impl with pyarrow.PyArrowFileIO), I believe there's also a concern in PyIceberg's handling of configuration priorities.


Current Behavior

I've created a test case that explores different configuration scenarios. Here's what I'm observing:

```python
catalog = load_catalog("docs", **{"uri": "https://a.b.c.d/iceberg", "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"})
table = catalog.load_table("my_namespace.my_table")
```

similarly to what is described in the official documentation:

Then load the `prod` catalog:
```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    }
)
```

The table configuration from the REST API seems to take precedence over both:

  • The server's default configuration
  • Local overrides passed during catalog initialization

This leads to failures when the table endpoint specifies fsspec.FsspecFileIO but s3fs isn't available locally:

```
ValueError: Could not initialize FileIO: pyiceberg.io.fsspec.FsspecFileIO
```
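
For reference, this error typically comes from the dynamic import of the configured FileIO class. Here is a minimal sketch of that resolution, with an illustrative function name (not PyIceberg's actual loader), showing why a missing fsspec/s3fs dependency surfaces as the ValueError above:

```python
import importlib

def load_io_class(py_io_impl: str):
    # Illustrative sketch: split "pyiceberg.io.fsspec.FsspecFileIO" into a
    # module path and a class name, then import the module dynamically.
    module_name, class_name = py_io_impl.rsplit(".", 1)
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # If fsspec/s3fs are not installed locally, the import fails here,
        # and the client reports "Could not initialize FileIO".
        raise ValueError(f"Could not initialize FileIO: {py_io_impl}") from exc
    return getattr(module, class_name)
```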

Real-world Impact

This configuration priority issue creates practical problems in multi-system setups. Consider this scenario:

  • System A uses fsspec for writing tables into Nessie/Iceberg
  • System B needs to read the same tables using PyArrow

With the current implementation, System B can never successfully read the tables because:

  • The server forces the client to use fsspec
  • This happens even when the client explicitly requests PyArrow
  • There's no way to override this behavior at the client level

Question

Is there a way for PyIceberg to use a specific FileIO implementation regardless of what the table endpoint or the server returns?
This would be particularly useful in scenarios where:

  • The client environment is set up for a specific implementation
  • Different FileIO implementations might be more efficient in certain environments
  • Required dependencies for the server-specified implementation aren't available locally

I've attached a test file that demonstrates the behavior.

Would love to hear your thoughts on this! Is this the intended behavior? If so, could we perhaps consider adding a way to override the table-level FileIO implementations?

Thanks

test.txt

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
kevinjqliu (Contributor) commented

Hey @bigluck, good to hear from you again :) Happy new year!

You brought up an interesting use case, and I think we should cement the right behavior. There are a couple of different sets of configs:

  1. catalog initialization configs; we can also call this the "client config" since it's the client of the IRC (Iceberg REST Catalog)
  2. IRC server defaults (from the /config endpoint); this is the "catalog config"
  3. namespace configs (the namespace itself can specify properties)
  4. table configs

From the server side, table configs take precedence over namespace configs, which take precedence over server configs.
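
As a rough sketch, that server-side layering behaves like successive dict updates, lowest precedence first (the variable names and values here are illustrative, not actual PyIceberg/IRC identifiers):

```python
# Illustrative server-side precedence: table > namespace > server defaults.
server_defaults = {"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"}
namespace_properties = {}  # the namespace can specify its own properties
table_properties = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}

effective_config = {}
effective_config.update(server_defaults)       # lowest precedence
effective_config.update(namespace_properties)  # overrides server defaults
effective_config.update(table_properties)      # highest precedence: fsspec wins
```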

Between the client and the server, I'm not sure which one should take precedence. Currently, the server configs take precedence and we override client configs during catalog initialization.

```python
def _fetch_config(self) -> None:
    params = {}
    if warehouse_location := self.properties.get(WAREHOUSE_LOCATION):
        params[WAREHOUSE_LOCATION] = warehouse_location
    with self._create_session() as session:
        response = session.get(self.url(Endpoints.get_config, prefixed=False), params=params)
    try:
        response.raise_for_status()
    except HTTPError as exc:
        self._handle_non_200_response(exc, {})
    config_response = ConfigResponse(**response.json())
    config = config_response.defaults
    config.update(self.properties)
    config.update(config_response.overrides)
    self.properties = config
    # Update URI based on overrides
    self.uri = config[URI]
```

We should figure out if this is the right behavior. Is the server config more of a suggestion or a must-override?

bigluck (Contributor, Author) commented Jan 29, 2025

Thanks @kevinjqliu, and happy new year to you too :)

I see your point, and I understand why the server should be able to control its clients, but from my PoV the client should always have a way to overwrite these defaults; otherwise you can't mix different server implementations together using a single data catalog.

The current situation is kind of confusing.

```python
catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    }
)
```

In this example py-io-impl seems not to be a "client config" but more of a "catalog config", while uri is technically a "client config"; in any case, both of them are stored in the same attribute: self.properties.

Once started, the REST catalog class sends a request to the /config endpoint by invoking the _fetch_config() function, which does:

```python
config = config_response.defaults
config.update(self.properties)
config.update(config_response.overrides)
self.properties = config
```

It takes the defaults from the remote server, extends them with self.properties (my uri, s3.endpoint, py-io-impl, s3.access-key-id, s3.secret-access-key), and finally applies the server's overrides on top.
That's fine :)

In this case, if the server wants to override py-io-impl it can, by setting the corresponding overrides key; otherwise the user wins against the remote config.
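
To make that precedence concrete, here is a minimal sketch of the merge with hypothetical values (the dict contents are illustrative, not taken from a real server response):

```python
# Hypothetical values, purely to illustrate the merge order in _fetch_config().
defaults = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}             # server /config defaults
client_properties = {"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"}  # load_catalog(**kwargs)
overrides = {}                                                            # server chose not to force anything

config = dict(defaults)
config.update(client_properties)  # client wins over server defaults
config.update(overrides)          # server overrides win over everything

assert config["py-io-impl"] == "pyiceberg.io.pyarrow.PyArrowFileIO"
```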

What's weird is that these properties are not used when we do a load_table:

```python
identifier_tuple = self._identifier_to_tuple_without_catalog(identifier)
response = self._session.get(
    self.url(Endpoints.load_table, prefixed=True, **self._split_identifier_for_path(identifier_tuple))
)
try:
    response.raise_for_status()
except HTTPError as exc:
    self._handle_non_200_response(exc, {404: NoSuchTableError})

table_response = TableResponse(**response.json())
return self._response_to_table(identifier_tuple, table_response)
```

We get the table data from the remote server, ignore the local configuration, and create a new Table object:

```python
def _response_to_table(self, identifier_tuple: Tuple[str, ...], table_response: TableResponse) -> Table:
    return Table(
        identifier=identifier_tuple,
        metadata_location=table_response.metadata_location,  # type: ignore
        metadata=table_response.metadata,
        io=self._load_file_io(
            {**table_response.metadata.properties, **table_response.config}, table_response.metadata_location
        ),
        catalog=self,
        config=table_response.config,
    )
```

This is the part I think is not OK:

```python
{**table_response.metadata.properties, **table_response.config}
```

We should have a way to overwrite both the table properties and the table config, because the client knows what it can use and what it can't.

I think there are ~3 options:

  1. `{**table_response.metadata.properties, **table_response.config, **self.properties}`
     I don't like it because we propagate everything.

  2. When the user invokes load_table() she can optionally pass a set of properties that will overwrite the remote properties:

     ```python
     load_table('my.table', **{"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"})
     ```

  3. Similar to option 1, but with a limited set of properties. In other words, we define a configurable list of properties that needs to be propagated to the Table object (see the sketch after this list):

     ```python
     {**table_response.metadata.properties, **table_response.config, **{k: v for k, v in self.properties.items() if k in self.white_list_of_keys}}
     ```
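
A minimal sketch of option 3, assuming a hypothetical white_list_of_keys set of client-side keys (the names and wiring are illustrative, not an existing PyIceberg API):

```python
# Hypothetical sketch of option 3; `white_list_of_keys` is an illustrative
# name for a configurable set of client properties that are allowed to win.
white_list_of_keys = {"py-io-impl"}

def merged_io_properties(catalog_properties: dict, table_response) -> dict:
    # The table config still wins over the table metadata properties...
    props = {**table_response.metadata.properties, **table_response.config}
    # ...but the whitelisted client properties win over both.
    props.update({k: v for k, v in catalog_properties.items() if k in white_list_of_keys})
    return props
```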
