py-io-impl config propagation #1589

Open

bigluck opened this issue Jan 29, 2025 · 2 comments

bigluck (Contributor) commented Jan 29, 2025

Please describe the bug 🐞

Hey team! 👋 Hope you're doing well!
I've been working with PyIceberg and ran into an interesting situation regarding FileIO implementation configurations.

TL;DR: it seems like PyIceberg does not allow you to locally override some of the configurations returned by the remote Iceberg REST API, and I would like to understand whether this is intended and what the correct way to override these configurations is.

Context

  • The REST API server (/config endpoint) is configured to use PyArrow as the default FileIO implementation
  • The table endpoint returns a configuration specifying a different FileIO implementation (fsspec)
  • I want to force PyArrow locally since I don't have fsspec/s3fs installed (and prefer using PyArrow)

Starting from PR 9868, the Nessie Iceberg endpoint always returns config.py-io-impl=pyiceberg.io.fsspec.FsspecFileIO when querying a table endpoint. While there might be a separate issue in Nessie's implementation (as my server is configured to override the py-io-impl with pyarrow.PyArrowFileIO), I believe there's also a concern in PyIceberg's handling of configuration priorities.


Current Behavior

I've created a test case that explores different configuration scenarios. Here's what I'm observing:

```python
catalog = load_catalog("docs", **{"uri": "https://a.b.c.d/iceberg", "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"})
table = catalog.load_table("my_namespace.my_table")
```

similarly to what is described in the official documentation:

Then load the `prod` catalog:
```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    }
)
```

The table configuration from the REST API seems to take precedence over both:

  • The server's default configuration
  • Local overrides passed during catalog initialization

This leads to failures when the table endpoint specifies fsspec.FsspecFileIO but s3fs isn't available locally:

```
ValueError: Could not initialize FileIO: pyiceberg.io.fsspec.FsspecFileIO
```
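
For reference, this error typically comes from the dynamic import of the configured FileIO class. Here is a minimal sketch of that resolution, with an illustrative function name (not PyIceberg's actual loader), showing why a missing fsspec/s3fs dependency surfaces as the ValueError above:

```python
import importlib

def load_io_class(py_io_impl: str):
    # Illustrative sketch: split "pyiceberg.io.fsspec.FsspecFileIO" into a
    # module path and a class name, then import the module dynamically.
    module_name, class_name = py_io_impl.rsplit(".", 1)
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # If fsspec/s3fs are not installed locally, the import fails here,
        # and the client reports "Could not initialize FileIO".
        raise ValueError(f"Could not initialize FileIO: {py_io_impl}") from exc
    return getattr(module, class_name)
```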

Real-world Impact

This configuration priority issue creates practical problems in multi-system setups. Consider this scenario:

  • System A uses fsspec for writing tables into Nessie/Iceberg
  • System B needs to read the same tables using PyArrow

With the current implementation, System B can never successfully read the tables because:

  • The server forces the client to use fsspec
  • This happens even when the client explicitly requests PyArrow
  • There's no way to override this behavior at the client level

Question

Is there a way for PyIceberg to use a specific FileIO implementation regardless of what the table endpoint or the server returns?
This would be particularly useful in scenarios where:

  • The client environment is set up for a specific implementation
  • Different FileIO implementations might be more efficient in certain environments
  • Required dependencies for the server-specified implementation aren't available locally

I've attached a test file that demonstrates the behavior.

Would love to hear your thoughts on this! Is this the intended behavior? If so, could we perhaps consider adding a way to override the table-level FileIO implementations?

Thanks

test.txt

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
kevinjqliu (Contributor) commented

Hey @bigluck, good to hear from you again :) Happy new year!

You brought up an interesting use case, and I think we should cement the right behavior. There are a couple of different sets of configs:

  1. catalog initialization configs; we can also call this the "client config" since it's the client of the IRC (Iceberg REST Catalog)
  2. IRC server defaults (from the /config endpoint); this is the "catalog config"
  3. namespace configs (the namespace itself can specify properties)
  4. table configs

From the server side, table configs take precedence over namespace configs, which take precedence over server configs.
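
As a rough sketch, that server-side layering behaves like successive dict updates, lowest precedence first (the variable names and values here are illustrative, not actual PyIceberg/IRC identifiers):

```python
# Illustrative server-side precedence: table > namespace > server defaults.
server_defaults = {"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"}
namespace_properties = {}  # the namespace can specify its own properties
table_properties = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}

effective_config = {}
effective_config.update(server_defaults)       # lowest precedence
effective_config.update(namespace_properties)  # overrides server defaults
effective_config.update(table_properties)      # highest precedence: fsspec wins
```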

Between the client and the server, I'm not sure which one should take precedence. Currently, the server configs take precedence and we override client configs during catalog initialization.

```python
def _fetch_config(self) -> None:
    params = {}
    if warehouse_location := self.properties.get(WAREHOUSE_LOCATION):
        params[WAREHOUSE_LOCATION] = warehouse_location
    with self._create_session() as session:
        response = session.get(self.url(Endpoints.get_config, prefixed=False), params=params)
    try:
        response.raise_for_status()
    except HTTPError as exc:
        self._handle_non_200_response(exc, {})
    config_response = ConfigResponse(**response.json())
    config = config_response.defaults
    config.update(self.properties)
    config.update(config_response.overrides)
    self.properties = config
    # Update URI based on overrides
    self.uri = config[URI]
```

We should figure out if this is the right behavior. Is the server config more of a suggestion or a must-override?

bigluck (Contributor, Author) commented Jan 29, 2025

Thanks @kevinjqliu, and happy new year to you too :)

I see your point, and I understand why the server should be able to control its clients, but from my PoV the client should always have a way to overwrite these defaults; otherwise you can't mix different server implementations together using a single data catalog.

The current situation is kind of confusing.

```python
catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    }
)
```

In this example py-io-impl seems not to be a "client config" but more of a "catalog config", while uri is technically a "client config"; in any case, both of them are stored in the same attribute: self.properties.

Once started, the REST catalog class sends a request to the /config endpoint by invoking the _fetch_config() function, which does:

```python
config = config_response.defaults
config.update(self.properties)
config.update(config_response.overrides)
self.properties = config
```

It takes the defaults from the remote server, extends them with self.properties (my uri, s3.endpoint, py-io-impl, s3.access-key-id, s3.secret-access-key), and finally applies the server's overrides on top.
That's fine :)

In this case, if the server wants to override py-io-impl it can, by setting the corresponding overrides key; otherwise the user wins against the remote config.
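
To make that precedence concrete, here is a minimal sketch of the merge with hypothetical values (the dict contents are illustrative, not taken from a real server response):

```python
# Hypothetical values, purely to illustrate the merge order in _fetch_config().
defaults = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}             # server /config defaults
client_properties = {"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"}  # load_catalog(**kwargs)
overrides = {}                                                            # server chose not to force anything

config = dict(defaults)
config.update(client_properties)  # client wins over server defaults
config.update(overrides)          # server overrides win over everything

assert config["py-io-impl"] == "pyiceberg.io.pyarrow.PyArrowFileIO"
```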

What's weird is that these properties are not used when we do a load_table:

```python
identifier_tuple = self._identifier_to_tuple_without_catalog(identifier)
response = self._session.get(
    self.url(Endpoints.load_table, prefixed=True, **self._split_identifier_for_path(identifier_tuple))
)
try:
    response.raise_for_status()
except HTTPError as exc:
    self._handle_non_200_response(exc, {404: NoSuchTableError})

table_response = TableResponse(**response.json())
return self._response_to_table(identifier_tuple, table_response)
```

We get the table data from the remote server, ignore the local configuration, and create a new Table object:

```python
def _response_to_table(self, identifier_tuple: Tuple[str, ...], table_response: TableResponse) -> Table:
    return Table(
        identifier=identifier_tuple,
        metadata_location=table_response.metadata_location,  # type: ignore
        metadata=table_response.metadata,
        io=self._load_file_io(
            {**table_response.metadata.properties, **table_response.config}, table_response.metadata_location
        ),
        catalog=self,
        config=table_response.config,
    )
```

This is the part I think is not OK:

```python
{**table_response.metadata.properties, **table_response.config}
```

We should have a way to overwrite both the table properties and the table config, because the client knows what it can use and what it can't.

I think there are ~3 options:

  1. `{**table_response.metadata.properties, **table_response.config, **self.properties}`
     I don't like it because we propagate everything.

  2. When the user invokes load_table() she can optionally pass a set of properties that will overwrite the remote properties:

     ```python
     load_table('my.table', **{"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"})
     ```

  3. Similar to option 1, but with a limited set of properties. In other words, we define a configurable list of properties that needs to be propagated to the Table object (see the sketch after this list):

     ```python
     {**table_response.metadata.properties, **table_response.config, **{k: v for k, v in self.properties.items() if k in self.white_list_of_keys}}
     ```
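
A minimal sketch of option 3, assuming a hypothetical white_list_of_keys set of client-side keys (the names and wiring are illustrative, not an existing PyIceberg API):

```python
# Hypothetical sketch of option 3; `white_list_of_keys` is an illustrative
# name for a configurable set of client properties that are allowed to win.
white_list_of_keys = {"py-io-impl"}

def merged_io_properties(catalog_properties: dict, table_response) -> dict:
    # The table config still wins over the table metadata properties...
    props = {**table_response.metadata.properties, **table_response.config}
    # ...but the whitelisted client properties win over both.
    props.update({k: v for k, v in catalog_properties.items() if k in white_list_of_keys})
    return props
```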
