py-io-impl config propagation #1589
Hey @bigluck good to hear from you again :) happy new year! You brought up an interesting use case and I think we should cement the right behavior. There are a couple of different sets of configs.

From the server side, the table configs take precedence over namespace configs, which in turn take precedence over server configs. Between the client and the server, I'm not sure which one should take precedence. Currently, the server configs take precedence and we override client configs during catalog initialization (`iceberg-python/pyiceberg/catalog/rest.py`, lines 383 to 402 in `1adbb87`).

We should figure out if this is the right behavior. Is the server config more of a suggestion or a must-override?
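To make that initialization order concrete, here's a minimal sketch of the merge with plain dicts standing in for the real config objects (all property values here are illustrative, not taken from a real catalog):

```python
# Sketch of the merge order used during REST catalog initialization:
# server defaults < client properties < server overrides.
defaults = {"warehouse": "default-wh"}                      # from /config "defaults"
client = {"warehouse": "my-wh", "s3.region": "us-east-1"}   # passed by the client
overrides = {"warehouse": "forced-wh"}                      # from /config "overrides"

config = dict(defaults)
config.update(client)     # client settings beat the server defaults...
config.update(overrides)  # ...but server overrides beat everything, including the client

# The client's "warehouse" is silently replaced by the server override,
# while client-only keys like "s3.region" survive.
```

This is exactly the "suggestion vs. must-override" question: anything the server puts in `overrides` is impossible for the client to change.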
Thanks @kevinjqliu, and happy new year too :) I see your point, and I see how the server should always control the clients, but from my PoV the client should always have a way to overwrite these defaults; otherwise you can't mix different server implementations together using a single data catalog. The current situation is kind of confusing.
In this example, once started, the REST catalog class sends a request to the `/config` endpoint and merges the result with the local properties:

```python
config = config_response.defaults
config.update(self.properties)
config.update(config_response.overrides)
self.properties = config
```

It takes the server `defaults`, applies the client properties on top, and then applies the server `overrides` last. In this case, if the server wants to override a property like `py-io-impl`, it always wins and the client has no way to change it.

What's weird is that these properties are not used when we do a `load_table`:

```python
identifier_tuple = self._identifier_to_tuple_without_catalog(identifier)
response = self._session.get(
    self.url(Endpoints.load_table, prefixed=True, **self._split_identifier_for_path(identifier_tuple))
)
try:
    response.raise_for_status()
except HTTPError as exc:
    self._handle_non_200_response(exc, {404: NoSuchTableError})

table_response = TableResponse(**response.json())
return self._response_to_table(identifier_tuple, table_response)
```

When we get the table data from the remote server, we ignore the local configuration and we create a new table object:

```python
def _response_to_table(self, identifier_tuple: Tuple[str, ...], table_response: TableResponse) -> Table:
    return Table(
        identifier=identifier_tuple,
        metadata_location=table_response.metadata_location,  # type: ignore
        metadata=table_response.metadata,
        io=self._load_file_io(
            {**table_response.metadata.properties, **table_response.config}, table_response.metadata_location
        ),
        catalog=self,
        config=table_response.config,
    )
```

This is what I think is not OK:
We should have a way to overwrite both the table properties and the table config, because the client knows what it can use and what it can't. I think there are ~3 options, e.g. passing the override at call time:

```python
load_table('my.table', **{"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"})
```

or letting a whitelist of client properties win when building the io properties:

```python
{**table_response.metadata.properties, **table_response.config, **{k: v for k, v in self.properties.items() if k in self.white_list_of_keys}}
```
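As a toy sketch of the whitelist idea: `white_list_of_keys` and the `py-io-impl` values are taken from the discussion above, the other names and values are illustrative (plain dicts, not the real pyiceberg objects):

```python
metadata_properties = {"write.format.default": "parquet"}
table_config = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}         # from the table endpoint
client_properties = {"py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO"}  # local preference
white_list_of_keys = {"py-io-impl"}

# Today: the io properties are built only from the table response,
# so the client preference is dropped entirely.
current = {**metadata_properties, **table_config}

# With a whitelist: whitelisted client properties win over the table response.
proposed = {
    **metadata_properties,
    **table_config,
    **{k: v for k, v in client_properties.items() if k in white_list_of_keys},
}
```

With this merge order, the client's whitelisted `py-io-impl` wins while everything else still comes from the server.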
Please describe the bug 🐞
Hey team! 👋 Hope you're doing well!
I've been working with PyIceberg and ran into an interesting situation regarding FileIO implementation configurations.
TL;DR: it seems like `pyiceberg` does not allow overriding (locally) some configurations returned by the remote Iceberg REST API, and I would like to understand whether this is intended and what the way to overwrite some of these configurations is.

Context

- my remote server (`/config` endpoint) is configured to use `PyArrow` as the default `FileIO` implementation
- the table endpoint returns a different `FileIO` implementation (`fsspec`)
- I want to use `PyArrow` locally since I don't have `fsspec`/`s3fs` installed (and prefer using `PyArrow`)

Starting from PR 9868, the Nessie Iceberg endpoint always returns `config.py-io-impl=pyiceberg.io.fsspec.FsspecFileIO` when querying a table endpoint. While there might be a separate issue in Nessie's implementation (as my server is configured to override the `py-io-impl` with `pyarrow.PyArrowFileIO`), I believe there's also a concern in PyIceberg's handling of configuration priorities.

For more info:
Current Behavior

I've created a test case that explores different configuration scenarios, similar to what is described in the official documentation (`iceberg-python/mkdocs/docs/api.md`, lines 56 to 70 in `1adbb87`). Here's what I'm observing:

The table configuration from the REST API seems to take precedence over both:

- the server's `/config` defaults
- the local client properties

This leads to failures when the table endpoint specifies `fsspec.FsspecFileIO` but `s3fs` isn't available locally.

Real-world Impact
This configuration priority issue creates practical problems in multi-system setups. Consider this scenario:

- System A uses `fsspec` for writing tables into Nessie/Iceberg
- System B reads those tables and only has `PyArrow` installed

With the current implementation, System B can never successfully read the tables because:

- the table endpoint always reports `fsspec` as the `FileIO` implementation
- there is no local way to fall back to `PyArrow`
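The failure on System B ultimately boils down to a missing optional dependency. A minimal, pyiceberg-independent way to see the failure mode is to check whether the module backing a given `FileIO` implementation is importable at all (the helper name here is illustrative):

```python
import importlib


def backend_available(module_name: str) -> bool:
    """Return True if the module backing a FileIO implementation can be imported."""
    try:
        importlib.import_module(module_name)
        return True
    except ModuleNotFoundError:
        return False


# On a machine without s3fs installed, backend_available("s3fs") is False,
# so any table whose config forces the fsspec FileIO cannot be loaded there.
```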
Question

Is there a way for PyIceberg to use a specific FileIO implementation regardless of what the table endpoint or the server returns?

This would be particularly useful in scenarios where:

- some `FileIO` implementations might be more efficient in certain environments

I've attached a test file that demonstrates the behavior.

Would love to hear your thoughts on this! Is this the intended behavior? If so, could we perhaps consider adding a way to override the table-level FileIO implementations?

Thanks
test.txt
Willingness to contribute