Issue 1045 - Allow downcasting of parquet timestamp[ns] to timestamp[us] for Table.add_files() #1569
base: main
Conversation
When calling Table.add_files(), the env var PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE isn't consulted by _check_pyarrow_schema_compatible, so the call fails for Parquet files with ns timestamps. Credits to @fussion2222 for suggesting this fix.
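For context, the incompatibility is between Parquet files whose Arrow schema carries timestamp[ns] columns and Iceberg's microsecond-precision timestamps. A minimal sketch of the downcast itself in plain PyArrow (the column name and value here are made up for illustration, not taken from the PR):

```python
import pyarrow as pa

# A table with a nanosecond-precision, timezone-aware timestamp column,
# as typically found in Parquet files written by pandas/Arrow.
tbl = pa.table({
    "event_ts": pa.array(
        [1_700_000_000_000_000_000],  # nanoseconds since epoch
        type=pa.timestamp("ns", tz="UTC"),
    )
})

# Iceberg timestamps are microsecond precision, so the column is cast
# from timestamp[ns] to timestamp[us] before the schema check/write.
target_schema = pa.schema([pa.field("event_ts", pa.timestamp("us", tz="UTC"))])
downcast = tbl.cast(target_schema)

print(downcast.schema)  # event_ts: timestamp[us, tz=UTC]
```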
Thanks for the PR!
Generally LGTM; another option is to pass the option down from add_files.
Could you also add a test case for this? Something similar to:
def test_add_files_with_timestamp_tz_ns_fails(session_catalog: Catalog, format_version: int, mocker: MockerFixture) -> None:
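For anyone following along, a rough sketch of what such a test could look like; the table identifier, column name, and local tmp_path handling are illustrative assumptions, not the exact code in tests/integration/test_add_files.py:

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from pytest_mock import MockerFixture

from pyiceberg.catalog import Catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestamptzType


@pytest.mark.integration
def test_add_files_with_automatic_downcast_of_timestamp_to_us(
    session_catalog: Catalog, format_version: int, mocker: MockerFixture, tmp_path
) -> None:
    # Enable the downcast via the env var the PR wires through.
    mocker.patch.dict(os.environ, {"PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE": "True"})

    identifier = "default.test_add_files_ns_downcast"  # illustrative table name
    schema = Schema(NestedField(field_id=1, name="event_ts", field_type=TimestamptzType(), required=False))
    tbl = session_catalog.create_table(
        identifier, schema=schema, properties={"format-version": str(format_version)}
    )

    # Write a Parquet file whose Arrow schema uses nanosecond precision.
    # In the real integration suite the file would live in a location reachable
    # by the catalog's FileIO; a local path is used only to keep the sketch short.
    file_path = str(tmp_path / "ns_timestamps.parquet")
    arrow_table = pa.table(
        {"event_ts": pa.array([1_700_000_000_000_000_000], type=pa.timestamp("ns", tz="UTC"))}
    )
    pq.write_table(arrow_table, file_path)

    # With the env var set, the schema check should accept the file instead of raising.
    tbl.add_files(file_paths=[file_path])
    assert len(tbl.scan().to_arrow()) == 1
```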
…fails when PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE is set to False.
test_add_files_with_automatic_downcast_of_timestamp_to_us: added the test as requested, covering the downcast of ns timestamp columns to us when the env var PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE is set to True.
@kevinjqliu Thanks for having a look. I've updated the PR and added a test case for it as requested. I'm open to suggestions and corrections. Thanks!
Thanks for working on this PR @lloyd-EA, and thank you @kevinjqliu for the review. Could I add another suggestion: read the table back using Spark and verify that the value that's been read is equal to the expected value? As I mentioned in a comment in #1045 (comment), I think it would be great to verify that the data is readable, as in iceberg-python/tests/integration/test_add_files.py, lines 154 to 177 in 9850290.
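Roughly what that verification could look like (a sketch assuming the integration suite's Spark session fixture and the illustrative table/column name from the test sketch above; the expected value matches the 1_700_000_000_000_000_000 ns example):

```python
from pyspark.sql import SparkSession


def verify_with_spark(spark: SparkSession, identifier: str) -> None:
    # Pin the session time zone so the string rendering of the timestamp
    # is deterministic regardless of the test machine's locale.
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    # 1_700_000_000_000_000_000 ns since epoch == 2023-11-14 22:13:20 UTC.
    row = spark.sql(f"SELECT CAST(event_ts AS STRING) AS ts FROM {identifier}").collect()[0]
    assert row["ts"] == "2023-11-14 22:13:20"
```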
It's expected to fail since add_files does not rewrite the data files; it keeps the ns precision on the timestamp, and Spark's TimestampType does not support nanoseconds.
@sungwy I updated the PR there. It seems that Spark reads in the column as expected. To me, this is a problem with Spark not being able to handle nanosecond timestamps.
Thank you for following up on the suggestion @lloyd-EA. I'm glad we included this integration test to find out the issue with Spark's timestamp conversion. I think we will have to investigate the corresponding schema evolution handling logic in Spark-Iceberg to understand whether this is a bug or expected behavior with Spark. If it is not possible to fix this issue in Spark-Iceberg, I think it'll mean that we will not be able to support adding files with ns timestamps.
Please do keep in mind that the Iceberg spec is a language-agnostic spec, and hence ruling out the possibility of using the same Iceberg table with other tools (like Trino or Spark) may impact you down the line as your application system grows.