Issue 1045 - Allow downcasting of parquet timestamp[ns] to timestamp[us] for Table.add_files() #1569

Open · lloyd-EA wants to merge 6 commits into main
Conversation

lloyd-EA (Author)

When calling Table.add_files(), the PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE environment variable isn't being applied in _check_pyarrow_schema_compatible, so the call fails for parquet files with ns timestamps.

Credits to @fussion2222 for suggesting this fix.

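For context, a minimal usage sketch of what this change enables (not code from the PR itself). The catalog name, table identifier, and file path are placeholders for your own environment:

```python
import os

# Opt in to the ns -> us downcast before pyiceberg checks the parquet schema.
os.environ["PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE"] = "True"

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")           # placeholder catalog name
tbl = catalog.load_table("default.events")  # placeholder table identifier

# Before this change, add_files raised a schema-compatibility error for parquet files
# containing timestamp[ns] columns even with the env var set; with the env var honoured
# in _check_pyarrow_schema_compatible, the files can be registered.
tbl.add_files(file_paths=["s3://warehouse/default/events/part-0.parquet"])
```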
@kevinjqliu (Contributor) left a comment


Thanks for the PR!

Generally LGTM, another option is to pass the option down from add_files.

Could you also add a test case for this? Something similar to:

```python
def test_add_files_with_timestamp_tz_ns_fails(session_catalog: Catalog, format_version: int, mocker: MockerFixture) -> None:
```

…ils when the PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE is set to False

test_add_files_with_automatic_downcast_of_timestamp_to_us: added the test as requested; it verifies that ns timestamp columns are downcast to us when the env var PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE is set to True
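For reference, a hedged sketch of what the added test might look like, not the exact code from this PR: the session_catalog, format_version, and mocker fixtures and the warehouse path follow the existing add_files tests quoted later in this thread, while the table identifier and the quux column are illustrative; the sample value corresponds to the 2021-03-17 07:54:47.249846175 timestamp discussed below.

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq
from pytest_mock import MockerFixture

from pyiceberg.catalog import Catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestamptzType


def test_add_files_with_automatic_downcast_of_timestamp_to_us(
    session_catalog: Catalog, format_version: int, mocker: MockerFixture
) -> None:
    # Opt in to the ns -> us downcast for this test only.
    mocker.patch.dict(os.environ, {"PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE": "True"})

    identifier = f"default.timestamp_ns_added_v{format_version}"
    tbl = session_catalog.create_table(
        identifier,
        schema=Schema(NestedField(field_id=1, name="quux", field_type=TimestamptzType(), required=False)),
        properties={"format-version": str(format_version)},
    )

    # Write a parquet file whose timestamp column has nanosecond precision.
    arrow_schema = pa.schema([pa.field("quux", pa.timestamp("ns", tz="UTC"), nullable=True)])
    arrow_table = pa.Table.from_pylist([{"quux": 1615967687249846175}], schema=arrow_schema)
    file_path = f"s3://warehouse/default/timestamp_ns_added/v{format_version}/test.parquet"
    with tbl.io.new_output(file_path).create(overwrite=True) as fos:
        with pq.ParquetWriter(fos, schema=arrow_schema) as writer:
            writer.write_table(arrow_table)

    # With the env var honoured in _check_pyarrow_schema_compatible, this no longer raises.
    tbl.add_files(file_paths=[file_path])

    # NameMapping must have been set to enable reads.
    assert tbl.name_mapping() is not None
```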
@lloyd-EA (Author)

@kevinjqliu Thanks for having a look.

I've updated the PR and added a test case for it as requested.

I'm open to suggestions and corrections. Thanks!

@sungwy (Collaborator) commented Jan 28, 2025

Thanks for working on this PR @lloyd-EA, and thank you @kevinjqliu for the review. Could I add another suggestion: read the table using Spark and verify that the value read back is equal to the expected value?

As I mentioned in a comment on #1045 (comment), I think it would be great to verify that the data is readable as a us timestamp through a Spark session as well. Here's an example that uses a Spark session for reference:

```python
def test_add_files_to_unpartitioned_table(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
    identifier = f"default.unpartitioned_table_v{format_version}"
    tbl = _create_table(session_catalog, identifier, format_version)

    file_paths = [f"s3://warehouse/default/unpartitioned/v{format_version}/test-{i}.parquet" for i in range(5)]
    # write parquet files
    for file_path in file_paths:
        fo = tbl.io.new_output(file_path)
        with fo.create(overwrite=True) as fos:
            with pq.ParquetWriter(fos, schema=ARROW_SCHEMA) as writer:
                writer.write_table(ARROW_TABLE)

    # add the parquet files as data files
    tbl.add_files(file_paths=file_paths)

    # NameMapping must have been set to enable reads
    assert tbl.name_mapping() is not None

    rows = spark.sql(
        f"""
        SELECT added_data_files_count, existing_data_files_count, deleted_data_files_count
        FROM {identifier}.all_manifests
        """
    ).collect()
```

It's expected to fail because add_files does not rewrite the data files, so the timestamp column keeps its ns precision, and Spark's TimestampType does not support nanoseconds.
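The example above only checks manifest counts. A hedged sketch of the value check being suggested, continuing from that example (the quux column name is a placeholder for the new test's timestamp column, and it assumes the Spark session timezone is UTC):

```python
from datetime import datetime

# Read the timestamp back through Spark and compare against the expected value
# (the ns value truncated to microsecond precision, since Spark's TimestampType is us).
rows = spark.sql(f"SELECT quux FROM {identifier}").collect()
assert rows[0][0] == datetime(2021, 3, 17, 7, 54, 47, 249846)
```

As the follow-up below shows, this is exactly the kind of check that surfaces the Spark timestamp issue.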
@lloyd-EA (Author) commented Jan 28, 2025

@sungwy I updated the PR. It seems that Spark reads the column as us and doesn't check that the underlying data file's precision is ns, so when the value is converted to a datetime it is wrong.

expected: 2021-03-17 07:54:47.249846175
(screenshot: the value actually read back through Spark)

To me, this is a problem with Spark not being able to handle ns timestamps properly. Since I won't be using Spark for my application, I'll continue to use my forked repo for now, as it's sufficient for my needs.
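For context, here is one way to see the kind of unit mismatch being described, using the nanosecond epoch value behind the expected timestamp above; this only illustrates misreading the unit of the stored int64 and is not the exact code path Spark takes:

```python
from datetime import datetime, timedelta, timezone

raw = 1_615_967_687_249_846_175  # nanoseconds since epoch: 2021-03-17 07:54:47.249846175
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Interpreted in the file's actual unit (nanoseconds), truncated to us:
print(epoch + timedelta(microseconds=raw // 1_000))  # 2021-03-17 07:54:47.249846+00:00

# Interpreted as microseconds, the same integer lands tens of thousands of years in the
# future -- far enough out that Python's datetime cannot even represent it.
try:
    print(epoch + timedelta(microseconds=raw))
except OverflowError as exc:
    print("misread as microseconds:", exc)
```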

@sungwy (Collaborator) commented Jan 29, 2025

Thank you for following up on the suggestion @lloyd-EA. I'm glad we included this integration test to surface the issue with Spark's timestamp conversion.

I think we will have to investigate the corresponding schema evolution handling logic in Spark-Iceberg to understand whether this is a bug or expected behavior in Spark.

If it is not possible to fix this issue in Spark-Iceberg, I think it means we will not be able to support adding ns timestamp files until the V3 spec is supported.

Please do keep in mind that the Iceberg spec is language agnostic, so ruling out the possibility of using the same Iceberg table with other tools (like Trino or Spark) may impact you down the line as your application grows.
