Issue 1045 - Allow downcasting of parquet timestamp[ns] to timestamp[us] for Table.add_files() #1569

Open · lloyd-EA wants to merge 6 commits into main
Conversation

lloyd-EA (Author)

When calling Table.add_files(), the PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE environment variable isn't being applied in _check_pyarrow_schema_compatible, so the call fails for parquet files with ns timestamps.

Credits to @fussion2222 for suggesting this fix.

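For context, a minimal usage sketch of what this change enables (not code from the PR itself). The catalog name, table identifier, and file path are placeholders for your own environment:

```python
import os

# Opt in to the ns -> us downcast before pyiceberg checks the parquet schema.
os.environ["PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE"] = "True"

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")           # placeholder catalog name
tbl = catalog.load_table("default.events")  # placeholder table identifier

# Before this change, add_files raised a schema-compatibility error for parquet files
# containing timestamp[ns] columns even with the env var set; with the env var honoured
# in _check_pyarrow_schema_compatible, the files can be registered.
tbl.add_files(file_paths=["s3://warehouse/default/events/part-0.parquet"])
```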
@kevinjqliu (Contributor) left a comment


Thanks for the PR!

Generally LGTM, another option is to pass the option down from add_files.

Could you also add a test case for this? Something similar to:

```python
def test_add_files_with_timestamp_tz_ns_fails(session_catalog: Catalog, format_version: int, mocker: MockerFixture) -> None:
```

…ils when the PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE is set to False

test_add_files_with_automatic_downcast_of_timestamp_to_us: added the test as requested; it verifies that ns timestamp columns are downcast to us when the env var PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE is set to True
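For reference, a hedged sketch of what the added test might look like, not the exact code from this PR: the session_catalog, format_version, and mocker fixtures and the warehouse path follow the existing add_files tests quoted later in this thread, while the table identifier and the quux column are illustrative; the sample value corresponds to the 2021-03-17 07:54:47.249846175 timestamp discussed below.

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq
from pytest_mock import MockerFixture

from pyiceberg.catalog import Catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestamptzType


def test_add_files_with_automatic_downcast_of_timestamp_to_us(
    session_catalog: Catalog, format_version: int, mocker: MockerFixture
) -> None:
    # Opt in to the ns -> us downcast for this test only.
    mocker.patch.dict(os.environ, {"PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE": "True"})

    identifier = f"default.timestamp_ns_added_v{format_version}"
    tbl = session_catalog.create_table(
        identifier,
        schema=Schema(NestedField(field_id=1, name="quux", field_type=TimestamptzType(), required=False)),
        properties={"format-version": str(format_version)},
    )

    # Write a parquet file whose timestamp column has nanosecond precision.
    arrow_schema = pa.schema([pa.field("quux", pa.timestamp("ns", tz="UTC"), nullable=True)])
    arrow_table = pa.Table.from_pylist([{"quux": 1615967687249846175}], schema=arrow_schema)
    file_path = f"s3://warehouse/default/timestamp_ns_added/v{format_version}/test.parquet"
    with tbl.io.new_output(file_path).create(overwrite=True) as fos:
        with pq.ParquetWriter(fos, schema=arrow_schema) as writer:
            writer.write_table(arrow_table)

    # With the env var honoured in _check_pyarrow_schema_compatible, this no longer raises.
    tbl.add_files(file_paths=[file_path])

    # NameMapping must have been set to enable reads.
    assert tbl.name_mapping() is not None
```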
@lloyd-EA (Author)

@kevinjqliu Thanks for having a look.

I've updated the PR and added a test case for it as requested.

I'm open to suggestions and corrections. Thanks!

@sungwy (Collaborator) commented Jan 28, 2025

Thanks for working on this PR @lloyd-EA, and thank you @kevinjqliu for the review. Could I add another suggestion: read the table using Spark and verify that the value read back is equal to the expected value?

As I mentioned in a comment on #1045 (comment), I think it would be great to verify that the data is readable as a us timestamp through a Spark session as well. Here's an example that uses a Spark session for reference:

```python
def test_add_files_to_unpartitioned_table(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
    identifier = f"default.unpartitioned_table_v{format_version}"
    tbl = _create_table(session_catalog, identifier, format_version)

    file_paths = [f"s3://warehouse/default/unpartitioned/v{format_version}/test-{i}.parquet" for i in range(5)]
    # write parquet files
    for file_path in file_paths:
        fo = tbl.io.new_output(file_path)
        with fo.create(overwrite=True) as fos:
            with pq.ParquetWriter(fos, schema=ARROW_SCHEMA) as writer:
                writer.write_table(ARROW_TABLE)

    # add the parquet files as data files
    tbl.add_files(file_paths=file_paths)

    # NameMapping must have been set to enable reads
    assert tbl.name_mapping() is not None

    rows = spark.sql(
        f"""
        SELECT added_data_files_count, existing_data_files_count, deleted_data_files_count
        FROM {identifier}.all_manifests
        """
    ).collect()
```

It's expected to fail because add_files does not rewrite the data files, so the timestamp column keeps its ns precision, and Spark's TimestampType does not support nanoseconds.
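The example above only checks manifest counts. A hedged sketch of the value check being suggested, continuing from that example (the quux column name is a placeholder for the new test's timestamp column, and it assumes the Spark session timezone is UTC):

```python
from datetime import datetime

# Read the timestamp back through Spark and compare against the expected value
# (the ns value truncated to microsecond precision, since Spark's TimestampType is us).
rows = spark.sql(f"SELECT quux FROM {identifier}").collect()
assert rows[0][0] == datetime(2021, 3, 17, 7, 54, 47, 249846)
```

As the follow-up below shows, this is exactly the kind of check that surfaces the Spark timestamp issue.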
@lloyd-EA (Author) commented Jan 28, 2025

@sungwy I updated the PR. It seems that Spark reads the column as us and doesn't check that the underlying data file's precision is ns, so when the value is converted to a datetime it is wrong.

expected: 2021-03-17 07:54:47.249846175
(screenshot: the value actually read back through Spark)

To me, this is a problem with Spark not being able to handle ns timestamps properly. Since I won't be using Spark for my application, I'll continue to use my forked repo for now, as it's sufficient for my needs.
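For context, here is one way to see the kind of unit mismatch being described, using the nanosecond epoch value behind the expected timestamp above; this only illustrates misreading the unit of the stored int64 and is not the exact code path Spark takes:

```python
from datetime import datetime, timedelta, timezone

raw = 1_615_967_687_249_846_175  # nanoseconds since epoch: 2021-03-17 07:54:47.249846175
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Interpreted in the file's actual unit (nanoseconds), truncated to us:
print(epoch + timedelta(microseconds=raw // 1_000))  # 2021-03-17 07:54:47.249846+00:00

# Interpreted as microseconds, the same integer lands tens of thousands of years in the
# future -- far enough out that Python's datetime cannot even represent it.
try:
    print(epoch + timedelta(microseconds=raw))
except OverflowError as exc:
    print("misread as microseconds:", exc)
```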

@sungwy (Collaborator) commented Jan 29, 2025

Thank you for following up on the suggestion @lloyd-EA. I'm glad we included this integration test to surface the issue with Spark's timestamp conversion.

I think we will have to investigate the corresponding schema evolution handling logic in Spark-Iceberg to understand whether this is a bug or expected behavior in Spark.

If it is not possible to fix this issue in Spark-Iceberg, I think it means we will not be able to support adding ns timestamp files until the V3 spec is supported.

Please do keep in mind that the Iceberg spec is language agnostic, so ruling out the possibility of using the same Iceberg table with other tools (like Trino or Spark) may impact you down the line as your application grows.
