Some column statistics are missing after writing data to a table #1482

Open
1 of 3 tasks
rotem-ad opened this issue Jan 1, 2025 · 3 comments

rotem-ad commented Jan 1, 2025

Apache Iceberg version

0.8.1 (latest release)

Please describe the bug 🐞

Following this Slack thread:

It seems that column statistics are not fully collected when writing data, whether through the Arrow Dataframe API or the add_files method.

Example 1: Stats after using add_files method (shown using Trino show stats):

show stats for iceberg_test.tests.my_table_add_files;

+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size|distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null     |null                 |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |null     |null                 |null          |null     |null                |null               |
|probability              |null     |null                 |0             |null     |0.2500010132789612  |1.0                |
|null                     |null     |null                 |null          |897861060|null                |null               |
+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+

As you can see, the location column is missing all of its statistics when using add_files.
In addition, while calling add_files I hit the following message many times (the operation still succeeded):

PyArrow statistics missing for column 1 when writing file

Example 2: Stats after using Arrow Dataframe API to load the same Parquet files (shown using Trino show stats):

show stats for iceberg_test.tests.my_table_df;

+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null       |null                 |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |14001826255|null                 |0             |null     |null                |null               |
|probability              |null       |null                 |0             |null     |0.2500010132789612  |1.0                |
|null                     |null       |null                 |null          |897861060|null                |null               |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+

On the other hand, after collecting table statistics using Trino, the column statistics look more complete:

analyze iceberg_test.tests.my_table_df;

show stats for iceberg_test.tests.my_table_df;

+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null       |446189731            |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |14001826255|15973993             |0             |null     |null                |null               |
|probability              |null       |4159975              |0             |null     |0.2500010132789612  |1.0                |
|null                     |null       |null                 |null          |897861060|null                |null               |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@kevinjqliu
Contributor

Thanks for reporting this issue! Both write and add_files use data_file_statistics_from_parquet_metadata to generate stats.

Do you have the data file so I can try to reproduce this issue?

@kevinjqliu
Contributor

Differences between table 1 (add_files) and table 2 (write)

  • data_size for location: null vs 14001826255
  • nulls_fraction for location: null vs 0

Difference between table 2 (write) and table 3 (Trino stats collection)

  • distinct_values_count for id, location, and probability: null vs values
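
The comparison above can be reproduced mechanically. This is a minimal sketch in plain Python that parses the data rows copied from the three show stats outputs in this issue and lists the cells that differ; parse_stats and diff_stats are hypothetical helper names, not part of PyIceberg or Trino.

```python
STATS = ["data_size", "distinct_values_count", "nulls_fraction",
         "row_count", "low_value", "high_value"]


def parse_stats(text):
    """Parse pipe-delimited SHOW STATS data rows into {column_name: {stat: value}}."""
    rows = {}
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        name, values = cells[0], cells[1:]
        rows[name] = {k: (None if v == "null" else v) for k, v in zip(STATS, values)}
    return rows


def diff_stats(left, right):
    """Return (column_name, stat, left_value, right_value) for every differing cell."""
    return [
        (name, stat, left[name][stat], right[name][stat])
        for name in left
        for stat in STATS
        if left[name][stat] != right[name][stat]
    ]


# Data rows copied from the tables in this issue (borders and headers removed).
add_files = parse_stats("""
|id|null|null|0|null|-9223372031483295744|9223372014179617792|
|location|null|null|null|null|null|null|
|probability|null|null|0|null|0.2500010132789612|1.0|
|null|null|null|null|897861060|null|null|
""")
write = parse_stats("""
|id|null|null|0|null|-9223372031483295744|9223372014179617792|
|location|14001826255|null|0|null|null|null|
|probability|null|null|0|null|0.2500010132789612|1.0|
|null|null|null|null|897861060|null|null|
""")
analyzed = parse_stats("""
|id|null|446189731|0|null|-9223372031483295744|9223372014179617792|
|location|14001826255|15973993|0|null|null|null|
|probability|null|4159975|0|null|0.2500010132789612|1.0|
|null|null|null|null|897861060|null|null|
""")

add_files_vs_write = diff_stats(add_files, write)
write_vs_analyzed = diff_stats(write, analyzed)
for row in add_files_vs_write + write_vs_analyzed:
    print(row)
```

The first diff contains only the two location cells, and the second contains only the distinct_values_count cells for the three columns.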

@amitgilad3
Contributor

Hey @rotem-ad, do you have a file we can use to reproduce the issue? I can take a look into it.
