Some column statistics are missing after writing data to a table #1482

Apache Iceberg version

0.8.1 (latest release)

Please describe the bug 🐞

Following up on a Slack thread: it seems that column statistics are not fully collected when writing data, whether through the Arrow DataFrame API or the add_files method.

Example 1: Stats after using the add_files method (shown using Trino show stats):

show stats for iceberg_test.tests.my_table_add_files;

+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size|distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null     |null                 |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |null     |null                 |null          |null     |null                |null               |
|probability              |null     |null                 |0             |null     |0.2500010132789612  |1.0                |
|null                     |null     |null                 |null          |897861060|null                |null               |
+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+

As you can see, the location column has no statistics at all when using add_files.
In addition, while calling add_files I saw the following message many times (the operation itself succeeded):

PyArrow statistics missing for column 1 when writing file

Example 2: Stats after using the Arrow DataFrame API to load the same Parquet files (shown using Trino show stats):

show stats for iceberg_test.tests.my_table_df;

+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null       |null                 |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |14001826255|null                 |0             |null     |null                |null               |
|probability              |null       |null                 |0             |null     |0.2500010132789612  |1.0                |
|null                     |null       |null                 |null          |897861060|null                |null               |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+

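On the other hand, after collecting table statistics with Trino's analyze, the column statistics look more complete. distinct_values_count is now populated for every column, although low_value and high_value for location are still null: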

analyze iceberg_test.tests.my_table_df;

show stats for iceberg_test.tests.my_table_df;

+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null       |446189731            |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |14001826255|15973993             |0             |null     |null                |null               |
|probability              |null       |4159975              |0             |null     |0.2500010132789612  |1.0                |
|null                     |null       |null                 |null          |897861060|null                |null               |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+

Willingness to contribute

I can contribute a fix for this bug independently
I would be willing to contribute a fix for this bug with guidance from the Iceberg community
I cannot contribute a fix for this bug at this time