Apache Iceberg version
0.8.1 (latest release)
Please describe the bug 🐞
Following this Slack thread:
It seems that column statistics are not fully collected when writing data, whether through the Arrow DataFrame API or the add_files method.
Example 1: Stats after using the add_files method (shown using Trino show stats):
show stats for iceberg_test.tests.my_table_add_files;
+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size|distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null     |null                 |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |null     |null                 |null          |null     |null                |null               |
|probability              |null     |null                 |0             |null     |0.2500010132789612  |1.0                |
|null                     |null     |null                 |null          |897861060|null                |null               |
+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
As you can see, the location column is missing statistics when using add_files.
In addition, while calling add_files I encountered this error many times (but the operation succeeded):
PyArrow statistics missing for column 1 when writing file
Example 2: Stats after using the Arrow DataFrame API to load the same Parquet files (shown using Trino show stats):
show stats for iceberg_test.tests.my_table_df;
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null       |null                 |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |14001826255|null                 |0             |null     |null                |null               |
|probability              |null       |null                 |0             |null     |0.2500010132789612  |1.0                |
|null                     |null       |null                 |null          |897861060|null                |null               |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
On the other hand, after collecting table statistics using Trino, the column statistics look more complete:
analyze iceberg_test.tests.my_table_df;
show stats for iceberg_test.tests.my_table_df;
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null       |446189731            |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |14001826255|15973993             |0             |null     |null                |null               |
|probability              |null       |4159975              |0             |null     |0.2500010132789612  |1.0                |
|null                     |null       |null                 |null          |897861060|null                |null               |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time