
Some column statistics are missing after writing data to a table #1482

Open
@rotem-ad

Description

Apache Iceberg version

0.8.1 (latest release)

Please describe the bug 🐞

Following this Slack thread:

It seems that column statistics are not fully collected when writing data, whether through the Arrow Dataframe API or the add_files method.

Example 1: Stats after using add_files method (shown using Trino show stats):

show stats for iceberg_test.tests.my_table_add_files;

+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size|distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null     |null                 |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |null     |null                 |null          |null     |null                |null               |
|probability              |null     |null                 |0             |null     |0.2500010132789612  |1.0                |
|null                     |null     |null                 |null          |897861060|null                |null               |
+-------------------------+---------+---------------------+--------------+---------+--------------------+-------------------+

As you can see, the location column has no statistics at all when using add_files.
In addition, add_files logged the following warning many times, although the operation succeeded:

PyArrow statistics missing for column 1 when writing file

Example 2: Stats after using Arrow Dataframe API to load the same Parquet files (shown using Trino show stats):

show stats for iceberg_test.tests.my_table_df;

+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null       |null                 |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |14001826255|null                 |0             |null     |null                |null               |
|probability              |null       |null                 |0             |null     |0.2500010132789612  |1.0                |
|null                     |null       |null                 |null          |897861060|null                |null               |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
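
To make the difference between the two outputs explicit, here is a small sketch that parses the data rows of both tables and reports the statistics that differ. The helper names (`parse_row`, `diff_stats`) are hypothetical, and the rows are copied from the two `show stats` outputs above with padding removed.

```python
STATS = ("data_size", "distinct_values_count", "nulls_fraction",
         "row_count", "low_value", "high_value")


def parse_row(row):
    """Split one pipe-delimited `show stats` row into (column_name, {stat: value})."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    return cells[0], dict(zip(STATS, cells[1:]))


def diff_stats(rows_a, rows_b):
    """Return {column: {stat: (value_a, value_b)}} for every statistic that differs."""
    a = dict(parse_row(r) for r in rows_a)
    b = dict(parse_row(r) for r in rows_b)
    return {
        col: {s: (a[col][s], b[col][s]) for s in STATS if a[col][s] != b[col][s]}
        for col in a
        if col in b and a[col] != b[col]
    }


# Rows from Example 1 (add_files) and Example 2 (Arrow Dataframe API).
add_files_rows = [
    "|id|null|null|0|null|-9223372031483295744|9223372014179617792|",
    "|location|null|null|null|null|null|null|",
    "|probability|null|null|0|null|0.2500010132789612|1.0|",
]
dataframe_rows = [
    "|id|null|null|0|null|-9223372031483295744|9223372014179617792|",
    "|location|14001826255|null|0|null|null|null|",
    "|probability|null|null|0|null|0.2500010132789612|1.0|",
]

print(diff_stats(add_files_rows, dataframe_rows))
```

Only location differs: data_size and nulls_fraction are populated after the Dataframe write but null after add_files.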

On the other hand, after collecting table statistics using Trino, the column statistics look more complete:

analyze iceberg_test.tests.my_table_df;

show stats for iceberg_test.tests.my_table_df;

+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|column_name              |data_size  |distinct_values_count|nulls_fraction|row_count|low_value           |high_value         |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+
|id                       |null       |446189731            |0             |null     |-9223372031483295744|9223372014179617792|
|location                 |14001826255|15973993             |0             |null     |null                |null               |
|probability              |null       |4159975              |0             |null     |0.2500010132789612  |1.0                |
|null                     |null       |null                 |null          |897861060|null                |null               |
+-------------------------+-----------+---------------------+--------------+---------+--------------------+-------------------+

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
