Data lost when a large number of equality delete records exist #624

@CodingJun

Description

When a table contains a large number of equality delete records, some rows that should be returned by queries go missing, even though they are not logically deleted. Reading the same table with other engines (e.g., Apache Spark) returns all expected rows, which suggests the table data and delete files are valid and that the problem is specific to DuckDB's equality-delete implementation or scan-planning logic.

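For reference, a minimal cross-check along these lines might look as follows; the metadata path, catalog, and table names are placeholders, not values from this report.

  -- DuckDB: scan the table through the iceberg extension; when the bug triggers,
  -- this returns fewer rows than expected.
  SELECT count(*) FROM iceberg_scan('/path/to/table/metadata/vN.metadata.json');

  -- Spark SQL: the same count against the same table through an Iceberg catalog
  -- returns the correct number of non-deleted rows.
  SELECT count(*) FROM my_catalog.db.my_table;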
Env

DuckDB version: 1.4.1 (also occurs in 1.4.3)
Iceberg version: 1.5.2
Spark version: 3.5.4

Steps to reproduce

  1. Create a table and insert a dataset of X rows (e.g., 50,000).
  2. Write a large number of equality delete records, Y (e.g., 49,995); see the sketch after this list.
  3. Run a query such as: SELECT * FROM <table>.
  4. Observe that some rows that are not deleted are missing from the result.

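A rough Spark SQL sketch of step 1 is shown below; the catalog and table names (my_catalog.db.equality_test) are placeholders rather than values from this report. Step 2 is not shown, because a plain Spark DELETE under merge-on-read usually writes position deletes rather than equality deletes, so the equality delete files here would typically come from an upsert-style writer or the Iceberg API.

  -- Create an Iceberg v2 table (placeholder names).
  CREATE TABLE my_catalog.db.equality_test (_id STRING, payload STRING)
  USING iceberg
  TBLPROPERTIES ('format-version' = '2');

  -- Insert ~50,000 rows with ids of the form 'id_000001' .. 'id_050000'.
  INSERT INTO my_catalog.db.equality_test
  SELECT concat('id_', lpad(cast(id AS STRING), 6, '0')),
         concat('payload_', cast(id AS STRING))
  FROM range(1, 50001);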
Expected behavior

Queries should return all rows that are not logically deleted, regardless of the number of equality delete records.

Actual behavior

Some valid rows are filtered out when the number of equality delete records grows large.

Example (for quick reproduction)

  1. Download the attachment equality_test.zip.
  2. Unzip the archive and place its contents in the directory /tmp/data/.
  3. In DuckDB, execute select * from iceberg_scan('/tmp/data/equality_test/metadata/v3.metadata.json'); (a consolidated session is sketched below).
  4. Observe that the result is empty.
  5. In DuckDB, execute select * from iceberg_scan('/tmp/data/equality_test/metadata/v3.metadata.json') where _id = 'id_183550';.
  6. Observe that the result is not empty.

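Taken together, a consolidated DuckDB session for the steps above might look like this; the final iceberg_metadata call is only a suggested debugging aid for inspecting which data and delete files the snapshot references, and its exact output may vary by extension version.

  INSTALL iceberg;
  LOAD iceberg;

  -- Step 3: full scan; unexpectedly returns no rows.
  SELECT * FROM iceberg_scan('/tmp/data/equality_test/metadata/v3.metadata.json');

  -- Step 5: point lookup on a known surviving row; returns the row even though
  -- the full scan above missed it.
  SELECT *
  FROM iceberg_scan('/tmp/data/equality_test/metadata/v3.metadata.json')
  WHERE _id = 'id_183550';

  -- Optional: list the files referenced by the scanned snapshot.
  SELECT * FROM iceberg_metadata('/tmp/data/equality_test/metadata/v3.metadata.json');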