Core: Support DV for partition stats by ajantha-bhat · Pull Request #13425 · apache/iceberg

ajantha-bhat · 2025-06-30T01:57:59Z

ajantha-bhat · 2025-06-30T05:04:15Z

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

+        DATA_RECORD_COUNT,
+        DATA_FILE_COUNT,
+        TOTAL_DATA_FILE_SIZE_IN_BYTES,
+        NestedField.required(


As per the spec PR #12098, these fields are now required.

ajantha-bhat · 2025-06-30T05:04:38Z

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

  public static final NestedField LAST_UPDATED_SNAPSHOT_ID =
      NestedField.optional(12, "last_updated_snapshot_id", LongType.get());
+  public static final NestedField DV_COUNT =
+      NestedField.required(13, "dv_count", IntegerType.get());


Field ID, field name, data type, and required flag are as per the spec (#12098).

ajantha-bhat · 2025-06-30T05:11:46Z

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

  private static PartitionMap<PartitionStats> collectStatsForManifest(
-      Table table, ManifestFile manifest, StructType partitionType, boolean incremental) {
+      Table table,
+      int version,


Even though we can deduce the version inside this API, I didn't want to compute it for each thread, so it is computed once in the caller.
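A minimal sketch of the idea, assuming hypothetical stand-in names (`formatVersion`, `collectStats`, and the string-based table and manifest values are illustrative only and are not the Iceberg implementation): the version is derived once in the caller and passed to each per-manifest worker.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Hypothetical stand-ins; none of this is taken from the Iceberg code base.
final class VersionHoistingSketch {
  // Counts how often the version is derived, to make the saving visible.
  static final AtomicInteger LOOKUPS = new AtomicInteger();

  // Stand-in for deriving the format version from table metadata.
  static int formatVersion(String table) {
    LOOKUPS.incrementAndGet();
    return table.endsWith("_v3") ? 3 : 2;
  }

  // Per-manifest worker: receives the version instead of deriving it again.
  static String collectStatsForManifest(String manifest, int version) {
    return manifest + "@v" + version;
  }

  static List<String> collectStats(String table, List<String> manifests) {
    int version = formatVersion(table); // computed once, in the caller
    return manifests.parallelStream()
        .map(manifest -> collectStatsForManifest(manifest, version))
        .collect(Collectors.toList());
  }
}
```

With this shape, the lookup count stays at one regardless of how many manifests are processed in parallel.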

ajantha-bhat · 2025-06-30T06:25:50Z

cc: @aokolnychyi, @stevenzwu, @pvary, @lirui-apache, @deniskuzZ, @RussellSpitzer: Please take a look.

pvary · 2025-06-30T09:34:27Z

I went through the PR, but I'm not convinced that we need a different type for PartitionStatsV3.
We don't have a different type for TableMetadata for different spec versions. We just added the new optional fields, which might not be filled, and kept the other fields optional.

WDYT?

ajantha-bhat · 2025-06-30T12:10:49Z

> I went through the PR, but I'm not convinced that we need a different type for PartitionStatsV3.
> We don't have a different type for TableMetadata for different spec versions. We just added the new optional fields, which might not be filled, and kept the other fields optional.

Table metadata is a JSON file, and JSON can have optional fields that can be omitted during write/read serialization. But this is a schema-based Parquet or Avro file. If we look at the manifest, we do have V1Metadata.java, V2Metadata.java, and V3Metadata.java. Similarly, I followed that with a new object.

Plus, I felt it is cleaner for v2 writers or readers to not have the members of v3 (as null) when they read it.
Lastly, PartitionStats is StructLike, and get and set cannot accept a format version. So we may need code duplication and would need to store version info in PartitionStats, which is not a good idea.

pvary · 2025-07-01T08:58:55Z

> > I went through the PR, but I'm not convinced that we need a different type for PartitionStatsV3.
> > We don't have a different type for TableMetadata for different spec versions. We just added the new optional fields, which might not be filled, and kept the other fields optional.
>
> Table metadata is a JSON file, and JSON can have optional fields that can be omitted during write/read serialization. But this is a schema-based Parquet or Avro file. If we look at the manifest, we do have V1Metadata.java, V2Metadata.java, and V3Metadata.java. Similarly, I followed that with a new object.

V1Metadata.java, V2Metadata.java, V3Metadata.java are package private classes. They are not exposed to the users.

The corresponding public interface is ManifestFile, which behaves as I have suggested for PartitionStats. ManifestFile contains accessors for every field from V1/V2/V3, and V1Metadata.ManifestFileWrapper implements ManifestFile.

I think we should follow the same pattern here.
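A rough sketch of that pattern, under stated assumptions: `Stats`, `schemaFor`, and `toRow` below are hypothetical names for illustration, not Iceberg classes. One shared type carries every field, and only the schema it is projected onto depends on the format version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; these are not Iceberg's actual classes.
final class SingleTypeSketch {
  // One shared type carries every field, including the v3-only counter.
  static final class Stats {
    long dataRecordCount;
    int dvCount; // simply stays 0 for v2 tables
  }

  // Only the schema differs by format version.
  static List<String> schemaFor(int formatVersion) {
    List<String> fields = new ArrayList<>();
    fields.add("data_record_count");
    if (formatVersion >= 3) {
      fields.add("dv_count");
    }
    return fields;
  }

  // The writer projects the shared type onto the version-specific schema.
  static List<Object> toRow(Stats stats, int formatVersion) {
    List<Object> row = new ArrayList<>();
    for (String field : schemaFor(formatVersion)) {
      if (field.equals("dv_count")) {
        row.add(stats.dvCount);
      } else {
        row.add(stats.dataRecordCount);
      }
    }
    return row;
  }
}
```

The trade-off is that a v2 reader sees a v3 member it never fills, in exchange for avoiding a second public type.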

stevenzwu · 2025-07-01T21:11:39Z

I agree with @pvary on the reasoning and the comparison with the metadata and manifest files.

A new optional DV_COUNT field is probably good, which will also result in simpler code.

ajantha-bhat · 2025-07-02T06:29:14Z

Thanks @pvary and @stevenzwu for the response. I will try out your approach and get back on this if there are any problems with it.

ajantha-bhat · 2025-07-02T12:38:58Z

core/src/main/java/org/apache/iceberg/PartitionStats.java

        this.lastUpdatedSnapshotId = (Long) value;
        break;
+      case 12:
+        this.dvCount = value == null ? 0 : (int) value;


Defaulting to 0, just like the other counters (position/equality deletes) default to 0.
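For illustration, a trimmed-down positional setter showing the null-to-zero behavior. The class name and the single position `0` are assumptions made for this sketch; the real PartitionStats has many more positions.

```java
// Hypothetical, trimmed-down stand-in for a positional (StructLike-style) setter.
final class CounterDefaultSketch {
  private int dvCount;

  <T> void set(int pos, T value) {
    switch (pos) {
      case 0:
        // A null means no value was present for this counter; treat it as zero.
        this.dvCount = value == null ? 0 : (Integer) value;
        break;
      default:
        throw new UnsupportedOperationException("Unknown position: " + pos);
    }
  }

  int dvCount() {
    return dvCount;
  }
}
```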

ajantha-bhat · 2025-07-02T12:49:31Z

@pvary, @stevenzwu : Please take another look. I have addressed the comments.

ajantha-bhat · 2025-07-02T16:58:07Z

Restarting the build due to a flaky Spark test.

ajantha-bhat · 2025-07-07T03:55:43Z

@pvary, @stevenzwu, @nastra: I have addressed the comments. I also found an issue when v3 reads v2 stats: because dv_count is a required field in the schema, the read failed and incremental compute fell back to full compute. I fixed it using the "default value" feature of v3, so incremental compute still works after this upgrade. I have added a test. Please take another look at this PR. Thanks.

ajantha-bhat · 2025-07-07T04:04:50Z

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

+      NestedField.required("dv_count")
+          .withId(13)
+          .ofType(Types.IntegerType.get())
+          .withInitialDefault(Literal.of(0))


Using a default of 0 for v3. When we try to read v2 stats with the v3 schema, we get a "Missing required field" error unless a default value is configured, because dv_count is a required field in the schema.

Call stack:

Missing required field: dv_count
java.lang.IllegalArgumentException: Missing required field: dv_count
    at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.defaultReader(BaseParquetReaders.java:269)
    at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:252)
    at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.message(BaseParquetReaders.java:219)
    at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.message(BaseParquetReaders.java:207)
    at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:48)
    at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:67)
    at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:59)
    at org.apache.iceberg.data.parquet.InternalReader.create(InternalReader.java:40)
    at org.apache.iceberg.parquet.Parquet$ReadBuilder.lambda$build$0(Parquet.java:1368)
    at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:121)
    at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:74)
    at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:94)
    at org.apache.iceberg.io.CloseableIterable$7$1.<init>(CloseableIterable.java:205)
    at org.apache.iceberg.io.CloseableIterable$7.iterator(CloseableIterable.java:204)
    at org.apache.iceberg.io.CloseableIterable$7.iterator(CloseableIterable.java:196)
    at org.apache.iceberg.relocated.com.google.common.collect.Lists.newArrayList(Lists.java:139)
    at org.apache.iceberg.PartitionStatsHandlerTestBase.testV2toV3SchemaEvolution(PartitionStatsHandlerTestBase.java:695)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
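To illustrate why the initial default matters, here is a simplified, self-contained simulation of resolving an older row against a newer read schema. It is not Iceberg's Parquet reader: `Field`, `read`, and the map-based row are hypothetical constructs for this sketch, and only the error message text is borrowed from the trace above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified simulation of schema resolution; this is not Iceberg's Parquet reader.
final class InitialDefaultSketch {
  // Hypothetical field descriptor; initialDefault is null when no default is configured.
  static final class Field {
    final String name;
    final boolean required;
    final Object initialDefault;

    Field(String name, boolean required, Object initialDefault) {
      this.name = name;
      this.required = required;
      this.initialDefault = initialDefault;
    }
  }

  // Resolve a row written with an older schema against the newer read schema.
  static Map<String, Object> read(Map<String, Object> fileRow, List<Field> readSchema) {
    Map<String, Object> result = new LinkedHashMap<>();
    for (Field field : readSchema) {
      if (fileRow.containsKey(field.name)) {
        result.put(field.name, fileRow.get(field.name));
      } else if (field.initialDefault != null) {
        // Column absent from the old file: fill in the configured default.
        result.put(field.name, field.initialDefault);
      } else if (field.required) {
        throw new IllegalArgumentException("Missing required field: " + field.name);
      } else {
        result.put(field.name, null);
      }
    }
    return result;
  }
}
```

In this simulation, a v2-style row without `dv_count` reads back with `dv_count = 0` when the field carries a default of 0, and fails with the "Missing required field" error when it does not.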

awesome job for adding a unit test for this

nastra

LGTM once the comments around testing have been addressed

stevenzwu · 2025-07-07T21:40:14Z

Merging this now. If there are more review comments, we can follow up separately.

stevenzwu · 2025-07-07T21:41:19Z

thanks @ajantha-bhat for the contribution and @pvary @nastra for the reviews

github-actions bot added the core label Jun 30, 2025

ajantha-bhat marked this pull request as draft June 30, 2025 01:58

ajantha-bhat commented Jun 30, 2025

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java (outdated)

ajantha-bhat commented Jun 30, 2025

core: Support DV for partition stats

f05a6ed

ajantha-bhat force-pushed the dv_stats branch from b0bea96 to f05a6ed June 30, 2025 05:36

ajantha-bhat marked this pull request as ready for review June 30, 2025 05:43

ajantha-bhat commented Jun 30, 2025

core/src/main/java/org/apache/iceberg/PartitionStats.java (outdated)

pvary reviewed Jun 30, 2025

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java (outdated)

pvary reviewed Jun 30, 2025

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java (outdated)

ajantha-bhat force-pushed the dv_stats branch from 733a25e to 30adb08 June 30, 2025 13:23

Address comments

465cb22

ajantha-bhat force-pushed the dv_stats branch from 30adb08 to 465cb22 June 30, 2025 17:28

ajantha-bhat commented Jul 2, 2025

Remove PartitionStatsV3

271a970

ajantha-bhat force-pushed the dv_stats branch from f74d509 to 271a970 July 2, 2025 12:48

ajantha-bhat closed this Jul 2, 2025

ajantha-bhat reopened this Jul 2, 2025

stevenzwu reviewed Jul 2, 2025

core/src/test/java/org/apache/iceberg/PartitionStatsHandlerTestBase.java (outdated)

core/src/test/java/org/apache/iceberg/PartitionStatsHandlerTestBase.java (outdated)

stevenzwu reviewed Jul 2, 2025

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

Address new comments

9978fd5

ajantha-bhat force-pushed the dv_stats branch from c2f2fd3 to 9978fd5 July 3, 2025 08:27

nastra reviewed Jul 3, 2025

core/src/main/java/org/apache/iceberg/PartitionStats.java

nastra reviewed Jul 3, 2025

core/src/main/java/org/apache/iceberg/PartitionStats.java

nastra reviewed Jul 3, 2025

core/src/main/java/org/apache/iceberg/PartitionStats.java (outdated)

nastra reviewed Jul 3, 2025

core/src/test/java/org/apache/iceberg/PartitionStatsHandlerTestBase.java (outdated)

nastra reviewed Jul 3, 2025

core/src/test/java/org/apache/iceberg/PartitionStatsHandlerTestBase.java (outdated)

nastra reviewed Jul 3, 2025

core/src/main/java/org/apache/iceberg/PartitionStats.java (outdated)

pvary reviewed Jul 3, 2025

core/src/test/java/org/apache/iceberg/PartitionStatsHandlerTestBase.java (outdated)

Handle V3 reading V2 stats

f9ca9a6

github-actions bot added the ORC label Jul 7, 2025

ajantha-bhat commented Jul 7, 2025

ajantha-bhat changed the title ~~core: Support DV for partition stats~~ Core: Support DV for partition stats Jul 7, 2025

nastra reviewed Jul 7, 2025

core/src/test/java/org/apache/iceberg/PartitionStatsHandlerTestBase.java (outdated)

nastra reviewed Jul 7, 2025

core/src/test/java/org/apache/iceberg/PartitionStatsHandlerTestBase.java (outdated)

nastra approved these changes Jul 7, 2025

Address nits

36b85a5

ajantha-bhat mentioned this pull request Jul 7, 2025

Partition stats task tracker #8450

Closed

13 tasks

stevenzwu approved these changes Jul 7, 2025

stevenzwu merged commit 9f91295 into apache:main Jul 7, 2025
42 checks passed
