Skip to content

Core: Support DV for partition stats#13425

Merged
stevenzwu merged 6 commits intoapache:mainfrom
ajantha-bhat:dv_stats
Jul 7, 2025
Merged

Core: Support DV for partition stats#13425
stevenzwu merged 6 commits intoapache:mainfrom
ajantha-bhat:dv_stats

Conversation

@ajantha-bhat
Copy link
Member

fixes #13180

@github-actions github-actions bot added the core label Jun 30, 2025
@ajantha-bhat ajantha-bhat marked this pull request as draft June 30, 2025 01:58
DATA_RECORD_COUNT,
DATA_FILE_COUNT,
TOTAL_DATA_FILE_SIZE_IN_BYTES,
NestedField.required(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the spec PR (these fields are required now)
#12098

public static final NestedField LAST_UPDATED_SNAPSHOT_ID =
NestedField.optional(12, "last_updated_snapshot_id", LongType.get());
public static final NestedField DV_COUNT =
NestedField.required(13, "dv_count", IntegerType.get());
Copy link
Member Author

@ajantha-bhat ajantha-bhat Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schema id, field name, data type and required field as per spec
#12098

private static PartitionMap<PartitionStats> collectStatsForManifest(
Table table, ManifestFile manifest, StructType partitionType, boolean incremental) {
Table table,
int version,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though we can deduce the version inside this API, I didn't want to compute the version for each thread. Hence, computed in the caller.

@ajantha-bhat ajantha-bhat marked this pull request as ready for review June 30, 2025 05:43
@ajantha-bhat
Copy link
Member Author

cc: @aokolnychyi, @stevenzwu, @pvary, @lirui-apache, @deniskuzZ, @RussellSpitzer: Please take a look.

@pvary
Copy link
Contributor

pvary commented Jun 30, 2025

I went through the PR, but I'm not convinced that we need a different type for PartitionStatsV3.
We don't have different type for TableMetadata for different spec versions. We just added the new optional fields which might not be filled, and kept the other fields optional.

WDYT?

@ajantha-bhat
Copy link
Member Author

I went through the PR, but I'm not convinced that we need a different type for PartitionStatsV3.
We don't have different type for TableMetadata for different spec versions. We just added the new optional fields which might not be filled, and kept the other fields optional.

Table metadata is JSON file and json can have optional fields and can omit during the write/read serialization. But this is schema based parquet or avro file. If we see the manifest, we do have like V1Metadata.java, V2Metadata.java, V3Metadata.java . Similarly I followed a new object.

Plus I felt it is clean for v2 writers or readers to no have the members of v3 (as null) when they read it.
Lastly, PartitionStats is StructLike and get and set cannot accept format version. So, we may need code duplication and need to store version info for PartitionStats which is not a good idea.

@pvary
Copy link
Contributor

pvary commented Jul 1, 2025

I went through the PR, but I'm not convinced that we need a different type for PartitionStatsV3.
We don't have different type for TableMetadata for different spec versions. We just added the new optional fields which might not be filled, and kept the other fields optional.

Table metadata is JSON file and json can have optional fields and can omit during the write/read serialization. But this is schema based parquet or avro file. If we see the manifest, we do have like V1Metadata.java, V2Metadata.java, V3Metadata.java . Similarly I followed a new object.

V1Metadata.java, V2Metadata.java, V3Metadata.java are package private classes. They are not exposed to the users.

The corresponding public interface is ManifestFile which behaves as I have suggested for PartitionStats. The ManifestFile contains accessors for every fields from V1/V2/V3, and V1Metadata.ManifestFileWrapper, implements ManifestFile.

I think we should follow the same pattern here.

@stevenzwu
Copy link
Contributor

I agree with @pvary on the reasoning and comparison with metadata and manifiest file

A new optional DV_COUNT field is probably good, which will also result in simpler code.

@ajantha-bhat
Copy link
Member Author

Thanks @pvary and @stevenzwu for the response. I will try it out your approach and get back on this if any problems for this approach.

this.lastUpdatedSnapshotId = (Long) value;
break;
case 12:
this.dvCount = value == null ? 0 : (int) value;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaulting to 0. Just like other counters (pos/eq deletes) were default to 0.

@ajantha-bhat
Copy link
Member Author

@pvary, @stevenzwu : Please take another look. I have addressed the comments.

@ajantha-bhat
Copy link
Member Author

Restarting the build due to Spark flaky test

@ajantha-bhat ajantha-bhat reopened this Jul 2, 2025
@github-actions github-actions bot added the ORC label Jul 7, 2025
@ajantha-bhat
Copy link
Member Author

@pvary, @stevenzwu, @nastra: I have addressed comments and also I found an issue when v3 reading the v2 stats as DV is required field in schema. The reading failed and incremental compute fallback to full compute because of that. I fixed it using "default value" feature of v3. Incremental compute still works with this upgrade. I have added the test. Please take another look at this PR. Thanks.

NestedField.required("dv_count")
.withId(13)
.ofType(Types.IntegerType.get())
.withInitialDefault(Literal.of(0))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using default 0 for v3. Because when we try to read v2 stats with v3 schema, we get the field not found error without the default values configuration as dv is a required field in schema.

callstack:

Missing required field: dv_count
	java.lang.IllegalArgumentException: Missing required field: dv_count
	at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.defaultReader(BaseParquetReaders.java:269)
	at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:252)
	at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.message(BaseParquetReaders.java:219)
	at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.message(BaseParquetReaders.java:207)
	at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:48)
	at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:67)
	at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:59)
	at org.apache.iceberg.data.parquet.InternalReader.create(InternalReader.java:40)
	at org.apache.iceberg.parquet.Parquet$ReadBuilder.lambda$build$0(Parquet.java:1368)
	at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:121)
	at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:74)
	at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:94)
	at org.apache.iceberg.io.CloseableIterable$7$1.<init>(CloseableIterable.java:205)
	at org.apache.iceberg.io.CloseableIterable$7.iterator(CloseableIterable.java:204)
	at org.apache.iceberg.io.CloseableIterable$7.iterator(CloseableIterable.java:196)
	at org.apache.iceberg.relocated.com.google.common.collect.Lists.newArrayList(Lists.java:139)
	at org.apache.iceberg.PartitionStatsHandlerTestBase.testV2toV3SchemaEvolution(PartitionStatsHandlerTestBase.java:695)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome job for adding a unit test for this

@ajantha-bhat ajantha-bhat changed the title core: Support DV for partition stats Core: Support DV for partition stats Jul 7, 2025
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the comments around testing have been addressed

@ajantha-bhat ajantha-bhat mentioned this pull request Jul 7, 2025
13 tasks
@stevenzwu
Copy link
Contributor

merge this now. if there are more review comments, we can follow up separately

@stevenzwu stevenzwu merged commit 9f91295 into apache:main Jul 7, 2025
42 checks passed
@stevenzwu
Copy link
Contributor

thanks @ajantha-bhat for the contribution and @pvary @nastra for the reviews

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support DV for partition stats

4 participants