Core: Support DV for partition stats #13425
Conversation
```java
    DATA_RECORD_COUNT,
    DATA_FILE_COUNT,
    TOTAL_DATA_FILE_SIZE_IN_BYTES,
    NestedField.required(
```
As per the spec PR (#12098), these fields are required now.
```java
  public static final NestedField LAST_UPDATED_SNAPSHOT_ID =
      NestedField.optional(12, "last_updated_snapshot_id", LongType.get());
  public static final NestedField DV_COUNT =
      NestedField.required(13, "dv_count", IntegerType.get());
```
Schema id, field name, data type, and the required flag are as per the spec (#12098).
```diff
  private static PartitionMap<PartitionStats> collectStatsForManifest(
-     Table table, ManifestFile manifest, StructType partitionType, boolean incremental) {
+     Table table,
+     int version,
```
Even though we could deduce the version inside this API, I didn't want to compute it for each thread. Hence, it is computed once in the caller.
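A minimal sketch of that caller-side pattern, assuming the `TableUtil.formatVersion(Table)` helper and a plain executor; the method shapes here are illustrative, not the exact code in this PR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.util.TableUtil;

class VersionOnceSketch {
  void collect(Table table, Snapshot snapshot) throws Exception {
    // Resolve the format version once, in the caller ...
    int version = TableUtil.formatVersion(table);

    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Future<Object>> futures = new ArrayList<>();
    for (ManifestFile manifest : snapshot.allManifests(table.io())) {
      // ... so each per-manifest task reuses it instead of recomputing it
      // on every worker thread.
      futures.add(pool.submit(() -> collectStatsForManifest(table, version, manifest)));
    }

    for (Future<Object> future : futures) {
      future.get(); // propagate any per-manifest failure
    }
    pool.shutdown();
  }

  // Stand-in for the private PartitionStatsHandler method; the real one also
  // takes the partition type and the incremental flag.
  Object collectStatsForManifest(Table table, int version, ManifestFile manifest) {
    return version;
  }
}
```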
cc: @aokolnychyi, @stevenzwu, @pvary, @lirui-apache, @deniskuzZ, @RussellSpitzer: Please take a look.
I went through the PR, but I'm not convinced that we need a different type for v3. WDYT?
Table metadata is a JSON file, and JSON can have optional fields that are simply omitted during write/read serialization. But this is a schema-based Parquet or Avro file. If we look at the manifests, we do have similar versioned fields. Plus, I felt it is cleaner for v2 writers and readers to not have the v3 members (as null) when they read it.
The corresponding public interface is [...]; I think we should follow the same pattern here.
I agree with @pvary on the reasoning and the comparison with the metadata and manifest files. A new optional `dv_count` field should work.
Thanks @pvary and @stevenzwu for the response. I will try out your approach and get back if there are any problems with it.
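For illustration, a minimal sketch of the manifest-style, single-schema pattern discussed above, with the new counter declared optional so files written before the field existed simply lack the column. Field ids and companion fields here are illustrative; the PR ultimately lands on a required field with an initial default, shown further below:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

class SingleSchemaSketch {
  // One schema for all format versions: the new counter is optional, so
  // readers of older files get null back (and can treat it as 0) instead of
  // needing a separate v2/v3 stats type.
  static final Types.NestedField DV_COUNT =
      Types.NestedField.optional(13, "dv_count", Types.IntegerType.get());

  static final Schema SCHEMA =
      new Schema(
          Types.NestedField.required(1, "data_record_count", Types.LongType.get()),
          Types.NestedField.required(2, "data_file_count", Types.IntegerType.get()),
          DV_COUNT);
}
```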
```java
        this.lastUpdatedSnapshotId = (Long) value;
        break;
      case 12:
        this.dvCount = value == null ? 0 : (int) value;
```
Defaulting to 0, just like the other counters (pos/eq delete counts) default to 0.
@pvary, @stevenzwu: Please take another look. I have addressed the comments.
Restarting the build due to a flaky Spark test.
@pvary, @stevenzwu, @nastra: I have addressed the comments. I also found an issue when v3 reads v2 stats: since DV is a required field in the schema, the read failed, and incremental compute fell back to full compute because of that. I fixed it using the "default value" feature of v3, so incremental compute still works with this upgrade. I have added a test. Please take another look at this PR. Thanks.
```java
      NestedField.required("dv_count")
          .withId(13)
          .ofType(Types.IntegerType.get())
          .withInitialDefault(Literal.of(0))
```
Using default 0 for v3, because when we try to read v2 stats with the v3 schema, we get a "missing required field" error without the default-value configuration, as dv_count is a required field in the schema.
Callstack:

```
Missing required field: dv_count
java.lang.IllegalArgumentException: Missing required field: dv_count
    at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.defaultReader(BaseParquetReaders.java:269)
    at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:252)
    at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.message(BaseParquetReaders.java:219)
    at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.message(BaseParquetReaders.java:207)
    at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:48)
    at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:67)
    at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:59)
    at org.apache.iceberg.data.parquet.InternalReader.create(InternalReader.java:40)
    at org.apache.iceberg.parquet.Parquet$ReadBuilder.lambda$build$0(Parquet.java:1368)
    at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:121)
    at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:74)
    at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:94)
    at org.apache.iceberg.io.CloseableIterable$7$1.<init>(CloseableIterable.java:205)
    at org.apache.iceberg.io.CloseableIterable$7.iterator(CloseableIterable.java:204)
    at org.apache.iceberg.io.CloseableIterable$7.iterator(CloseableIterable.java:196)
    at org.apache.iceberg.relocated.com.google.common.collect.Lists.newArrayList(Lists.java:139)
    at org.apache.iceberg.PartitionStatsHandlerTestBase.testV2toV3SchemaEvolution(PartitionStatsHandlerTestBase.java:695)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
```
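A minimal sketch of the resulting fix, assembled only from the builder calls shown in the diff above; the trailing `.build()` call and the wrapping `Schema` are assumptions:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.types.Types.NestedField;

class DefaultValueSketch {
  // Required v3 field with an initial default: when a reader projects this
  // schema over a stats file written before dv_count existed (v2), the
  // missing column resolves to 0 instead of failing with
  // "Missing required field: dv_count".
  static final NestedField DV_COUNT =
      NestedField.required("dv_count")
          .withId(13)
          .ofType(Types.IntegerType.get())
          .withInitialDefault(Literal.of(0))
          .build();

  static final Schema SCHEMA = new Schema(DV_COUNT);
}
```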
Awesome job adding a unit test for this.
nastra left a comment: LGTM once the comments around testing have been addressed.
Merging this now. If there are more review comments, we can follow up separately.
Thanks @ajantha-bhat for the contribution, and @pvary and @nastra for the reviews.
fixes #13180