Spark: Include info about failed commits in the result of RewriteDataFilesAction#15226
Spark: Include info about failed commits in the result of RewriteDataFilesAction#15226omerhadari wants to merge 2 commits intoapache:mainfrom
Conversation
When partial progress is enabled and a commit fails, the file groups in that batch were previously lost. This change tracks them in a new failedRewrites queue and exposes them via the failures() method, allowing RewriteDataFilesSparkAction to include them in the result's rewriteFailures list.
There was a problem hiding this comment.
Pull request overview
This PR ensures commit failures are reflected in RewriteDataFilesAction results when partial progress is enabled, so failed file groups are no longer “lost” from the aggregated result.
Changes:
- Track and expose file groups that failed during commit batches in
BaseCommitService. - Include commit-failed file groups in
RewriteDataFilesSparkAction’s aggregatedrewriteFailuresandfailedDataFilesCount. - Extend Spark action tests across Spark versions to assert failures/counts when a commit fails.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java | Adds commit-failed file groups into the action result’s failure aggregation. |
| spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java | Asserts failures and failed data file counts for commit-failure partial progress. |
| spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java | Same as Spark 4.1 change for older Spark line. |
| spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java | Same as Spark 4.1 test tightening for older Spark line. |
| spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java | Same as Spark 4.1 change for Spark 3.5 line. |
| spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java | Same as Spark 4.1 test tightening for Spark 3.5 line. |
| spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java | Same as Spark 4.1 change for Spark 3.4 line. |
| spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java | Same as Spark 4.1 test tightening for Spark 3.4 line. |
| core/src/main/java/org/apache/iceberg/actions/BaseCommitService.java | Tracks failed commit batches’ groups and exposes them via a new failures() API. |
| core/src/test/java/org/apache/iceberg/actions/TestCommitService.java | Adds test coverage to ensure commit failures are tracked and queryable. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /** Returns all File groups which failed to commit */ | ||
| public List<T> failures() { | ||
| Preconditions.checkState( | ||
| committerService.isShutdown(), | ||
| "Cannot get failures from a service which has not been closed"); | ||
| return Lists.newArrayList(failedRewrites.iterator()); | ||
| } |
There was a problem hiding this comment.
The new Javadoc and precondition message are a bit unclear/grammatically awkward. Consider updating to clarify that these are file groups from commit batches that threw, and tweak wording/capitalization (e.g., “file groups that failed to commit” and “service that has not been closed”).
There was a problem hiding this comment.
same exact wording as here in the same file https://github.com/apache/iceberg/pull/15226/changes#diff-f048416da96d0fc04bbd5760809a16301c63cd838c27fbd1070efdbf3f29c704R160
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
Outdated
Show resolved
Hide resolved
…s/TestRewriteDataFilesAction.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
When partial progress is enabled and a commit fails, the file groups in that batch were previously lost. When looking at the result,
failedDataFilesCountis0when commits failed.When some commits failed and some succeeded, it's unclear by looking at the result that it is actually partial.
One choice I'm not sure about and am happy to get feedback on is that this change makes it so the creation of result objects now treats commit failures as other rewrite failures (both failure count and failed data files).
Given the aggregative nature of the result object, I couldn't think of a situation where the distinction is important enough to justify changing its structure, but I am open to changing it to be counted separately.
I haven't seen an AI policy, but FYI I used AI to implement tests and changed them manually to fit the style other similar tests in the same area.