Bug Description
What happened:
A failed Hudi write can leave behind orphan parquet files — either truncated/incomplete files or complete-but-uncommitted files belonging to an aborted instant. Hudi's Rollback action is responsible for removing them, using either the marker-based strategy (.hoodie/.temp/<instant>/) or the listing-based strategy (matching files against the failed instant's commit time embedded in the filename).
If Rollback itself partially fails — crash mid-rollback, marker loss, or a blocked storage write that lands after rollback already completed (for example, a slow object-store close() call that returns long after the writer task was cancelled by the engine) — some orphan files persist on disk.
While the failed instant and its rollback record remain in the active timeline, the orphans are still recoverable: a retried rollback or a reconciliation pass can match the orphan's filename against the failed instant and delete it.
Once Archive moves the rollback record out of the active timeline (and eventually past archived-timeline retention), the system loses every metadata anchor that would let any process identify those files as orphans. Critically, orphans from failed writes typically have a fileId with no surviving committed sibling slice, so the reader's "latest slice within a file group" logic treats the orphan as a legitimate one-slice file group.
End-user impact:
- Incomplete files → readers fail with corrupt/invalid parquet errors
- Complete-but-uncommitted files → readers return duplicate records
This violates the reader/writer isolation guarantee Hudi provides via the timeline.
What you expected:
Files created by writes that ultimately did not commit should never become reader-visible, regardless of how long the table runs or whether Archive has pruned the rollback record. The reader/writer isolation guarantee should hold across the entire lifecycle of the table, not only while the failed instant happens to still be in the active timeline.
Steps to reproduce:
- Configure any Hudi write path (Flink or Spark; both reproduce). Use
hoodie.clean.failed.writes.policy=LAZY (typical for multi-writer or Flink ingestion).
- Induce a partial-write failure where the data file lands but the instant never commits. One reliable producer in production: a blocked storage
close() that returns after the writer task has been cancelled by the engine. Any storage layer with intermittent congestion or blocking-close semantics under load (object stores, certain HDFS configurations, etc.) makes this easy to observe.
- Let auto-rollback fire (via the next writer detecting an expired heartbeat). Verify some orphan files remain — for example, because the blocked
close() had not yet landed when rollback ran its listing-based scan.
- Let
hoodie.keep.min.commits / hoodie.keep.max.commits archive the failed instant and its rollback record.
- Query the table from any reader path (Spark, Flink, Presto/Trino, Hive). Depending on whether the orphan is truncated or complete, the query either fails with parquet read errors or returns duplicate records.
Proposal
Add an archive precondition that prevents a rollback instant from being archived until its orphans are provably gone.
Concretely, at the archive planner:
- For each rollback instant about to be archived, parse its
HoodieRollbackMetadata to obtain:
- The target instant being rolled back
- The partitions it touched
- The list of files it claimed to delete (
successDeleteFiles ∪ failedDeleteFiles)
- List the partitions named in the rollback metadata and find any files whose embedded instant matches the target instant (regex:
*_<targetInstant>.parquet for base files; .<fileId>_<writeToken>.log.<version>_<targetInstant> for MoR log files).
- Cross-check against the rollback's claimed-deleted list. Any file still present that wasn't in the claimed-deleted list (or was in
failedDeleteFiles) is a candidate orphan.
- Confirm via timeline lookup that the file's instant is not a
COMPLETED commit anywhere in (active ∪ archived) timeline.
- If candidate orphans remain, defer archival of this rollback instant. Two reasonable actions:
- Re-invoke the rollback (which can still find and delete the orphan now using markers or listing), or
- Surface a metric/alert for operator intervention.
This is a per-cycle boundary check — it does not require scanning the entire history every time. The induction property is: as long as no rollback instant is ever archived without proving its orphans gone, no rollback orphan can ever cross into the post-archive "invisible" state.
Complementary measure: extend marker retention. Today markers under .hoodie/.temp/<instant>/ are deleted as the last step of rollback, which means by the time a post-rollback orphan appears (for example, from a blocked close() landing late), the markers are already gone. Retaining markers until the rollback instant itself is archived would give the precondition check an authoritative file list to verify against, rather than relying on partition listings.
One-time repair: a CLI or admin tool that runs the same scan against archived rollback instants (before they age past archived-timeline retention) to clean up any orphans created by pre-existing partial rollbacks before this fix lands.
Bug Description
What happened:
A failed Hudi write can leave behind orphan parquet files — either truncated/incomplete files or complete-but-uncommitted files belonging to an aborted instant. Hudi's Rollback action is responsible for removing them, using either the marker-based strategy (
.hoodie/.temp/<instant>/) or the listing-based strategy (matching files against the failed instant's commit time embedded in the filename).If Rollback itself partially fails — crash mid-rollback, marker loss, or a blocked storage write that lands after rollback already completed (for example, a slow object-store
close()call that returns long after the writer task was cancelled by the engine) — some orphan files persist on disk.While the failed instant and its rollback record remain in the active timeline, the orphans are still recoverable: a retried rollback or a reconciliation pass can match the orphan's filename against the failed instant and delete it.
Once Archive moves the rollback record out of the active timeline (and eventually past archived-timeline retention), the system loses every metadata anchor that would let any process identify those files as orphans. Critically, orphans from failed writes typically have a
fileIdwith no surviving committed sibling slice, so the reader's "latest slice within a file group" logic treats the orphan as a legitimate one-slice file group.End-user impact:
This violates the reader/writer isolation guarantee Hudi provides via the timeline.
What you expected:
Files created by writes that ultimately did not commit should never become reader-visible, regardless of how long the table runs or whether Archive has pruned the rollback record. The reader/writer isolation guarantee should hold across the entire lifecycle of the table, not only while the failed instant happens to still be in the active timeline.
Steps to reproduce:
hoodie.clean.failed.writes.policy=LAZY(typical for multi-writer or Flink ingestion).close()that returns after the writer task has been cancelled by the engine. Any storage layer with intermittent congestion or blocking-close semantics under load (object stores, certain HDFS configurations, etc.) makes this easy to observe.close()had not yet landed when rollback ran its listing-based scan.hoodie.keep.min.commits/hoodie.keep.max.commitsarchive the failed instant and its rollback record.Proposal
Add an archive precondition that prevents a rollback instant from being archived until its orphans are provably gone.
Concretely, at the archive planner:
HoodieRollbackMetadatato obtain:successDeleteFiles∪failedDeleteFiles)*_<targetInstant>.parquetfor base files;.<fileId>_<writeToken>.log.<version>_<targetInstant>for MoR log files).failedDeleteFiles) is a candidate orphan.COMPLETEDcommit anywhere in(active ∪ archived)timeline.This is a per-cycle boundary check — it does not require scanning the entire history every time. The induction property is: as long as no rollback instant is ever archived without proving its orphans gone, no rollback orphan can ever cross into the post-archive "invisible" state.
Complementary measure: extend marker retention. Today markers under
.hoodie/.temp/<instant>/are deleted as the last step of rollback, which means by the time a post-rollback orphan appears (for example, from a blockedclose()landing late), the markers are already gone. Retaining markers until the rollback instant itself is archived would give the precondition check an authoritative file list to verify against, rather than relying on partition listings.One-time repair: a CLI or admin tool that runs the same scan against archived rollback instants (before they age past archived-timeline retention) to clean up any orphans created by pre-existing partial rollbacks before this fix lands.