feat(blob): default blob.inline.mode to DESCRIPTOR for Lance by voonhous · Pull Request #18744 · apache/hudi

voonhous · 2026-05-15T09:13:58Z

Describe the issue this Pull Request addresses

Summary and Changelog

Flip the default of hoodie.read.blob.inline.mode from CONTENT to DESCRIPTOR and make DESCRIPTOR a true metadata-only mode for INLINE rows: plain reads on Lance tables no longer materialize inline blob payloads on every row, and read_blob() is unsupported on INLINE rows under that mode (errors with an actionable message). CONTENT remains available as an opt-in for byte-heavy workloads.

The 2-hop pread fallback that initially kept read_blob() working under DESCRIPTOR was retired: it conflated implementation detail (the synthesized reference pointer into the .lance file's storage layout) with durable user-facing metadata, leading to confusing queries like SELECT read_blob(col), col.reference.offset that returned both bytes and a Lance-internal offset.

Changes:

HoodieReaderConfig: default flipped to DESCRIPTOR; doc updated to call out that DESCRIPTOR is metadata-only for INLINE rows.
BatchedBlobReader: INLINE branch now throws IllegalStateException (message names both INLINE and DESCRIPTOR) when it encounters an INLINE row with inline_data=NULL — i.e. the Lance + DESCRIPTOR shape. The CONTENT-mode 1-hop passthrough is preserved.
ScalarFunctions: read_blob DESCRIBE FUNCTION text gains a one-line caveat about the INLINE+DESCRIPTOR throw.
TestLanceDataSource:
- testBlobInlineRoundTrip opts into CONTENT explicitly.
- testBlobInlineDescriptorMode now asserts the descriptor-shape on plain reads AND that read_blob() throws under DESCRIPTOR with an INLINE+DESCRIPTOR error message, plus a CONTENT-mode read_blob() roundtrip to pin write→read integrity.
- testBlobInlineCompactionRoundTrip (MOR) verifies INLINE bytes survive compaction under the default.
- New testBlobMixedInlineAndOutOfLineContentMode pins behavior of mixed-storage tables (INLINE + OUT_OF_LINE rows in one column) under CONTENT mode with read_blob().
rfc/rfc-100/rfc-100.md: behavior matrix, prose, and both mermaid diagrams updated to reflect the new contract (INLINE+DESCRIPTOR+read_blob() errors; OUT_OF_LINE and CONTENT-mode paths unchanged).

The existing compaction-side CONTENT pin in HoodieSparkLanceReader remains and is exercised end-to-end by the MOR compaction test.

Impact

User-facing behavior change for Lance INLINE blobs:

SELECT * now returns data=NULL plus a synthesized reference by default (was data=bytes, reference=NULL).
read_blob(col) on INLINE rows now throws under the default mode. Callers must either:
- Set hoodie.read.blob.inline.mode=CONTENT and use read_blob(col) as before (1-hop), or
- Read col.data directly under CONTENT mode.

The error message names both INLINE and DESCRIPTOR and tells the user how to fix it, so the failure is actionable rather than mysterious.

Parquet is unaffected. The Parquet reader does not currently honor hoodie.read.blob.inline.mode, so on Parquet tables the config is ignored and INLINE rows always come back in CONTENT shape regardless of the setting. read_blob() never sees inline_data=NULL on Parquet, so the new throw is unreachable there.

OUT_OF_LINE is unaffected under any mode. The mode setting governs INLINE reads only; OUT_OF_LINE descriptors are real user metadata and read_blob() always materializes via external pread.

Why erroring is better than the 2-hop fallback we initially shipped: The synthesized reference for INLINE+DESCRIPTOR is a pointer into the .lance file's storage layout — an implementation detail, not user-facing storage. Silently doing the 2-hop materialization invited users to write queries that projected col.reference.offset alongside read_blob(col) and treat the Lance-internal offset as durable metadata. The error aligns the mode's behavior with its name: DESCRIPTOR is metadata-only.

Risk Level

Medium. The default flip plus the new throw is a behavior change for any Lance caller that uses read_blob() on INLINE blobs. Mitigations:

Restore the prior behavior by setting hoodie.read.blob.inline.mode=CONTENT. This is also the right setting for any byte-heavy workload (image processing, ML feature extraction, full-table exports).
The error message is actionable and names the config to flip.
Compaction-side reader is hard-pinned to CONTENT and the MOR compaction test exercises the round-trip end-to-end.
The new mixed-storage test pins behavior of tables that hold both INLINE and OUT_OF_LINE rows in a single column.

Documentation Update

HoodieReaderConfig.BLOB_INLINE_READ_MODE description updated.
ScalarFunctions.read_blob DESCRIBE FUNCTION text gains a caveat block about the INLINE+DESCRIPTOR throw.
RFC-100 visual section, behavior matrix, and prose refreshed to reflect the new contract.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the RFC update! This flips the default of hoodie.read.blob.inline.mode from CONTENT to DESCRIPTOR so Lance INLINE blobs are no longer materialized on every plain read, with a BatchedBlobReader fallback so read_blob() still works under either shape. A few design-level points could use more attention — see inline comments. Once those are addressed, a Hudi committer or PMC member can take it from here for a deeper review.

hudi-agent · 2026-05-15T09:16:24Z

+- For DESCRIPTOR + Lance, hop 2 is a raw filesystem `pread` against the `.lance` file at the descriptor's `(offset, length)`. It bypasses Lance's decoder entirely — the blob encoding (`lance-encoding:blob=true` on a `LargeBinary` column) stores blob bytes contiguously at the position Lance reports, so direct byte access is safe.
+- Plain `SELECT col` (no `read_blob`) is always 1 hop. DESCRIPTOR's win is that hop 1 skips blob decoding when bytes aren't needed.
+
 ### 3. Writer


🤖 The RFC notes the DESCRIPTOR + Lance hop-2 read "bypasses Lance's decoder entirely" via raw pread at the recorded offset/length. Is there a stability contract from Lance that the (offset, length) reported on a blob-encoded LargeBinary column is guaranteed to be the on-disk byte range across Lance versions? If Lance ever switches blob storage to a chunked/compressed layout, raw pread would silently return garbage. Worth either citing the Lance contract here or noting the version pin / detection strategy.

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-agent · 2026-05-15T09:16:24Z

@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that already exposes a `(file

 **Visual**


🤖 It would help to add a short "Alternatives considered" note for the default flip — e.g., gating DESCRIPTOR per-table/per-query vs flipping the global default, or making DESCRIPTOR the default only when the engine signals projection without read_blob(). The current text presents the flip as the only option; documenting why a global flip is preferable would strengthen the RFC.

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

Flip the default of hoodie.read.blob.inline.mode from CONTENT to DESCRIPTOR so plain reads (e.g. SELECT *) on Lance tables stop paying the I/O cost of materializing inline blob payloads. CONTENT remains available as an opt-in for callers that need the raw bytes in `data`. The compactor's CONTENT pin is already in place in HoodieSparkLanceReader, so compaction continues to read bytes regardless of the user-facing default and round-trips them through the base-file rewrite. Also fix BatchedBlobReader: under DESCRIPTOR mode, INLINE rows arrive with null `inline_data` and a synthesized reference. The previous dispatch short-circuited on storage_type=INLINE and read the null field, breaking read_blob(). The INLINE branch now falls back to the descriptor's positional range read when inline_data is absent, so read_blob() works in both modes. Tests: - testBlobInlineRoundTrip now opts into CONTENT explicitly (it exists specifically to validate CONTENT semantics). - New testBlobInlineCompactionRoundTrip verifies INLINE bytes survive MOR compaction read/rewrite under the new default, asserted via the realistic user paths (plain read + read_blob()).

Replace the ASCII visual with two mermaid diagrams: a storage-rooted dispatch tree showing what the user gets back for SELECT col vs SELECT read_blob(col) across modes and file formats, and a hop-count view of read_blob() byte resolution that makes the CONTENT vs DESCRIPTOR + Lance trade-off explicit.

…o-op - Swap CONTENT/DESCRIPTOR default labels in the mermaid diagrams and surrounding prose so the doc matches HoodieReaderConfig's new default. - Make the matrix's Parquet + DESCRIPTOR row reflect current behavior (config is ignored on Parquet today; rows come back in CONTENT shape) with a footnote, and split the eventual spec'd behavior into its own paragraph marked not-yet-implemented. Mermaid Parquet branches updated to match.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the RFC update! The default flip plus the BatchedBlobReader fallback and new compaction round-trip test together make a coherent story, and the mermaid diagrams are a nice readability win. A few architectural questions remain around lifecycle of the synthetic descriptor and an undocumented invariant in the compaction reader path that I think are worth surfacing in the RFC itself before a Hudi committer or PMC member takes it from here.

hudi-agent · 2026-05-15T11:40:41Z

+    F -->|Lance| LD["type = INLINE<br/>inline_data = NULL<br/>reference = synthetic managed<br/>path, offset, length, is_managed=true"]
+    F -->|"Parquet (today: mode no-op)"| PD["Parquet reader does not<br/>implement DESCRIPTOR yet —<br/>returns CONTENT shape:<br/>inline_data = bytes, reference = NULL"]
+
+    QI -->|"SELECT read_blob(col)"| RM{hoodie.read.blob.inline.mode}


🤖 The Lance branch shows reference = synthetic managed path, offset, length, is_managed=true. Could you spell out what is_managed=true means semantically and how it differs from a user-supplied OUT_OF_LINE reference downstream — e.g., does it gate cleaning, does read_blob() route differently, can a writer round-trip a row carrying this flag back into the table? This flag appears to be the only signal distinguishing "safe to clean — Hudi owns the bytes" from "do not clean — user owns the bytes", which is load-bearing.

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

The INLINE-descriptor fallback in BatchedBlobReader (the branch that preads bytes via the synthesized reference when inline_data is null) is load-bearing for mixed-usage queries. If read_blob() force-flipped the scan back to CONTENT shape, reference.* would all be null in the same projection. Fold a single co-projection SELECT into the existing testBlobInlineDescriptorMode fixture to lock in this contract without adding a new test method.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the iteration! The descriptor-fallback path in BatchedBlobReader, the MOR compaction round-trip test, and the clarified Parquet no-op documentation address the substantive concerns. Prior rounds have already raised the key design points (compaction CONTENT pin, mid-batch require behavior, synthetic descriptor lifecycle, Parquet semantics). A Hudi committer or PMC member can take it from here for the final design call.

cc @yihua

DESCRIPTOR is a metadata-only mode for INLINE rows: bytes are not materialized during the scan and the synthesized reference is an internal pointer into the .lance file's storage layout, not user-facing metadata. The prior 2-hop pread fallback in BatchedBlobReader silently materialized bytes anyway, conflating the implementation detail with durable storage info — e.g. SELECT read_blob(col), col.reference.offset would return a Lance-internal offset that users could mistake for a real descriptor. Replace the fallback with a runtime IllegalStateException naming both INLINE and DESCRIPTOR so the failure is actionable: set CONTENT mode or read col.data directly. The check is per-row in BatchedBlobReader, so OUT_OF_LINE under any mode and Parquet (where DESCRIPTOR is a no-op today) are unaffected. Tests updated accordingly: testBlobInlineDescriptorMode now asserts the throw and exercises a CONTENT-mode roundtrip alongside, and a new testBlobMixedInlineAndOutOfLineContentMode pins the mixed-storage behavior on Lance.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the iteration! The RFC and code changes now consistently describe DESCRIPTOR as a metadata-only mode for INLINE rows, with read_blob() throwing an actionable error on Lance+INLINE+DESCRIPTOR, and the diagrams + behavior matrix line up with the implementation. No new critical correctness concerns from this round beyond what prior reviews already flagged — a Hudi committer or PMC member can take it from here for the final design call.

cc @yihua

…n the default testBlobInlineRoundTrip and testBlobInlineCompactionRoundTrip each had a second `spark.read.format("hudi").load(...)` that did not set the inline mode and relied on the prior default returning bytes. With the default flipped to DESCRIPTOR and the 2-hop fallback retired, read_blob() on INLINE rows now throws there. Set hoodie.read.blob.inline.mode=CONTENT on those loads and refresh the surrounding comments.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the iteration! The mid-batch throw is well-contained, the new MOR compaction round-trip test directly exercises the HoodieSparkLanceReader CONTENT pin against the data-corruption scenario, and the CONTENT-mode roundtrip plus mixed-storage tests pin write→read integrity. Remaining open threads from prior rounds are design-level rather than critical-correctness; a Hudi committer or PMC member can take it from here for the final design call.

cc @yihua

rahil-c · 2026-05-18T18:36:16Z

+                    s"hoodie.read.blob.inline.mode=DESCRIPTOR, INLINE blobs are returned " +
+                    s"as metadata-only (inline_data=NULL, synthesized reference). To read " +
+                    s"bytes, set hoodie.read.blob.inline.mode=CONTENT, or project " +
+                    s"col.data / col.reference.* directly instead of calling read_blob().")


Based on today discussion i think we want to avoid the user to directly call col.data and instead they should be using read_blob().

I can fix the wording

rahil-c · 2026-05-18T18:36:31Z

          |
+          |Caveat:
+          |  Throws on INLINE rows under hoodie.read.blob.inline.mode=DESCRIPTOR.
+          |  Set CONTENT mode or read col.data directly to materialize INLINE bytes.


same as above comment

rahil-c · 2026-05-18T19:08:57Z

Thanks for the comprehensive testing @voonhous. Can you add tests for the cases @yihua was showing today regarding doing a select * from a table has multiple blob cols, like blob_A, blob_B.

select read_blob(blob_col1), blob_col2 from table mode = CONTENT, return content both columns mode = DESCRIPTOR (default), throw error select blob_col1, blob_col2 from table mode = CONTENT, return content both columns mode = DESCRIPTOR (default), return discriptor select read_blob(blob_col1), read_blob(blob_col2) from table return content for columns

Essentially we want to ensure that when a user does select read_blob(blob_A), blob_B from table, with default as DESCRIPTOR that this would throw error as we have a mixed case here for bytes and pointers.

rahil-c · 2026-05-19T16:20:35Z

+   * originals.
+   */
+  @Test
+  def testBlobInlineCompactionRoundTrip(): Unit = {


Ensure that log files are created

yihua · 2026-05-20T00:35:27Z

+                  s"read_blob() cannot materialize bytes for an INLINE blob under " +
+                    s"DESCRIPTOR mode (row $rowIndex). Under " +
+                    s"hoodie.read.blob.inline.mode=DESCRIPTOR, INLINE blobs will not return bytes via the 'data' subfield" +
+                    " and instead returns a 'reference' subfield containing metadata only." +


Suggested change

" and instead returns a 'reference' subfield containing metadata only." +

" and instead returns a 'reference' subfield containing metadata only. " +

yihua · 2026-05-20T00:36:05Z

+                  s"read_blob() cannot materialize bytes for an INLINE blob under " +
+                    s"DESCRIPTOR mode (row $rowIndex). Under " +


Based on our discussion, read_blob() should disregard the inline blob read mode, correct?

meaning that read_blob() should always read the blob content regardless of the config.

Drop the CONTENT-mode read_blob roundtrip from testBlobInlineDescriptorMode (already covered by testBlobInlineRoundTrip across CoW and MoR). Trim testBlobInlineMultipleColumnsPlainSelect to per-column-independence assertions only; the per-column INLINE shape under CONTENT and DESCRIPTOR is already pinned by testBlobInlineRoundTrip and testBlobInlineDescriptorMode. Trim testBlobInlineMultipleColumnsReadBlobAll to a CONTENT-only smoke; the DESCRIPTOR throw path is already pinned by testBlobInlineDescriptorMode and testBlobInlineMultipleColumnsMixedSelect. Net: 28 added, 82 removed.

codecov-commenter · 2026-05-20T02:53:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.23%. Comparing base (071b3f1) to head (990d77a).
⚠️ Report is 31 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18744      +/-   ##
============================================
+ Coverage     68.14%   68.23%   +0.09%     
- Complexity    29094    29295     +201     
============================================
  Files          2517     2525       +8     
  Lines        141113   141740     +627     
  Branches      17508    17615     +107     
============================================
+ Hits          96160    96716     +556     
- Misses        37046    37058      +12     
- Partials       7907     7966      +59

Flag	Coverage Δ
common-and-other-modules	`44.36% <6.25%> (-0.04%)`	⬇️
hadoop-mr-java-client	`44.96% <100.00%> (-0.05%)`	⬇️
spark-client-hadoop-common	`48.27% <100.00%> (-0.06%)`	⬇️
spark-java-tests	`48.88% <100.00%> (-0.10%)`	⬇️
spark-scala-tests	`44.95% <62.50%> (+0.03%)`	⬆️
utilities	`37.47% <6.25%> (-0.17%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
.../apache/hudi/common/config/HoodieReaderConfig.java	`100.00% <100.00%> (ø)`
...apache/spark/sql/hudi/blob/BatchedBlobReader.scala	`85.21% <100.00%> (+0.73%)`	⬆️
...g/apache/spark/sql/hudi/blob/ScalarFunctions.scala	`92.30% <ø> (ø)`

... and 58 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

hudi-bot · 2026-05-20T03:21:16Z

CI report:

990d77a Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

voonhous added this to the release-1.2.0 milestone May 15, 2026

voonhous requested review from rahil-c and yihua May 15, 2026 09:13

hudi-agent reviewed May 15, 2026

View reviewed changes

github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 15, 2026

voonhous added 3 commits May 15, 2026 19:34

voonhous force-pushed the address-#18742 branch from e6b65ff to a18b113 Compare May 15, 2026 11:34

hudi-agent reviewed May 15, 2026

View reviewed changes

github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 16, 2026

hudi-agent reviewed May 16, 2026

View reviewed changes

rahil-c reviewed May 18, 2026

View reviewed changes

rahil-c reviewed May 19, 2026

View reviewed changes

rahil-c and others added 2 commits May 19, 2026 15:11

address my comments on voon pr

ee77abb

Update documentation for INLINE BLOB read mode

d2bace8

yihua reviewed May 20, 2026

View reviewed changes

		@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that already exposes a `(file

		Visual

	" and instead returns a 'reference' subfield containing metadata only." +
	" and instead returns a 'reference' subfield containing metadata only. " +

		s"read_blob() cannot materialize bytes for an INLINE blob under " +
		s"DESCRIPTOR mode (row $rowIndex). Under " +

Conversation

voonhous commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented May 20, 2026

Codecov Report

Uh oh!

hudi-bot commented May 20, 2026

CI report:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

voonhous commented May 15, 2026 •

edited

Loading