Skip to content

[HUDI-18496] Respect configured record type when filtering index keys#18715

Open
201573 wants to merge 1 commit into
apache:masterfrom
201573:hudi-18496-index-reader-record-type
Open

[HUDI-18496] Respect configured record type when filtering index keys#18715
201573 wants to merge 1 commit into
apache:masterfrom
201573:hudi-18496-index-reader-record-type

Conversation

@201573
Copy link
Copy Markdown

@201573 201573 commented May 10, 2026

Describe the issue this Pull Request addresses

Closes #18496

Summary and Changelog

  • Adds file-format record type resolution so formats that require Spark readers can override the configured fallback.
  • Threads the configured record merger type into index key filtering paths instead of always using AVRO readers.
  • Adds unit coverage for record type resolution by file format and extension.

Impact

Keeps index key filtering consistent with the configured record merger type while preserving the existing AVRO fallback for existing callers.

Risk level

Low. The previous filterKeysFromFile signature still defaults to AVRO, and format-specific overrides are limited to the file format resolution helper.

Test Plan

  • git diff --check
  • mvn -pl hudi-common -DskipITs -Dcheckstyle.skip -DfailIfNoTests=false -Dtest=TestHoodieFileFormat test
  • mvn -pl hudi-client/hudi-client-common,hudi-client/hudi-spark-client -am -DskipTests -DskipITs compile

@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 10, 2026
@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@201573
Copy link
Copy Markdown
Author

201573 commented May 12, 2026

Gentle ping for review when convenient. Azure CI is green on the latest update.

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR threads the configured record merger type through to index key filtering so Lance base files can use a Spark reader, while keeping AVRO as the default fallback. One potential regression worth double-checking in the inline comments around ORC/HFile base files when a Spark record merger is configured. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.

}

public HoodieRecord.HoodieRecordType resolveRecordType(HoodieRecord.HoodieRecordType fallback) {
return requiresSparkRecordType() ? HoodieRecord.HoodieRecordType.SPARK : fallback;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Could this introduce a regression for .orc (and .hfile) base files? Previously filterKeysFromFile always used the AVRO reader factory (which supports Parquet/ORC/HFile via HoodieAvroFileReaderFactory). After this change, when config.getRecordMerger().getRecordType() returns SPARK, resolveRecordTypeForExtension(".orc", SPARK) returns SPARK — and HoodieSparkFileReaderFactory.newOrcFileReader throws HoodieIOException("Not support read orc file") (same for HFile). It might be safer to special-case Lance only and keep returning AVRO for the others, or override the fallback only when requiresSparkRecordType() is true (which it already does — the concern is whether getRecordMerger().getRecordType() being SPARK + ORC base file was previously working via the AVRO path). @yihua could you confirm whether ORC bloom-index filtering with a Spark merger was previously functional?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HoodieIndexUtils.filterKeysFromFile hardcodes AVRO fallback, ignoring configured record merger type

3 participants