feat: [HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics by yaoliclshlmch · Pull Request #18761 · apache/hudi

yaoliclshlmch · 2026-05-17T07:38:15Z

Describe the issue this Pull Request addresses

The Flink bucket-assign operator on the global record-level index (GRLI) path keeps an in-memory RecordIndexCache to serve record-key lookups, falling back to the metadata table on a miss. Today the cache is a black box at runtime — operators have no visibility into how effective it is, which makes it impossible to alert on a degraded hit ratio (e.g. due to undersized cache, eviction churn, or key-skew patterns) or to tune INDEX_RLI_CACHE_SIZE based on real workload behavior.
This PR adds two counter metrics so the in-memory cache hit / miss volume becomes observable and alertable.

Summary and Changelog

Adds two Flink counter metrics on the bucket-assign operator's MetricGroup, registered automatically whenever the GRLI backend is in use:

bucketassign.minibatch.cache.hit.count — record-key lookups served by the in-memory RecordIndexCache.
bucketassign.minibatch.cache.miss.count — record-key lookups that fell back to a metadata-table read.
Changelog:
New org.apache.hudi.metrics.FlinkBucketAssignMetrics (mirrors the existing FlinkRocksDBIndexMetrics pattern):
- Owns two SimpleCounter instances and registers them under the supplied MetricGroup.
- Exposes markCacheHit() / markCacheHit(long) and markCacheMiss() / markCacheMiss(long) helpers; the (long) overloads short-circuit on n <= 0 so a single call site can drain a whole batch's hit/miss counts in one shot.
org.apache.hudi.sink.partitioner.index.GlobalRecordLevelIndexBackend:
- Overrides registerMetrics(MetricGroup) to instantiate FlinkBucketAssignMetrics (re-registration guarded, matching RocksDBIndexBackend).
- In get(List<String> recordKeys) — which the single-key get(String) overload also delegates to — bumps the hit/miss counters by recordKeys.size() - missedKeys.size() and missedKeys.size() after the existing miss-collection loop. The increment is null-guarded so call sites that never call registerMetrics (e.g. unit-test harnesses) are unaffected.
New tests TestFlinkBucketAssignMetrics and three additional cases in TestGlobalRecordLevelIndexBackend.
No code copied from other projects.

Impact

User-facing: Two new Flink counter metrics are exposed under the bucket-assign operator's metric group. Existing dashboards/alerts are unchanged; the new metrics are purely additive.
Public API: None. No new configuration keys, no behavior changes to bucket assignment, cache eviction, or metadata-table reads. The IndexBackend#registerMetrics extension point already existed.
Performance: Negligible. The hot path adds one subtraction and at most two Counter.inc(long) calls per batch lookup, after the existing miss-collection loop.

Risk Level

low
The change is observability-only: cache lookup semantics are unchanged, no new public APIs, no new config keys, and the increment path is a constant-time pair of counter bumps after the existing miss-collection loop. Behavior in tests that don't register metrics is preserved by the null-guard.
Verification: 46 tests across the bucket-assign + metrics surface pass with -Pspark3.3,flink1.18 (TestFlinkBucketAssignMetrics, TestGlobalRecordLevelIndexBackend, TestMinibatchBucketAssignFunction, TestBucketAssigner, TestRecordIndexCache, TestRocksDBIndexBackend, TestFlinkCompactionMetrics).

Documentation Update

none — no new config or user-facing feature beyond two additional metrics emitted by the existing operator. The metric names are self-descriptive and follow the same convention as FlinkRocksDBIndexMetrics.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

hudi-bot · 2026-05-17T09:10:54Z

CI report:

508d5d8 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

codecov-commenter · 2026-05-17T14:39:36Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.16%. Comparing base (b934633) to head (508d5d8).

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18761   +/-   ##
=========================================
  Coverage     68.15%   68.16%           
- Complexity    29148    29163   +15     
=========================================
  Files          2521     2522    +1     
  Lines        141353   141380   +27     
  Branches      17549    17550    +1     
=========================================
+ Hits          96343    96372   +29     
- Misses        37076    37078    +2     
+ Partials       7934     7930    -4

Flag	Coverage Δ
common-and-other-modules	`44.45% <100.00%> (+0.01%)`	⬆️
hadoop-mr-java-client	`44.98% <ø> (-0.01%)`	⬇️
spark-client-hadoop-common	`48.31% <ø> (-0.01%)`	⬇️
spark-java-tests	`48.92% <ø> (-0.01%)`	⬇️
spark-scala-tests	`44.86% <ø> (+<0.01%)`	⬆️
utilities	`37.57% <ø> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
.../apache/hudi/metrics/FlinkBucketAssignMetrics.java	`100.00% <100.00%> (ø)`
...rtitioner/index/GlobalRecordLevelIndexBackend.java	`89.83% <100.00%> (+1.59%)`	⬆️

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds two simple counter metrics (bucketassign.minibatch.cache.hit.count / miss.count) to the GRLI backend by mirroring the existing RocksDBIndexBackend / FlinkRocksDBIndexMetrics pattern, with a re-registration guard and a null-guard at the call site. The hit/miss arithmetic (recordKeys.size() - missedKeys.size() vs missedKeys.size()) lines up with how the existing miss-collection loop builds missedKeys, and the test coverage exercises both the single-key and batch paths plus the unregistered-metrics path. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small naming and delegation nits below, but overall the code is clean and well-structured.

cc @yihua

hudi-agent · 2026-05-18T22:03:14Z

+  }
+
+  public void markCacheHit() {
+    cacheHitCount.inc();


🤖 nit: could you have the no-arg markCacheHit() delegate to markCacheHit(1L) (and same for markCacheMiss())? Right now the two overloads are independent, so if any logic is ever added to the bulk path the single-increment path would silently miss it.

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

add metrics for bucketassign.minibatch cache hit ratio

508d5d8

github-actions Bot added the size:L PR with lines of changes in (300, 1000] label May 17, 2026

yaoliclshlmch changed the title ~~[HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics~~ feat: [HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics May 17, 2026

hudi-agent reviewed May 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: [HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics#18761

feat: [HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics#18761
yaoliclshlmch wants to merge 1 commit into
apache:masterfrom
yaoliclshlmch:yaoli/18735_add_metrics_for_cache_hit_ratio

yaoliclshlmch commented May 17, 2026 •

edited

Loading

Uh oh!

hudi-bot commented May 17, 2026

Uh oh!

codecov-commenter commented May 17, 2026

Uh oh!

hudi-agent left a comment

Uh oh!

hudi-agent May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

yaoliclshlmch commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

hudi-bot commented May 17, 2026

CI report:

Uh oh!

codecov-commenter commented May 17, 2026

Codecov Report

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

hudi-agent May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yaoliclshlmch commented May 17, 2026 •

edited

Loading