Skip to content

feat: [HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics#18761

Open
yaoliclshlmch wants to merge 1 commit into
apache:masterfrom
yaoliclshlmch:yaoli/18735_add_metrics_for_cache_hit_ratio
Open

feat: [HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics#18761
yaoliclshlmch wants to merge 1 commit into
apache:masterfrom
yaoliclshlmch:yaoli/18735_add_metrics_for_cache_hit_ratio

Conversation

@yaoliclshlmch
Copy link
Copy Markdown

@yaoliclshlmch yaoliclshlmch commented May 17, 2026

Describe the issue this Pull Request addresses

The Flink bucket-assign operator on the global record-level index (GRLI) path keeps an in-memory RecordIndexCache to serve record-key lookups, falling back to the metadata table on a miss. Today the cache is a black box at runtime — operators have no visibility into how effective it is, which makes it impossible to alert on a degraded hit ratio (e.g. due to undersized cache, eviction churn, or key-skew patterns) or to tune INDEX_RLI_CACHE_SIZE based on real workload behavior.
This PR adds two counter metrics so the in-memory cache hit / miss volume becomes observable and alertable.

Summary and Changelog

Adds two Flink counter metrics on the bucket-assign operator's MetricGroup, registered automatically whenever the GRLI backend is in use:

  • bucketassign.minibatch.cache.hit.count — record-key lookups served by the in-memory RecordIndexCache.
  • bucketassign.minibatch.cache.miss.count — record-key lookups that fell back to a metadata-table read.
    Changelog:
  • New org.apache.hudi.metrics.FlinkBucketAssignMetrics (mirrors the existing FlinkRocksDBIndexMetrics pattern):
    • Owns two SimpleCounter instances and registers them under the supplied MetricGroup.
    • Exposes markCacheHit() / markCacheHit(long) and markCacheMiss() / markCacheMiss(long) helpers; the (long) overloads short-circuit on n <= 0 so a single call site can drain a whole batch's hit/miss counts in one shot.
  • org.apache.hudi.sink.partitioner.index.GlobalRecordLevelIndexBackend:
    • Overrides registerMetrics(MetricGroup) to instantiate FlinkBucketAssignMetrics (re-registration guarded, matching RocksDBIndexBackend).
    • In get(List<String> recordKeys) — which the single-key get(String) overload also delegates to — bumps the hit/miss counters by recordKeys.size() - missedKeys.size() and missedKeys.size() after the existing miss-collection loop. The increment is null-guarded so call sites that never call registerMetrics (e.g. unit-test harnesses) are unaffected.
  • New tests TestFlinkBucketAssignMetrics and three additional cases in TestGlobalRecordLevelIndexBackend.
    No code copied from other projects.

Impact

  • User-facing: Two new Flink counter metrics are exposed under the bucket-assign operator's metric group. Existing dashboards/alerts are unchanged; the new metrics are purely additive.
  • Public API: None. No new configuration keys, no behavior changes to bucket assignment, cache eviction, or metadata-table reads. The IndexBackend#registerMetrics extension point already existed.
  • Performance: Negligible. The hot path adds one subtraction and at most two Counter.inc(long) calls per batch lookup, after the existing miss-collection loop.

Risk Level

low
The change is observability-only: cache lookup semantics are unchanged, no new public APIs, no new config keys, and the increment path is a constant-time pair of counter bumps after the existing miss-collection loop. Behavior in tests that don't register metrics is preserved by the null-guard.
Verification: 46 tests across the bucket-assign + metrics surface pass with -Pspark3.3,flink1.18 (TestFlinkBucketAssignMetrics, TestGlobalRecordLevelIndexBackend, TestMinibatchBucketAssignFunction, TestBucketAssigner, TestRecordIndexCache, TestRocksDBIndexBackend, TestFlinkCompactionMetrics).

Documentation Update

none — no new config or user-facing feature beyond two additional metrics emitted by the existing operator. The metric names are self-descriptive and follow the same convention as FlinkRocksDBIndexMetrics.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label May 17, 2026
@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.16%. Comparing base (b934633) to head (508d5d8).

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18761   +/-   ##
=========================================
  Coverage     68.15%   68.16%           
- Complexity    29148    29163   +15     
=========================================
  Files          2521     2522    +1     
  Lines        141353   141380   +27     
  Branches      17549    17550    +1     
=========================================
+ Hits          96343    96372   +29     
- Misses        37076    37078    +2     
+ Partials       7934     7930    -4     
Flag Coverage Δ
common-and-other-modules 44.45% <100.00%> (+0.01%) ⬆️
hadoop-mr-java-client 44.98% <ø> (-0.01%) ⬇️
spark-client-hadoop-common 48.31% <ø> (-0.01%) ⬇️
spark-java-tests 48.92% <ø> (-0.01%) ⬇️
spark-scala-tests 44.86% <ø> (+<0.01%) ⬆️
utilities 37.57% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../apache/hudi/metrics/FlinkBucketAssignMetrics.java 100.00% <100.00%> (ø)
...rtitioner/index/GlobalRecordLevelIndexBackend.java 89.83% <100.00%> (+1.59%) ⬆️

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@yaoliclshlmch yaoliclshlmch changed the title [HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics feat: [HUDI-18735][FLINK] Add bucketassign minibatch cache hit/miss metrics May 17, 2026
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds two simple counter metrics (bucketassign.minibatch.cache.hit.count / miss.count) to the GRLI backend by mirroring the existing RocksDBIndexBackend / FlinkRocksDBIndexMetrics pattern, with a re-registration guard and a null-guard at the call site. The hit/miss arithmetic (recordKeys.size() - missedKeys.size() vs missedKeys.size()) lines up with how the existing miss-collection loop builds missedKeys, and the test coverage exercises both the single-key and batch paths plus the unregistered-metrics path. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small naming and delegation nits below, but overall the code is clean and well-structured.

cc @yihua

}

public void markCacheHit() {
cacheHitCount.inc();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: could you have the no-arg markCacheHit() delegate to markCacheHit(1L) (and same for markCacheMiss())? Right now the two overloads are independent, so if any logic is ever added to the bulk path the single-increment path would silently miss it.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants