Dynamic Bucket for Flink streaming with Partitioned RLI #18514

cshuo · 2026-04-17T09:21:35Z

cshuo
Apr 17, 2026
Collaborator

Background

Hudi currently provides multiple bucket-based indexing options, but all have practical limitations for continuously growing production workloads.

Simple bucket index

The main limitation of the simple bucket index is weak rescaling capability: the number of buckets is fixed once configured. As data keeps growing over time, each bucket accumulates more and more records, which can eventually hurt query performance.

Partition-level bucket index

The partition-level bucket index improves flexibility by allowing different partitions to use different bucket numbers. However, once configured for a partition, the bucket count is still not dynamically scalable online, and can only be rescaled through an offline rewrite process.

Consistent bucket index

The consistent bucket index supports bucket split and merge, but it also has limitations: 1) The overall bucket resize lifecycle is coupled with clustering; 2) Before clustering completes, writes during the transition rely on dual-write semantics, which introduces extra write overhead and affects write performance. For Flink streaming workloads, this model is too heavy to use.

This proposal takes a different direction:

Reuse Hudi's partitioned RLI as the source of truth for bucket assigning.
Dynamic bucket growth only affects new keys. Old keys always go to their original bucket, so original buckets never need to be split or merged.

With this approach, dynamic bucket growth can be performed online during streaming ingestion. It is also lightweight, since it is not coupled to any background table service and needs no heavy dual-write.

Goals

Support dynamic bucket assigning built on top of partitioned RLI.
Keep bucket assignment immutable once a record key is assigned to avoid historical data relocation for bucket growth.
Support lazy bootstrap of key -> bucket cache from partitioned RLI.
Keep memory usage bounded through partition-granularity cache lifecycle management.
Reuse existing MDT / RLI infrastructure as much as possible.

Non Goals

Introducing a new hash-based or consistent-hashing bucket index.
Rebalancing historical keys after bucket growth.
Solving hot-key skew caused by a small set of existing keys.
Multiple writers scenario.

The Design

The high-level ideas:

Use partitioned RLI as the persistent backend for dynamic bucket assigning.
Set the initial bucket/file-group count to the number of bucket assigners.
Maintain a per-partition mapping cache in the bucket assigner, lazily bootstrapped from partitioned RLI.
- Partition -> { recordKey -> fileId }
The memory usage of the cache is bounded and can be spilled to disk.
Support partition-granularity cache eviction after commit and inactivity
Support MDT lookup for the index data of a specific data partition

The Impl

The Initial Bucket Count

The initial bucket count is set to the number of bucket assigners.

This gives a natural initial routing space and aligns initial bucket layout with write parallelism.

The capacity of each bucket can be defined as the maximum file size of the latest file slice.
- For new keys, existing buckets are preferred to avoid the small files problem; a new bucket is created only once the capacity of all existing buckets is exceeded.
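
A minimal sketch of the capacity rule above, assuming capacity is tracked as the byte size of each bucket's latest file slice. The class name, method name, and the "first non-full bucket in iteration order" choice are all hypothetical; the proposal does not say which non-full bucket is preferred.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the capacity rule; names are illustrative, not Hudi APIs. */
class BucketCapacitySketch {

    /**
     * Returns the first existing bucket whose latest file slice is still below the cap,
     * or empty if every bucket is full and a new one has to be created.
     */
    static Optional<String> pickNonFullBucket(Map<String, Long> latestSliceBytesByFileId,
                                              long maxFileSizeBytes) {
        for (Map.Entry<String, Long> e : latestSliceBytesByFileId.entrySet()) {
            if (e.getValue() < maxFileSizeBytes) {
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }
}
```

An empty result is the signal to create a new bucket; any other selection policy (for example, smallest bucket first) would fit the same shape.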

The Bucket Assigning Strategy

The bucket id is no longer calculated from a static hash strategy. Instead, each data partition maintains its own mapping cache:

Partition -> { recordKey -> fileId}

Assigning behavior:

If the record key already exists in the mapping, use the existing bucket id/fileGroup id.
If the record key does not exist:
- Select a bucket which is not 'full', and assign the bucket to the record key.
- Create a new bucket if all the existing buckets are 'full'.
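
The assigning behavior can be sketched as follows. This is an illustration under stated assumptions, not the actual Hudi BucketAssigner: the class name, the `bucket-N` file ids, and the use of a per-bucket key count as the 'full' check are all hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the assigning behavior; not the actual Hudi BucketAssigner. */
class DynamicBucketAssignerSketch {
    private final int maxKeysPerBucket; // assumption: stand-in for the real 'full' check
    // partition -> { recordKey -> fileId }
    private final Map<String, Map<String, String>> cache = new HashMap<>();
    // partition -> { fileId -> number of keys assigned so far }
    private final Map<String, Map<String, Integer>> load = new HashMap<>();

    DynamicBucketAssignerSketch(int maxKeysPerBucket) {
        this.maxKeysPerBucket = maxKeysPerBucket;
    }

    String assign(String partition, String recordKey) {
        Map<String, String> mapping = cache.computeIfAbsent(partition, p -> new HashMap<>());
        String existing = mapping.get(recordKey);
        if (existing != null) {
            return existing; // known keys always keep their original bucket
        }
        Map<String, Integer> buckets = load.computeIfAbsent(partition, p -> new LinkedHashMap<>());
        String target = null;
        for (Map.Entry<String, Integer> e : buckets.entrySet()) {
            if (e.getValue() < maxKeysPerBucket) { // first bucket that is not 'full'
                target = e.getKey();
                break;
            }
        }
        if (target == null) {
            target = "bucket-" + buckets.size(); // all buckets full: create a new one
        }
        buckets.merge(target, 1, Integer::sum);
        mapping.put(recordKey, target);
        return target;
    }
}
```

With `maxKeysPerBucket = 2`, keys `a` and `b` land in `bucket-0`, key `c` opens `bucket-1`, and a later update for `a` still routes to `bucket-0`.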

Lazy Bootstrap for Bucket Assign Cache

The recordKey -> fileId cache is bootstrapped lazily from partitioned RLI.

Behavior:

No eager preload for all partitions
When a partition becomes active, load its routing mapping from partitioned RLI
Once loaded, serve the bucket assigning from the cache.
Set a total memory cap for the cache, and the cache will spill to disk/rocksdb if exceeding the limit.

This keeps memory proportional to active partitions instead of total table size.
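
A minimal sketch of the lazy bootstrap, assuming the partitioned RLI lookup can be modelled as a function from a data partition to its recordKey -> fileId mapping. The class and the `rliLoader` parameter are hypothetical; the memory cap and the spill to disk/RocksDB are deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: load a partition's mapping on first access, then serve from memory. */
class LazyPartitionCacheSketch {
    // Assumption: stands in for the MDT lookup of one data partition's RLI entries.
    private final Function<String, Map<String, String>> rliLoader;
    // partition -> { recordKey -> fileId }, populated only for active partitions
    private final Map<String, Map<String, String>> loaded = new HashMap<>();
    private int loadCount = 0;

    LazyPartitionCacheSketch(Function<String, Map<String, String>> rliLoader) {
        this.rliLoader = rliLoader;
    }

    /** Returns the mapping for a partition, bootstrapping it from the loader at most once. */
    Map<String, String> mappingFor(String partition) {
        return loaded.computeIfAbsent(partition, p -> {
            loadCount++;
            return new HashMap<>(rliLoader.apply(p));
        });
    }

    /** Number of bootstrap reads performed so far. */
    int loadCount() {
        return loadCount;
    }
}
```

Partitions that are never written to never trigger the loader, which is what keeps memory proportional to the active set.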

Cache Eviction

The bucket assign cache is managed at partition granularity. Each partition's bucket cache carries one flag:

lastUpdatedCheckpoint: denotes the last checkpoint interval during which the bucket cache was updated.

Eviction flow

When bucket assigner assigns bucket for a record key:

If the record key does not exist in the cache, add it to the cache and set lastUpdatedCheckpoint to the current checkpoint id.

When the bucket assigner operator receives a checkpoint complete notification:

Get the latest successful checkpoint id, latestSuccessfulCheckpoint, corresponding to the latest completed instant.
Save latestSuccessfulCheckpoint in the bucket assign operator; it is used to decide whether a bucket assign cache is evictable.
The bucket assign cache is evicted lazily:
- If the total memory usage of the cache does not exceed the limit, the bucket assign cache for a partition is not evicted even if it is inactive.
- Before creating a bucket assign cache for a new partition, if there is not enough memory, inactive caches are evicted.

The eviction strategy can avoid unbounded cache growth while keeping hot partitions resident.
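
The eviction flow can be sketched as below. The proposal does not spell out the exact evictability predicate, so this sketch assumes a partition cache is evictable once its last update is covered by the latest successful checkpoint; the class and method names are hypothetical, and the memory check is reduced to a boolean.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of partition-granularity, lazy cache eviction. */
class PartitionCacheEvictionSketch {
    // partition -> checkpoint id during which its cache was last updated
    private final Map<String, Long> lastUpdatedCheckpoint = new HashMap<>();
    private long latestSuccessfulCheckpoint = -1L;

    /** Called when a new record key is added to a partition's cache. */
    void onNewKeyAssigned(String partition, long currentCheckpointId) {
        lastUpdatedCheckpoint.put(partition, currentCheckpointId);
    }

    /** Called on a checkpoint-complete notification. */
    void onCheckpointComplete(long successfulCheckpointId) {
        latestSuccessfulCheckpoint = Math.max(latestSuccessfulCheckpoint, successfulCheckpointId);
    }

    /** Assumption: evictable once all updates are covered by a successful checkpoint. */
    boolean isEvictable(String partition) {
        Long last = lastUpdatedCheckpoint.get(partition);
        return last != null && last <= latestSuccessfulCheckpoint;
    }

    /** Lazy eviction: nothing is dropped unless memory is needed for a new partition cache. */
    List<String> evictIfNeeded(boolean memoryLimitExceeded) {
        List<String> evicted = new ArrayList<>();
        if (!memoryLimitExceeded) {
            return evicted;
        }
        Iterator<Map.Entry<String, Long>> it = lastUpdatedCheckpoint.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() <= latestSuccessfulCheckpoint) {
                evicted.add(e.getKey());
                it.remove();
            }
        }
        return evicted;
    }
}
```

A partition updated in a checkpoint that has not yet completed is never evicted, so its new key -> bucket entries are not lost before they reach the index.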

The Write of Partitioned RLI

The index metadata is stored under the partitioned RLI in MDT. Since the index write pipeline for global RLI is already supported, we can reuse that pipeline for partitioned RLI.

Index write rules:

Only insert index entries for new record keys
Existing keys never update their bucket assignment

This keeps index maintenance simple and the partitioned RLI data updated incrementally.
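
As an illustration of the two write rules, the sketch below filters a batch of assignments down to the entries that actually need to be inserted into the index. The class and method names are hypothetical and the index is modelled as a plain map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: only new record keys produce index entries. */
class IndexWriteSketch {

    /**
     * @param existing        recordKey -> fileId entries already present in the index
     * @param assignedInBatch recordKey -> fileId assignments made for the current batch
     * @return entries to insert; keys that already exist are skipped, never updated
     */
    static Map<String, String> newIndexEntries(Map<String, String> existing,
                                               Map<String, String> assignedInBatch) {
        Map<String, String> inserts = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : assignedInBatch.entrySet()) {
            if (!existing.containsKey(e.getKey())) {
                inserts.put(e.getKey(), e.getValue());
            }
        }
        return inserts;
    }
}
```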

The Lookup Path

The system should support lookup queries against MDT for the partitioned RLI data of a specific data partition.

public class HoodieBackedTableMetadata extends BaseTableMetadata {
    ...
    /**
     * Reads record locations from partitioned record-level index with a specified data partition.
     */
    public HoodiePairData<String, HoodieRecordGlobalLocation> readRecordIndexLocations(String dataTablePartition);
    ...
}

Concurrent Writers

This design currently does not work well under the concurrent writers scenario. The main risks are:

Conflicting bucket assignment for the same new key
- two writers may assign the same new key to different buckets
- this breaks the correctness of the key -> bucket mapping
Conflicting bucket creation during bucket expansion
- two writers may generate the same bucket id but bind it to different file groups
- this creates bucket ownership conflicts, similar to the problem in simple bucket index.

Because of these risks, concurrent writers are not supported without additional coordination.

Benefits

No need to move/rewrite historical data when bucket count grows.
No bucket lineage or transitional read semantics are required.
Update routing remains simple because the same key always maps to the same bucket.
Reuses Hudi's existing partitioned RLI / MDT infrastructure.
More natural for workloads where new keys continuously arrive over time.

Tradeoffs / Risks

The cache can become large for hot, high-cardinality partitions.
First access to a large partition may incur bootstrap latency.
Bucket growth only helps future new keys, not existing hot keys.

Summary

This proposal introduces a Partitioned RLI-based Dynamic Bucket Index for Hudi.

The key idea is to use partitioned RLI as the persistent routing backend for a stable per-key bucket assignment:

initial bucket count is set by bucket assigner parallelism
the cache is bootstrapped lazily per partition, and is evicted gradually to avoid OOM.
only index entries for new keys are written into RLI

In short, this design treats dynamic bucket routing as an explicit metadata indexing problem, using Hudi's own partitioned RLI as the source of truth.

danny0405 · 2026-04-20T01:51:49Z

danny0405
Apr 20, 2026
Collaborator

overall looks good, can you clarify these items:

The small file profile for assigning new keys to existing buckets: there are two metrics, the row count and the file size (file group/base file). Let's decide which one we want here; we also need a way to calculate or estimate the values.
The read of partitioned RLI for a specific partition: is there any read amplification? For example, are the partition's index mappings scattered among multiple buckets, or stored together with other partitions within one RLI bucket?


cshuo · 2026-04-20T07:10:31Z

cshuo
Apr 20, 2026
Collaborator Author

The small file profile for assigning new keys to existing buckets: there are two metrics, the row count and the file size (file group/base file). Let's decide which one we want here; we also need a way to calculate or estimate the values.

Currently, we already have BucketAssigner for assigning buckets based on small file profiling, which calculates the target maximum row count for each bucket as parquetMaxFileSize / avgRecordSize. For the first version of the dynamic bucket index, I think we can reuse the same profiling logic.
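
As a worked example of that formula (a sketch only; the class and method names are hypothetical and the sizes are illustrative values, not measured ones):

```java
/** Hypothetical sketch of the row-count target derived from file size profiling. */
class SmallFileProfileSketch {

    /** Target maximum row count per bucket: parquetMaxFileSize / avgRecordSize. */
    static long maxRecordsPerBucket(long parquetMaxFileSizeBytes, long avgRecordSizeBytes) {
        if (avgRecordSizeBytes <= 0) {
            throw new IllegalArgumentException("avgRecordSizeBytes must be positive");
        }
        return parquetMaxFileSizeBytes / avgRecordSizeBytes;
    }
}
```

For a 120 MiB file size cap and an average record size of 1 KiB, this gives 125829120 / 1024 = 122880 records per bucket.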

The read of partitioned RLI for a specific partition: is there any read amplification? For example, are the partition's index mappings scattered among multiple buckets, or stored together with other partitions within one RLI bucket?

For partitioned RLI, mappings are organized by data partition. They are not mixed with mappings from other data partitions, so there is no cross-partition read amplification. Concretely, the file group ID naming for partitioned RLI is: record-index-<escapedDataPartitionName>-<4-digit fileGroupIndex>-0.

Each data partition owns its own partitioned-RLI file group set.
By default, one data partition uses 1 RLI file group, and can be configured with a larger value if necessary.
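
To make the naming pattern concrete, here is a small sketch that formats a file group ID from an already-escaped partition name. The class and method are hypothetical, and how Hudi escapes the partition name is not covered here.

```java
import java.util.Locale;

/** Hypothetical sketch of the partitioned RLI file group ID pattern quoted above. */
class PartitionedRliFileGroupIdSketch {

    /** Builds record-index-<escapedDataPartitionName>-<4-digit fileGroupIndex>-0. */
    static String fileGroupId(String escapedDataPartitionName, int fileGroupIndex) {
        return String.format(Locale.ROOT, "record-index-%s-%04d-0",
                escapedDataPartitionName, fileGroupIndex);
    }
}
```

For a partition whose escaped name is `2026-04-17` and file group index 0, this yields `record-index-2026-04-17-0000-0`.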


nsivabalan · 2026-04-27T21:41:03Z

nsivabalan
Apr 27, 2026
Collaborator

Thanks @cshuo for the detailed writeup — the problem statement is clear and the motivation around limitations of existing bucket indexes is well articulated.

I had a question about the design choice that I'd like to understand better.

The core of this proposal is: use partitioned RLI as the persistent key → bucket mapping, lazily load it into an in-memory cache, and look up every key against that cache for routing. The bucket assignment is immutable once written.

But if we're already paying the cost of maintaining partitioned RLI and doing per-key lookups against it, I'm wondering — what does the bucket index abstraction add on top of just using partitioned RLI directly?

Consider the standard write path with partitioned RLI (no bucket index):

Key lookup: RLI tells you which file group a key belongs to → route the record there. Same as this proposal.
Small file handling: The existing BucketAssigner / WriteProfile infrastructure already profiles file sizes, routes new inserts to small files first, and creates new file groups only when existing ones are full. This is essentially the same "select a non-full bucket, create a new one if all are full" logic described here.
Lazy bootstrap + cache eviction: Same approach would apply — load a partition's RLI mappings on demand, evict when idle.

The main difference I see is that this proposal makes bucket assignment immutable forever, which is presented as a benefit (no data relocation). But this also means:

You can never rebalance skewed file groups
Clustering cannot freely reorganize file layout — it's constrained by the fixed key-to-bucket mapping
If early assignments turn out suboptimal, you're stuck with them

With plain partitioned RLI, clustering can merge small file groups, split large ones, re-sort data — and simply update the RLI. The layout remains fully optimizable over time, which seems strictly more flexible.

I'd also like to flag the workload profile assumption here. The lazy bootstrap + partition-granularity cache eviction works well for fact table workloads where only recent partitions are actively written to — older partitions go cold, their caches get evicted, and memory stays bounded. But for dimension table workloads, where updates arrive across all partitions randomly and continuously, most partitions stay hot. In that scenario, the cache effectively needs to hold key → bucket mappings for the entire table in memory, and partition-level eviction provides little relief. How would this design handle such workloads without running into memory pressure?

So the question is: is there a specific capability or property that the bucket index framing provides, beyond what partitioned RLI with the existing small file handling already gives us? If the answer is primarily the file naming convention and compatibility with bucket index readers, that might not justify the immutability constraint and the workload limitations. Would love to hear your thoughts.

3 replies

danny0405 Apr 28, 2026
Collaborator

We actually had more discussions offline, and the initial idea may not work well because of hash code conflicts: when hash codes collide, we cannot decide whether the key really exists even if the hash codes are equal. So we are deciding to use the partitioned RLI directly, but with the bucket index style file group id format, so that the local record key -> location mappings can be smaller (a short numeric value can represent the local bucket per partition).

The partitioned RLI would be loaded on demand and evicted when not used (when beyond the current checkpoint cycle and no usage is detected).

cc @cshuo to update with the latest ideas.

cshuo Apr 28, 2026
Collaborator Author

With plain partitioned RLI, clustering can merge small file groups, split large ones, re-sort data — and simply update the RLI. The layout remains fully optimizable over time, which seems strictly more flexible.

The doc is a little stale; I will update it soon. Actually, what you mentioned here is the direction we have chosen for the proposal: plain partitioned RLI with the common file naming convention (not bucket index style).
The motivation for the original dynamic bucket index abstraction was the high memory efficiency of the RLI cache. For example, we could store hash value of record key -> bucket id, in which case only 1 GB of memory is required for 100 million keys. However, since we have to support RLI streaming writes at the same time, that solution does not work: the RLI cache must always store the complete record key to determine whether a key already exists.
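
The 1 GB figure works out to roughly 10 bytes per key. A back-of-the-envelope sketch, assuming an 8-byte hash and a 2-byte bucket id per entry with no map overhead (the entry layout is an assumption, not something stated in the proposal):

```java
/** Hypothetical estimate of the hash -> bucket id cache footprint. */
class HashCacheMemorySketch {

    /** Raw payload size in bytes, ignoring any per-entry overhead of the map structure. */
    static long estimateBytes(long numKeys, int hashBytes, int bucketIdBytes) {
        return numKeys * (hashBytes + bucketIdBytes);
    }
}
```

With 100 million keys this gives 100,000,000 * (8 + 2) = 1,000,000,000 bytes. Storing full record keys instead scales with the key length, which is where the memory advantage is lost.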

But for dimension table workloads, where updates arrive across all partitions randomly and continuously, most partitions stay hot. In that scenario, the cache effectively needs to hold key → bucket mappings for the entire table in memory, and partition-level eviction provides little relief. How would this design handle such workloads without running into memory pressure?

Good point. For this workload, partition-level eviction is not expected to provide much benefit if most partitions are continuously hot. The design relies on a few controls here:

The per-partition cache is not a pure in-memory cache; it can spill, like ExternalSpillableMap. We enforce a total heap budget through a config option; once the in-memory portion exceeds the limit, entries can spill to local disk/RocksDB. So in the worst case we bound heap usage and trade lookup latency for memory safety.
The cache is not a table-wide map. Each assigner only caches record keys that belong to its own key-group range, so the key -> bucket mapping is sharded by the bucket-assign parallelism instead of being duplicated on every task.
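
A dependency-free sketch of the key-group sharding idea. It mirrors the shape of Flink's key-group routing (key -> key group -> operator index), but the hash step is simplified: Flink applies a murmur hash on top of `hashCode()`, while this sketch uses `Math.floorMod` as a stand-in. All names are hypothetical.

```java
/** Simplified stand-in for key-group routing; the hash function here is an assumption. */
class KeyGroupShardingSketch {

    /** Maps a record key to a key group in [0, maxParallelism). */
    static int keyGroupFor(String recordKey, int maxParallelism) {
        return Math.floorMod(recordKey.hashCode(), maxParallelism);
    }

    /** Maps a key group to the index of the assigner subtask that owns it. */
    static int assignerIndexFor(int keyGroup, int parallelism, int maxParallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    /** True if the given subtask is the one that should cache this record key. */
    static boolean ownedBy(String recordKey, int subtaskIndex, int parallelism, int maxParallelism) {
        int keyGroup = keyGroupFor(recordKey, maxParallelism);
        return assignerIndexFor(keyGroup, parallelism, maxParallelism) == subtaskIndex;
    }
}
```

Because every key maps to exactly one subtask, the total cached mapping is split across the assigners rather than replicated, so raising the assigner parallelism lowers the per-task working set.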

So for a dimension-table style workload where almost all partitions are hot, this proposal would not magically avoid maintaining a large working set; it bounds heap and spills/loads as needed. Operators would need to size the total cache, local spill storage, and assigner parallelism according to the update working set. We can also document this as a trade-off/limitation.

cshuo Apr 28, 2026
Collaborator Author

doc updated.
