the-other-tim-brown
Contributor

Describe the issue this Pull Request addresses

Updates the design to allow multiple file slices per column group in order to control file sizing, and raises new questions for discussion.

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…ion, update some terminology to match 1.0 classes
@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Oct 15, 2025

## Background
Currently, Hudi organizes data according to fileGroup granularity. The fileGroup is further divided into column clusters to introduce the columngroup concept.
Currently, Hudi organizes data according to fileGroup granularity. The fileGroup is further divided into column clusters to introduce the columngroup concept. Within a ColumnGroup, there is a ColumnSegment that allows for multiple file slices for the same column group to prevent large files. Inside each ColumnSegment, there are one or more FileSlices.
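
For readers following the hierarchy described in this hunk, here is a minimal sketch of the nesting (fileGroup, then column group, then column segment, then file slices). The class and field names are hypothetical and do not correspond to actual Hudi classes. It also assumes that a column group may hold several segments, which the text implies but does not state outright.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileSlice:
    # Hypothetical: one base file plus the log files layered on top of it.
    base_file: str
    log_files: List[str] = field(default_factory=list)


@dataclass
class ColumnSegment:
    # Hypothetical: a bounded chunk of a column group, holding one or more file slices.
    segment_id: str
    file_slices: List[FileSlice] = field(default_factory=list)


@dataclass
class ColumnGroup:
    # Hypothetical: a named subset of columns (assumed to hold several segments).
    name: str
    columns: List[str]
    segments: List[ColumnSegment] = field(default_factory=list)


@dataclass
class FileGroup:
    file_id: str
    column_groups: List[ColumnGroup] = field(default_factory=list)


def count_file_slices(file_group: FileGroup) -> int:
    """Count every file slice across all column groups and segments."""
    return sum(
        len(segment.file_slices)
        for column_group in file_group.column_groups
        for segment in column_group.segments
    )
```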
Member

Does RLI now point to a fileGroup/column segment?

- Base file: [file_id]\_[write_token]\_[begin_time][_cfName].[extension]
- Log file: [file_id]\_[begin_instant_time][_cfName].log.[version]_[write_token]
- Base file: [file_id]\_[write_token]\_[begin_time][_cgName_cgSegment].[extension]
- Log file: [file_id]\_[begin_instant_time][_cgName_cgSegment].log.[version]_[write_token]
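
As an illustration of the updated naming patterns, the sketch below assembles base and log file names from their parts. The function names and sample values are hypothetical, not Hudi APIs. It assumes the bracketed `[_cgName_cgSegment]` part is an optional suffix that is omitted when no column group is involved.

```python
from typing import Optional


def _cg_suffix(cg_name: Optional[str], cg_segment: Optional[str]) -> str:
    # Assumption: the suffix appears only when both parts are supplied.
    if cg_name is None or cg_segment is None:
        return ""
    return f"_{cg_name}_{cg_segment}"


def base_file_name(file_id: str, write_token: str, begin_time: str, extension: str,
                   cg_name: Optional[str] = None, cg_segment: Optional[str] = None) -> str:
    """[file_id]_[write_token]_[begin_time][_cgName_cgSegment].[extension]"""
    suffix = _cg_suffix(cg_name, cg_segment)
    return f"{file_id}_{write_token}_{begin_time}{suffix}.{extension}"


def log_file_name(file_id: str, begin_instant_time: str, version: int, write_token: str,
                  cg_name: Optional[str] = None, cg_segment: Optional[str] = None) -> str:
    """[file_id]_[begin_instant_time][_cgName_cgSegment].log.[version]_[write_token]"""
    suffix = _cg_suffix(cg_name, cg_segment)
    return f"{file_id}_{begin_instant_time}{suffix}.log.{version}_{write_token}"
```

With illustrative values, `base_file_name("f1", "0-1-2", "20251015120000000", "parquet", "cg1", "0")` yields `f1_0-1-2_20251015120000000_cg1_0.parquet`.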
Member

cgName and cgSegment will be autogenerated by Hudi, right?

Are these UUIDs?

Member

Just noticed this: given that begin_time is all digits, could we stick to the convention of begin_time being at the end of the file name, just before the extension?

Since begin_time is a fixed-length string, I feel it's more useful to put fixed-length details right before the extension. One can then just split on the period and move N characters from there.

I assume cgName_cgSegment is variable length. That might make extracting begin_time harder in the future: users would have to fall back to regex instead of just using a few keyboard shortcuts.
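
To make the trade-off in this comment concrete, here is a rough sketch comparing the two extraction approaches. Everything in it is an assumption for illustration: the 17-digit begin_time length, the sample names, and in particular the alternative ordering `[file_id]_[write_token][_cgName_cgSegment]_[begin_time].[extension]`, which is only one possible reading of the suggestion and is not in the PR.

```python
import re

BEGIN_TIME_LEN = 17  # assumption: fixed-length, all-digit begin_time


def begin_time_fixed(name: str) -> str:
    """Alternative ordering: begin_time sits directly before the first period."""
    stem = name.split(".", 1)[0]
    return stem[-BEGIN_TIME_LEN:]


# Ordering from the PR: a variable-length suffix follows begin_time,
# so the position of begin_time has to be found by pattern matching.
_PR_BASE_FILE = re.compile(
    r"^(?P<file_id>[^_]+)_(?P<write_token>[^_]+)_(?P<begin_time>\d+)"
    r"(?:_(?P<cg>.+))?\.(?P<ext>[^.]+)$"
)


def begin_time_regex(name: str) -> str:
    """PR ordering: parse the whole name to locate begin_time."""
    match = _PR_BASE_FILE.match(name)
    if match is None:
        raise ValueError(f"unrecognized base file name: {name}")
    return match.group("begin_time")
```

The fixed-length version needs no knowledge of the other name components, while the regex version depends on assumptions about which characters file_id and write_token may contain (here, no underscores).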

@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build


4 participants