feat: enable file merging by last modification time using preserve-insertion-order #3157
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
@esarili can you sign off the commits? There should be directions in the failing CI check below. https://github.com/delta-io/delta-rs/pull/3157/checks?check_run_id=36132013214
a03f097 to f4cfa9f
Codecov Report
Coverage Diff
             main    #3157      +/-
Coverage   72.18%   72.16%   -0.02%
Files         138      138
Lines       45292    45299       +7
Branches    45292    45299       +7
Hits        32692    32688       -4
Misses      10538    10531       -7
Partials     2062     2080      +18
writer_properties: WriterProperties,
preserve_insertion_order: bool,
I'm curious what you think about this going into WriterProperties rather than all these functions growing an additional argument.
Does this only benefit call paths for the create_merge_plan flow?
The writer properties are used in other operations, so we should only add it if it makes sense for those ops as well.
That was my thinking as well
As it so happens, @hntd187 and I just had a quick discussion about using file update times more generally. It turns out there are at least some scenarios where this may not do what one would expect. For the use case at hand, I wonder if this could also be done with a "single value z-order", which should degrade to plain sorting for a single value while also targeting a specific file size. This of course assumes that some value in the data correlates with when the data was inserted. If that were not the case, though, there would not be much value in preserving the order, as query engines could not leverage this information.
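To illustrate the "degrades to sorting" point, here is a minimal sketch of bit-interleaved Z-order keys. The function `z_value` and the 16-bit column width are assumptions made for this example; it is not the delta-rs z-order implementation.

```rust
/// Interleaves the bits of one to four 16-bit column values into a single
/// Z-order key. Illustrative only: `z_value` is a hypothetical helper,
/// not part of delta-rs.
fn z_value(cols: &[u16]) -> u64 {
    let n = cols.len();
    assert!((1..=4).contains(&n), "this sketch supports 1 to 4 columns");
    let mut out = 0u64;
    for bit in 0..16usize {
        for (i, c) in cols.iter().enumerate() {
            // Bit `bit` of column `i` lands at position `bit * n + i`.
            if (c >> bit) & 1 == 1 {
                out |= 1u64 << (bit * n + i);
            }
        }
    }
    out
}
```

With a single column there is nothing to interleave, so the key equals the value itself and ordering by the key is ordering by the column. With two or more columns the bits alternate, which is what gives Z-order its multi-dimensional locality.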
We run the optimize command at regular intervals (every two hours, on the last two partitions; the data is partitioned by day). AFAIU, z-order runs on the entire partition and sorts data at the record level, which might be resource intensive and unnecessary for append-only workflows. By using file update times, we hope to avoid sorting the entire partition record by record while still getting some locality after optimize runs.
…sertion-order
This change leverages the previously unused `preserve-insertion-order` configuration to enable merging files sorted by their last modification time during compaction. This is particularly beneficial for append-only workloads, improving data locality after optimize runs by merging files that were created around similar times.
Signed-off-by: esarili <[email protected]>
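The idea can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `FileMeta`, `plan_bins`, and the size-descending fallback ordering are assumptions made for the example, and the packing is a plain sequential pass.

```rust
/// Hypothetical stand-in for a data file entry; not a delta-rs type.
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    path: String,
    /// File size in bytes.
    size: u64,
    /// Last modification time in milliseconds since the Unix epoch.
    modification_time: i64,
}

/// Groups files into bins whose total size stays at or below `target_size`
/// (a single file larger than the target gets a bin of its own).
/// With `preserve_insertion_order` set, files are ordered oldest first, so
/// each bin only holds files that are neighbours in modification time.
fn plan_bins(
    mut files: Vec<FileMeta>,
    target_size: u64,
    preserve_insertion_order: bool,
) -> Vec<Vec<FileMeta>> {
    if preserve_insertion_order {
        // Stable sort: files with equal timestamps keep their input order.
        files.sort_by_key(|f| f.modification_time);
    } else {
        // Assumed fallback for this sketch: largest files first.
        files.sort_by(|a, b| b.size.cmp(&a.size));
    }

    let mut bins: Vec<Vec<FileMeta>> = Vec::new();
    let mut current: Vec<FileMeta> = Vec::new();
    let mut current_size = 0u64;
    for file in files {
        // Close the current bin when the next file would overflow it.
        if !current.is_empty() && current_size + file.size > target_size {
            bins.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += file.size;
        current.push(file);
    }
    if !current.is_empty() {
        bins.push(current);
    }
    bins
}
```

Because the pass is sequential over the time-sorted list, no records are sorted at all; only file metadata is. That is the cost difference from z-order raised earlier in the thread.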
2e60661 to 7333a36
@rtyler @hntd187 @roeap @ion-elgreco, is this change good to be merged?