
Generate delete file statistics for DLO strategy #287

Open · wants to merge 2 commits into base: main

Conversation

@cbb330 (Collaborator) commented Feb 15, 2025

Summary

Problem: delete files are accumulating across tables due to Trino's default write behavior for UPDATE/MERGE/DELETE statements.

Solution:

  1. (this PR) record delete file count/size per table and persist it to the table ✅
  2. let the DLO strategy run for many days to accumulate data 🕐
  3. analyze the optimization algorithm and update the model to discover and act on the tables that would benefit most from compacting their delete files 🕐
  4. (future PR) codify the algorithm updates into the DLO strategy 🕐

Outcome: tables accumulating delete files will be discovered and compacted appropriately to prevent degraded query performance.
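The stats recorded in step 1 reduce to a small aggregation over a table's delete files. Here is a minimal, illustrative sketch — the types below are hypothetical stand-ins, not the PR's actual classes; the real implementation reads this from Iceberg table metadata:

```java
import java.util.List;

// Hypothetical stand-ins for the real metadata types.
enum DeleteType { POSITION, EQUALITY }

record DeleteFileInfo(DeleteType type, long sizeInBytes) {}

record DeleteFileStats(
    long posDeleteFileCount, long eqDeleteFileCount,
    long posDeleteFileBytes, long eqDeleteFileBytes) {

  // Partition a table's delete files by type and sum counts and bytes.
  static DeleteFileStats of(List<DeleteFileInfo> files) {
    long posCount = 0, eqCount = 0, posBytes = 0, eqBytes = 0;
    for (DeleteFileInfo f : files) {
      if (f.type() == DeleteType.POSITION) {
        posCount++;
        posBytes += f.sizeInBytes();
      } else {
        eqCount++;
        eqBytes += f.sizeInBytes();
      }
    }
    return new DeleteFileStats(posCount, eqCount, posBytes, eqBytes);
  }
}

public class DeleteStatsSketch {
  public static void main(String[] args) {
    DeleteFileStats stats = DeleteFileStats.of(List.of(
        new DeleteFileInfo(DeleteType.POSITION, 1024),
        new DeleteFileInfo(DeleteType.POSITION, 2048),
        new DeleteFileInfo(DeleteType.EQUALITY, 512)));
    System.out.println(stats.posDeleteFileCount() + " pos files, "
        + stats.posDeleteFileBytes() + " bytes"); // prints: 2 pos files, 3072 bytes
  }
}
```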

Bonus content ✨:

I noticed there is a gap in DLO testing because manual validation must be done to verify the final form and contents of the DLO table, e.g. #284

so:

  1. will add a functional test to IntegrationTest by moving the DLO output table logic into StrategiesDAOInternal with save/load methods, and validate that DLO finishes and outputs a table with the expected contents and shape 🕐
  2. will add an acceptance test in the internal app to validate that the schema is evolved correctly on the actual tables 🕐

This will help automate testing for future changes and prevent bad changes from occurring when validation is done manually.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Testing Done

  • Manually Tested on local docker setup. Please include commands run, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

@cbb330 force-pushed the main branch 3 times, most recently from 1898935 to 5867db2 on February 15, 2025 at 22:51
@teamurko (Collaborator) left a comment:

Thank you, Christian. Looks good! Left a few questions

Comment on lines +90 to +93
strategy.getPosDeleteFileCount(),
strategy.getEqDeleteFileCount(),
strategy.getPosDeleteFileBytes(),
strategy.getEqDeleteFileBytes()));
Collaborator:

Should it be a nested struct?

Collaborator Author:

I see that these values are related, but a con is that a struct would make ser/de take more steps.

Is there another pro?

Collaborator:

It's just cleaner; we can put it into a struct and, during analysis, read it in Trino as struct.field_name. Non-blocking, though.
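For illustration, here is how the two layouts differ when building row literals, assuming rows are assembled via String.format as elsewhere in this PR. named_struct is Spark SQL's struct constructor, and all column/field names here are hypothetical, not the PR's actual schema:

```java
public class StructRowSketch {
  public static void main(String[] args) {
    // Flat layout, as in this PR: four sibling columns appended to the row tuple.
    String flatRow = String.format(
        "('%s', current_timestamp(), %d, %d, %d, %d)",
        "db.table", 3L, 1L, 4096L, 1024L);

    // Nested alternative: named_struct() groups the four fields into one struct
    // column; an analysis query in Trino then reads them as delete_stats.field_name.
    String structRow = String.format(
        "('%s', current_timestamp(), named_struct("
            + "'pos_delete_file_count', %d, 'eq_delete_file_count', %d, "
            + "'pos_delete_file_bytes', %d, 'eq_delete_file_bytes', %d))",
        "db.table", 3L, 1L, 4096L, 1024L);

    System.out.println(flatRow);
    System.out.println(structRow);
  }
}
```

The trade-off discussed above: the struct groups related fields under one name for readers, at the cost of extra ser/de steps when writing and parsing the row.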

@@ -0,0 +1,131 @@
package com.linkedin.openhouse.datalayout.persistence;
Collaborator:

Can we do the DAO refactor in a separate PR and not mix it with the delete file stats PR?

Collaborator:

In fact, let's do the delete file stats PR first and then the DAO refactor; we'll need to think more about that, and it will take longer to land.

@cbb330 (Collaborator Author) commented Feb 18, 2025:

OK, I'll take this out of this PR.

> will need to think more about it and will take longer for that to land.

I'd like to follow up with the refactor shortly after this PR deploys and works. What's the concern? The refactor is intended not to change behavior.

Collaborator:

No major concern.

@sumedhsakdeo (Collaborator) left a comment:

Reduce scope of this PR to not include DAO refactor.

} else {
  rows.add(
      String.format(
-         "('%s', current_timestamp(), %f, %f, %f)",
-         fqtn, strategy.getCost(), strategy.getGain(), strategy.getEntropy()));
+         "('%s', current_timestamp(), %f, %f, %f, %d, %d, %d, %d)",
Collaborator:

%d for bytes? Do you see a risk of overflow?

@cbb330 (Collaborator Author) commented Feb 19, 2025:

Good question; no, it should not.

An initial search says %d is sufficient for Long.MAX_VALUE.

Diving deeper didn't yield a clear answer in the code/docs, so I tested it on a range of very large numbers:
https://replit.com/join/xeprpdjctd-chrisbush747
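The gist of that check can be sketched locally (a minimal standalone sketch, not the replit contents):

```java
public class LongFormatCheck {
    public static void main(String[] args) {
        // %d accepts a long and prints the full range without overflow or truncation
        String max = String.format("%d", Long.MAX_VALUE);
        String min = String.format("%d", Long.MIN_VALUE);
        System.out.println(max); // 9223372036854775807
        System.out.println(min); // -9223372036854775808

        // Round trip: parsing the formatted string recovers the exact value
        assert Long.parseLong(max) == Long.MAX_VALUE;
        assert Long.parseLong(min) == Long.MIN_VALUE;
    }
}
```

(%f on doubles is the riskier placeholder in that row format, since it rounds to six decimal places by default; %d on a long is exact.)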

Collaborator:

Cool, make sure you test the SQL execution as well.

Collaborator Author:

Sure thing!

This code path isn't tested in main.

I added unit tests and they were green, but they required a bit of refactoring, which we chose to remove from this PR.

I will follow up with the tests again shortly after this PR is deployed.

In the follow-up, I will also ensure that Long.MAX_VALUE is passed from the POJO, through string formatting and SQL, and read back correctly.

private String path;
private long size;
private long sizeInBytes;
Collaborator:

Is this serialized as size into tblproperty?

Collaborator Author:

No, this value is only used in intermediate processing before being computed into the DataLayoutStrategy (DLS) object. The DLS object is then serialized into a tblproperty.

here's the intermediate processing: https://github.com/linkedin/openhouse/pull/287/files#diff-454fe015ffc81fffb4f558b25e865396a81ebca0f3b2444e422d3379cf1ccf29R149

and here is where DLS is serialized into tblproperty:

@teamurko (Collaborator) left a comment:

Thank you, looks clean; just one more question.


.partitionValues(partitionValues)
.build();
}

public static FileContent fromId(int id) {
Collaborator:

Suggested change:
- public static FileContent fromId(int id) {
+ private static FileContent fromId(int id) {

Comment on lines +33 to +35
List<DataLayoutStrategy> strategies = strategyGenerator.generatePartitionLevelStrategies();
Assertions.assertEquals(10, strategies.size());

Collaborator:

Why this change?

Collaborator Author:

Doing a bit of side cleanup, since this is a bug:

  1. the test is for a partitioned table
  2. the strategy generated at the end is for a non-partitioned table

So I'm fixing it to generate a strategy for a partitioned table.

@teamurko teamurko self-requested a review February 21, 2025 03:53