Generate delete file statistics for DLO strategy #287
base: main
Conversation
Thank you, Christian. Looks good! Left a few questions
strategy.getPosDeleteFileCount(),
strategy.getEqDeleteFileCount(),
strategy.getPosDeleteFileBytes(),
strategy.getEqDeleteFileBytes()));
Should it be a nested struct?
I see that these values are related, but a con is that a struct would make ser/de have more steps.
Is there another pro?
It's just cleaner: we can put it into a struct, and during analysis read it in Trino as struct.field_name. Non-blocking though.
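For illustration only, here is a minimal sketch of the two shapes being discussed: flat fields on the strategy versus a nested stats object that Trino could read as struct.field_name. Class and field names are assumptions, not the PR's actual schema.

```java
// Illustrative only; names are assumptions, not the PR's actual schema.
class StructVsFlatSketch {
  // Option A: flat fields directly on the strategy (the approach in this PR).
  static class FlatStrategy {
    long posDeleteFileCount;
    long eqDeleteFileCount;
    long posDeleteFileBytes;
    long eqDeleteFileBytes;
  }

  // Option B: group the related values into a nested object. Serialized as a
  // struct, Trino could read it as delete_file_stats.pos_delete_file_count,
  // at the cost of an extra ser/de step.
  static class DeleteFileStats {
    long posDeleteFileCount;
    long eqDeleteFileCount;
    long posDeleteFileBytes;
    long eqDeleteFileBytes;
  }

  static class NestedStrategy {
    DeleteFileStats deleteFileStats;
  }
}
```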
...src/test/java/com/linkedin/openhouse/jobs/spark/DataLayoutStrategyGeneratorSparkAppTest.java (outdated, resolved)
libs/datalayout/src/main/java/com/linkedin/openhouse/datalayout/datasource/FileStat.java (outdated, resolved)
...ark/src/main/java/com/linkedin/openhouse/jobs/spark/DataLayoutStrategyGeneratorSparkApp.java (outdated, resolved)
@@ -0,0 +1,131 @@
package com.linkedin.openhouse.datalayout.persistence;
Can we do the DAO refactor in a separate PR and not mix it with the delete file stats PR?
In fact, let's do the delete file stats PR first and then the DAO refactor; we'll need to think more about it, and it will take longer for that to land.
OK, will take this out of this PR.

> will need to think more about it and will take longer for that to land.

I'd like to follow up with the refactor shortly after this PR deploys and works. What's the concern? The refactor is intended to not change behavior.
No major concern.
Reduced the scope of this PR to not include the DAO refactor.
} else {
  rows.add(
      String.format(
          "('%s', current_timestamp(), %f, %f, %f)",
          fqtn, strategy.getCost(), strategy.getGain(), strategy.getEntropy()));
          "('%s', current_timestamp(), %f, %f, %f, %d, %d, %d, %d)",
%d for bytes? Do you see a risk of overflow?
Good question; no, it should not. An initial search says %d is sufficient for Long.MAX_VALUE. Diving deeper, there is no clear code/doc answer, so I tested it on a range of very large numbers: https://replit.com/join/xeprpdjctd-chrisbush747
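For reference, here is a minimal standalone check (not part of the PR) that %d in String.format renders a long at Long.MAX_VALUE without truncation:

```java
import java.util.Locale;

public class FormatLongCheck {
  public static void main(String[] args) {
    long maxBytes = Long.MAX_VALUE;
    // %d accepts long arguments directly; nothing is narrowed to int, so no overflow.
    String row = String.format(Locale.ROOT, "('%s', %f, %d)", "db.table", 0.5, maxBytes);
    System.out.println(row); // prints ('db.table', 0.500000, 9223372036854775807)
  }
}
```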
Cool, make sure you test the SQL execution as well.
Sure thing!
This code path isn't tested in main. I added unit tests and they were green, but they required a bit of refactoring, which we chose to remove from this PR.
I will follow up with the tests again shortly after this PR is deployed.
In the follow-up, I will also ensure that Long.MAX_VALUE is passed from the POJO, through string formatting and SQL, and read back correctly.
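A rough sketch of the kind of round-trip check described above, assuming a local SparkSession; this is not the actual follow-up test:

```java
import org.apache.spark.sql.SparkSession;

public class LongRoundTripSketch {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().master("local[1]").appName("long-round-trip").getOrCreate();
    long maxBytes = Long.MAX_VALUE;
    // Format the value into SQL text, execute it, and read it back as a long.
    String sql = String.format("SELECT %d AS pos_delete_file_bytes", maxBytes);
    long readBack = spark.sql(sql).first().getLong(0);
    if (readBack != Long.MAX_VALUE) {
      throw new AssertionError("Long.MAX_VALUE did not survive the round trip");
    }
    spark.stop();
  }
}
```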
private String path;
private long size;
private long sizeInBytes;
Is this serialized as size into tblproperty?
No, this value is only used in intermediate processing before being computed into the DataLayoutStrategy (DLS) object. The DLS object is then serialized into the tblproperty.
Here's the intermediate processing: https://github.com/linkedin/openhouse/pull/287/files#diff-454fe015ffc81fffb4f558b25e865396a81ebca0f3b2444e422d3379cf1ccf29R149
And here is where the DLS is serialized into the tblproperty (line 32 in 5a0ec3c):
fqtn, DATA_LAYOUT_STRATEGIES_PROPERTY_KEY, propValue));
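Roughly, that persistence step could look like the sketch below. The Gson usage, the ALTER TABLE statement, and the property key value are assumptions based on the snippet above, not the actual DAO code:

```java
import com.google.gson.Gson;
import java.util.List;
import org.apache.spark.sql.SparkSession;

public class StrategiesPropertySketch {
  // Assumed property key; the real constant lives in the DAO.
  private static final String DATA_LAYOUT_STRATEGIES_PROPERTY_KEY = "data-layout.strategies";

  // Serialize the strategies to JSON and store them as a table property
  // (quote escaping of the JSON payload omitted for brevity).
  public static void save(SparkSession spark, String fqtn, List<?> strategies) {
    String propValue = new Gson().toJson(strategies);
    spark.sql(
        String.format(
            "ALTER TABLE %s SET TBLPROPERTIES ('%s' = '%s')",
            fqtn, DATA_LAYOUT_STRATEGIES_PROPERTY_KEY, propValue));
  }
}
```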
Thank you, looks clean, just one more question
.partitionValues(partitionValues)
.build();
}

public static FileContent fromId(int id) {
Suggested change:
- public static FileContent fromId(int id) {
+ private static FileContent fromId(int id) {
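For context, a helper like this usually maps Iceberg's integer content id back to the enum. A minimal sketch assuming org.apache.iceberg.FileContent, not necessarily the PR's exact implementation:

```java
import org.apache.iceberg.FileContent;

class FileContentIds {
  // Map an integer content id (0=DATA, 1=POSITION_DELETES, 2=EQUALITY_DELETES)
  // back to the corresponding enum value.
  private static FileContent fromId(int id) {
    for (FileContent content : FileContent.values()) {
      if (content.id() == id) {
        return content;
      }
    }
    throw new IllegalArgumentException("Unknown file content id: " + id);
  }
}
```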
List<DataLayoutStrategy> strategies = strategyGenerator.generatePartitionLevelStrategies();
Assertions.assertEquals(10, strategies.size());
Why this change?
Doing a bit of side cleaning, since this is a bug:
- the test is for a partitioned table
- the strategy generated at the end is for a non-partitioned table

So I'm fixing it to generate a strategy for the partitioned table.
Summary
problem: delete files are accumulating across tables due to Trino's default write behavior for UPDATE/MERGE/DELETE statements
solution: generate delete file statistics (positional and equality delete file counts and bytes) for each DLO strategy and include them in the persisted strategy statistics
outcome: tables accumulating delete files will be discovered and compacted appropriately to prevent degraded query performance
✨ bonus content ✨:
I noticed there is a gap in DLO testing because manual validation must be done to verify the final form and contents of the DLO table, e.g. #284.
so: added DataLayoutStrategyGeneratorSparkAppTest to verify the final form and contents of the DLO table (see the sketch below).
This will help automate testing for future changes and prevent bad changes from occurring due to validation being done manually.
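As an illustration, here is the kind of automated check this enables; the table and column names are assumptions rather than the PR's actual schema:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.Assertions;

class DloOutputTableCheckSketch {
  // Read the DLO strategies output table and assert the delete-file columns exist,
  // instead of validating the table's shape by hand after each change.
  static void verifyDeleteFileColumns(SparkSession spark, String outputTable) {
    Dataset<Row> stats = spark.table(outputTable);
    List<String> columns = Arrays.asList(stats.columns());
    Assertions.assertTrue(columns.contains("pos_delete_file_count"));
    Assertions.assertTrue(columns.contains("eq_delete_file_count"));
    Assertions.assertTrue(columns.contains("pos_delete_file_bytes"));
    Assertions.assertTrue(columns.contains("eq_delete_file_bytes"));
  }
}
```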
Changes
Testing Done
Additional Information