-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add validators for History / Snapshot Retention Policy #259
Add validators for History / Snapshot Retention Policy #259
Conversation
...ain/java/com/linkedin/openhouse/tables/api/spec/v0/request/components/SnapshotRetention.java
Outdated
Show resolved
Hide resolved
...a/com/linkedin/openhouse/tables/api/validator/impl/SnapshotRetentionPolicySpecValidator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first round pass -- some minor points, main thing was about the minimal number of versions.
This is sort of an important check IMO since we've advertising the "Fast recovery" from bad data corruption because of this, comparing to traditional table format.
...bles/src/main/java/com/linkedin/openhouse/tables/api/spec/v0/request/components/History.java
Show resolved
Hide resolved
...les/src/main/java/com/linkedin/openhouse/tables/api/spec/v0/request/components/Policies.java
Show resolved
Hide resolved
.../main/java/com/linkedin/openhouse/tables/api/validator/impl/OpenHouseTablesApiValidator.java
Outdated
Show resolved
Hide resolved
...c/main/java/com/linkedin/openhouse/tables/api/validator/impl/HistoryPolicySpecValidator.java
Outdated
Show resolved
Hide resolved
...c/main/java/com/linkedin/openhouse/tables/api/validator/impl/HistoryPolicySpecValidator.java
Outdated
Show resolved
Hide resolved
...c/main/java/com/linkedin/openhouse/tables/api/validator/impl/HistoryPolicySpecValidator.java
Show resolved
Hide resolved
...main/java/com/linkedin/openhouse/tables/api/validator/impl/RetentionPolicySpecValidator.java
Show resolved
Hide resolved
apps/spark/src/main/java/com/linkedin/openhouse/jobs/util/HistoryConfig.java
Outdated
Show resolved
Hide resolved
apps/spark/src/main/java/com/linkedin/openhouse/jobs/util/HistoryConfig.java
Outdated
Show resolved
Hide resolved
Nice. can we also capture the error messages as part of testing section in this PR? |
...bles/src/main/java/com/linkedin/openhouse/tables/api/spec/v0/request/components/History.java
Outdated
Show resolved
Hide resolved
...c/main/java/com/linkedin/openhouse/tables/api/validator/impl/HistoryPolicySpecValidator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the comments.
## Summary <!--- HINT: Replace #nnn with corresponding Issue number, if you are fixing an existing issue --> Adds Spark SQL support for configurable table snapshots, which controls the versioning of the Openhouse tables. Syntax is similar to retention but is instead defined as `HISTORY`. History configuration supports both `MAX_AGE` and `VERSIONS`, where we retain all table snapshots that live within `MAX_AGE` and within `VERSIONS`. Example: A table with `MAX_AGE = 1d` will retain all snapshots that are within 1 day of when the snapshot retention job last ran. A table with `VERSIONS = 5` will retain the last 5 snapshots of the table without considering the age of the snapshots If both `MAX_AGE = 1d` and `VERSIONS = 5` is defined, keep the last 5 snapshots within the last day. Note: If there are less than 5 snapshots, then there were less than 5 commits done in the past day. `MAX_AGE` and `VERSIONS` cannot be defined as less than 1. The default maximums of `MAX_AGE` and `VERSIONS` defined in #259 are 3 days and 100 versions respectively. Examples: ``` ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=24H VERSIONS=10) ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=3D) ALTER TABLE <db>.<table> SET POLICY (HISTORY VERSIONS=5) ``` ## Changes - [x] Client-facing API Changes - [x] Internal API Changes - [ ] Bug Fixes - [x] New Features - [ ] Performance Improvements - [ ] Code Style - [ ] Refactoring - [ ] Documentation - [x] Tests For all the boxes checked, please include additional details of the changes made in this pull request. ## Testing Done <!--- Check any relevant boxes with "x" --> - [x] Manually Tested on local docker setup. Please include commands ran, and their output. - [x] Added new tests for the changes made. - [x] Updated existing tests to reflect the changes made. - [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help. - [ ] Some other form of testing like staging or soak time in production. Please explain. Tested on local docker running spark: Tested setting both policies ``` scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=3d VERSIONS=3 )").show ++ || ++ ++ scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false) +--------+-----------------------------------------------------------------------------------------------------------------+ |key |value | +--------+-----------------------------------------------------------------------------------------------------------------+ |policies|{ "sharingEnabled": false, "history": { "maxAge": 3, "granularity": "DAY", "versions": 3 } }| +--------+-----------------------------------------------------------------------------------------------------------------+ ``` Setting only versions: ``` scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY VERSIONS=20 )").show ++ || ++ ++ scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false) +--------+----------------------------------------------------------------------------------------+ |key |value | +--------+----------------------------------------------------------------------------------------+ |policies|{ "sharingEnabled": false, "history": { "maxAge": 0, "versions": 20 } }| +--------+----------------------------------------------------------------------------------------+ ``` Setting only max age ``` scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=8h )").show ++ || ++ ++ scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false) +--------+------------------------------------------------------------------------------------------------------------------+ |key |value | +--------+------------------------------------------------------------------------------------------------------------------+ |policies|{ "sharingEnabled": false, "history": { "maxAge": 8, "granularity": "HOUR", "versions": 0 } }| +--------+------------------------------------------------------------------------------------------------------------------+ ``` Also tested negative cases (invalid numbers, past maximums defined in #259) e.g. ``` scala> spark.sql("ALTER TABLE openhouse.db.tb SET POLICY ( HISTORY MAX_AGE=1h VERSIONS=2 )").show org.apache.iceberg.exceptions.BadRequestException: 400 , {"status":"BAD_REQUEST","error":"Bad Request","message":" : History for the table [LocalHadoopCluster.db.tb] max age must be between 1 to 3 days","stacktrace":null,"cause":"Not Available"} ``` For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request. # Additional Information - [ ] Breaking Changes - [ ] Deprecations - [ ] Large PR broken into smaller PRs, and PR plan linked in the description. For all the boxes checked, include additional details of the changes made in this pull request. --------- Co-authored-by: Stas Pak <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
## Summary <!--- HINT: Replace #nnn with corresponding Issue number, if you are fixing an existing issue --> [[Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly discuss the summary of the changes made in this pull request in 2-3 lines. Following up from #262 and #259 this PR adds support for snapshot expiration table maintenance job to use the history policies defined. Most notably snapshot expiration will follow the settings of `maxAge`, `granularity`, and `versions` as follows: 1. If maxAge is provided, it will expire snapshots older than maxAge in granularity timeunit. 2. If versions is provided, it will retain the last versions snapshots regardless of their age. 3. If both are provided, it will prioritize maxAge; only retain up to versions number of snapshots younger than the maxAge. This is done by pruning the snapshots older than maxAge, and then running a second expiration to keeping N versions after that. Note: If versions are defined and there are less than N versions in the history, then there were not enough commits (within that timespan if defined). Snapshot expiration will always keep at least 1 version. The default behavior of snapshot expiration job will remain the same, keep snapshots within the last 3 days. ## Changes - [ ] Client-facing API Changes - [ ] Internal API Changes - [ ] Bug Fixes - [ ] New Features - [ ] Performance Improvements - [ ] Code Style - [ ] Refactoring - [ ] Documentation - [ ] Tests For all the boxes checked, please include additional details of the changes made in this pull request. ## Testing Done <!--- Check any relevant boxes with "x" --> - [ ] Manually Tested on local docker setup. Please include commands ran, and their output. - [x] Added new tests for the changes made. - [x] Updated existing tests to reflect the changes made. - [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help. - [ ] Some other form of testing like staging or soak time in production. Please explain. For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request. # Additional Information - [ ] Breaking Changes - [ ] Deprecations - [ ] Large PR broken into smaller PRs, and PR plan linked in the description. For all the boxes checked, include additional details of the changes made in this pull request.
Summary
Supports defining a policy to control the versions of snapshots tables to retain.
The policy will support 2 types of configurations:
It can also support a combination of the 2 policies, in which case it will retain versions that fall under both categories (e.g. only keep versions newer than 3 days AND within 10 versions).
This PR also refactors the PolicySpecValidator to accurately depict that it's referencing to partition-based retention of tables. In the future, we should refactor these policy classes to follow some
Policy
interface for better standardization for future policy support (duplicated interfaces between Replication, Partition Retention, Retain Snapshot).This PR also defines the maximums for snapshot retention defaults (TODO: make this configurable/static?)
The current maximums defined is 3 days and 100 versions
Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
Ran on local docker and tested using postman by creating a new table with the following:
Tested getTable after and confirmed the policy was showing correctly.
Also tested the negative cases (high version count, negative numbers) which worked as well.
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.