Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds SQL support for Configurable Table Snapshot History #262

Merged
merged 10 commits into from
Dec 19, 2024

Conversation

Will-Lo
Copy link
Collaborator

@Will-Lo Will-Lo commented Dec 4, 2024

Summary

Adds Spark SQL support for configurable table snapshots, which controls the versioning of the Openhouse tables.
Syntax is similar to retention but is instead defined as HISTORY.

History configuration supports both MAX_AGE and VERSIONS, where we retain all table snapshots that live within MAX_AGE and within VERSIONS.

Example: A table with MAX_AGE = 1d will retain all snapshots that are within 1 day of when the snapshot retention job last ran.
A table with VERSIONS = 5 will retain the last 5 snapshots of the table without considering the age of the snapshots
If both MAX_AGE = 1d and VERSIONS = 5 is defined, keep the last 5 snapshots within the last day. Note: If there are less than 5 snapshots, then there were less than 5 commits done in the past day.

MAX_AGE and VERSIONS cannot be defined as less than 1.
The default maximums of MAX_AGE and VERSIONS defined in #259 are 3 days and 100 versions respectively.

Examples:

ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=24H VERSIONS=10)
ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=3D)
ALTER TABLE <db>.<table> SET POLICY (HISTORY VERSIONS=5)

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Tested on local docker running spark:
Tested setting both policies

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=3d VERSIONS=3 )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+-----------------------------------------------------------------------------------------------------------------+
|key     |value                                                                                                            |
+--------+-----------------------------------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 3,
    "granularity": "DAY",
    "versions": 3
  }
}|
+--------+-----------------------------------------------------------------------------------------------------------------+

Setting only versions:

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY VERSIONS=20 )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+----------------------------------------------------------------------------------------+
|key     |value                                                                                   |
+--------+----------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 0,
    "versions": 20
  }
}|
+--------+----------------------------------------------------------------------------------------+

Setting only max age

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=8h )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+------------------------------------------------------------------------------------------------------------------+
|key     |value                                                                                                             |
+--------+------------------------------------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 8,
    "granularity": "HOUR",
    "versions": 0
  }
}|
+--------+------------------------------------------------------------------------------------------------------------------+

Also tested negative cases (invalid numbers, past maximums defined in #259)
e.g.

scala> spark.sql("ALTER TABLE openhouse.db.tb SET POLICY ( HISTORY MAX_AGE=1h VERSIONS=2 )").show
org.apache.iceberg.exceptions.BadRequestException: 400 , {"status":"BAD_REQUEST","error":"Bad Request","message":" : History for the table [LocalHadoopCluster.db.tb] max age must be between 1 to 3 days","stacktrace":null,"cause":"Not Available"}


For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@Will-Lo Will-Lo marked this pull request as ready for review December 4, 2024 02:32
Copy link
Member

@abhisheknath2011 abhisheknath2011 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Will-Lo. Overall looks good. Added some comments.

@Will-Lo Will-Lo changed the title Adds SQL support for Configurable Table Snapshots Adds SQL support for Configurable Table Snapshot History Dec 4, 2024
@Will-Lo Will-Lo closed this Dec 6, 2024
@Will-Lo Will-Lo reopened this Dec 6, 2024
@Will-Lo Will-Lo force-pushed the snapshots_policy_sql branch from a2316b4 to 6b7999a Compare December 6, 2024 19:16
@Will-Lo Will-Lo closed this Dec 6, 2024
@Will-Lo Will-Lo reopened this Dec 6, 2024
@maluchari
Copy link
Collaborator

Summary

Adds Spark SQL support for configurable table snapshots, which controls the versioning of the Openhouse tables. Syntax is similar to retention but is instead defined as HISTORY.

Examples:

ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=24H MIN_VERSIONS=10)
ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=3D)
ALTER TABLE <db>.<table> SET POLICY (HISTORY MIN_VERSIONS=5)

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Tested on local docker running spark: Tested setting both policies

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=3d MIN_VERSIONS=3 )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+-----------------------------------------------------------------------------------------------------------------+
|key     |value                                                                                                            |
+--------+-----------------------------------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 3,
    "granularity": "DAY",
    "minVersions": 3
  }
}|
+--------+-----------------------------------------------------------------------------------------------------------------+

Setting only versions:

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MIN_VERSIONS=20 )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+----------------------------------------------------------------------------------------+
|key     |value                                                                                   |
+--------+----------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 0,
    "minVersions": 20
  }
}|
+--------+----------------------------------------------------------------------------------------+

Setting only max age

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=8h )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+------------------------------------------------------------------------------------------------------------------+
|key     |value                                                                                                             |
+--------+------------------------------------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 8,
    "granularity": "HOUR",
    "minVersions": 0
  }
}|
+--------+------------------------------------------------------------------------------------------------------------------+

Also tested negative cases (invalid numbers, past maximums defined in #259)

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Can we please add details regarding the SQL API contract? If both versions and max age is present what is expected for example.

maluchari
maluchari previously approved these changes Dec 16, 2024
Copy link
Collaborator

@maluchari maluchari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments. But overall LGTM

Copy link
Member

@abhisheknath2011 abhisheknath2011 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for refactoring the tests and answering the questions. Added minor comment, otherwise LGTM!

autumnust
autumnust previously approved these changes Dec 16, 2024
Copy link
Collaborator

@autumnust autumnust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Will-Lo Will-Lo dismissed stale reviews from abhisheknath2011 and maluchari via 6039e0b December 17, 2024 19:53
@Will-Lo Will-Lo merged commit abccdc3 into linkedin:main Dec 19, 2024
1 check passed
Will-Lo added a commit that referenced this pull request Jan 6, 2025
## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

[[Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly
discuss the summary of the changes made in this
pull request in 2-3 lines.

Following up from #262 and
#259
this PR adds support for snapshot expiration table maintenance job to
use the history policies defined.

Most notably snapshot expiration will follow the settings of `maxAge`,
`granularity`, and `versions` as follows:
1. If maxAge is provided, it will expire snapshots older than maxAge in
granularity timeunit.
2. If versions is provided, it will retain the last versions snapshots
regardless of their age.
3. If both are provided, it will prioritize maxAge; only retain up to
versions number of snapshots younger than the maxAge. This is done by
pruning the snapshots older than maxAge, and then running a second
expiration to keeping N versions after that.

Note: If versions are defined and there are less than N versions in the
history, then there were not enough commits (within that timespan if
defined). Snapshot expiration will always keep at least 1 version.

The default behavior of snapshot expiration job will remain the same,
keep snapshots within the last 3 days.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants