Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validators for History / Snapshot Retention Policy #259

Merged
merged 21 commits into from
Dec 20, 2024

Conversation

Will-Lo
Copy link
Collaborator

@Will-Lo Will-Lo commented Nov 28, 2024

Summary

Supports defining a policy to control the versions of snapshots tables to retain.
The policy will support 2 types of configurations:

  1. Time-based (e.g. 3 days)
  2. Count-based (e.g. 10 versions)

It can also support a combination of the 2 policies, in which case it will retain versions that fall under both categories (e.g. only keep versions newer than 3 days AND within 10 versions).

This PR also refactors the PolicySpecValidator to accurately depict that it's referencing to partition-based retention of tables. In the future, we should refactor these policy classes to follow some Policy interface for better standardization for future policy support (duplicated interfaces between Replication, Partition Retention, Retain Snapshot).

This PR also defines the maximums for snapshot retention defaults (TODO: make this configurable/static?)

The current maximums defined is 3 days and 100 versions

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Ran on local docker and tested using postman by creating a new table with the following:

{
    "tableId": "t2",
    "databaseId": "d3",
    "baseTableVersion": "INITIAL_VERSION",
    "clusterId": "LocalFSCluster",
    "schema": "{\"type\": \"struct\", \"fields\": [{\"id\": 1,\"required\": true,\"name\": \"id\",\"type\": \"string\"},{\"id\": 2,\"required\": true,\"name\": \"name\",\"type\": \"string\"},{\"id\": 3,\"required\": true,\"name\": \"ts\",\"type\": \"timestamp\"}]}",
    "tableProperties": {
        "key": "value"
    },
    "policies": {
        "sharingEnabled": "true",
        "history": {"maxAge": 1, "granularity": "DAY", "versions": 2}
    }
}

Tested getTable after and confirmed the policy was showing correctly.

Also tested the negative cases (high version count, negative numbers) which worked as well.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@Will-Lo Will-Lo marked this pull request as ready for review December 2, 2024 22:05
@Will-Lo Will-Lo changed the title Add validators for SnapshotRetentionPolicy Add validators for Retain Snapshot Policy Dec 3, 2024
@Will-Lo Will-Lo changed the title Add validators for Retain Snapshot Policy Add validators for Snapshot Retention Policy Dec 3, 2024
@Will-Lo Will-Lo changed the title Add validators for Snapshot Retention Policy Add validators for History / Snapshot Retention Policy Dec 5, 2024
@Will-Lo Will-Lo closed this Dec 6, 2024
@Will-Lo Will-Lo reopened this Dec 6, 2024
Copy link
Collaborator

@autumnust autumnust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first round pass -- some minor points, main thing was about the minimal number of versions.
This is sort of an important check IMO since we've advertising the "Fast recovery" from bad data corruption because of this, comparing to traditional table format.

@maluchari
Copy link
Collaborator

Also tested the negative cases (high version count, negative numbers) which worked as well.

Nice. can we also capture the error messages as part of testing section in this PR?

maluchari
maluchari previously approved these changes Dec 18, 2024
Copy link
Collaborator

@maluchari maluchari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the comments.

Will-Lo added a commit that referenced this pull request Dec 19, 2024
## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

Adds Spark SQL support for configurable table snapshots, which controls
the versioning of the Openhouse tables.
Syntax is similar to retention but is instead defined as `HISTORY`.

History configuration supports both `MAX_AGE` and `VERSIONS`, where we
retain all table snapshots that live within `MAX_AGE` and within
`VERSIONS`.

Example: A table with `MAX_AGE = 1d` will retain all snapshots that are
within 1 day of when the snapshot retention job last ran.
A table with `VERSIONS = 5` will retain the last 5 snapshots of the
table without considering the age of the snapshots
If both `MAX_AGE = 1d` and `VERSIONS = 5` is defined, keep the last 5
snapshots within the last day. Note: If there are less than 5 snapshots,
then there were less than 5 commits done in the past day.

`MAX_AGE` and `VERSIONS` cannot be defined as less than 1.
The default maximums of `MAX_AGE` and `VERSIONS` defined in
#259 are 3 days and 100
versions respectively.

Examples:
```
ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=24H VERSIONS=10)
ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=3D)
ALTER TABLE <db>.<table> SET POLICY (HISTORY VERSIONS=5)
```

## Changes

- [x] Client-facing API Changes
- [x] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [x] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

Tested on local docker running spark:
Tested setting both policies
```
scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=3d VERSIONS=3 )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+-----------------------------------------------------------------------------------------------------------------+
|key     |value                                                                                                            |
+--------+-----------------------------------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 3,
    "granularity": "DAY",
    "versions": 3
  }
}|
+--------+-----------------------------------------------------------------------------------------------------------------+

```

Setting only versions:
```
scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY VERSIONS=20 )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+----------------------------------------------------------------------------------------+
|key     |value                                                                                   |
+--------+----------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 0,
    "versions": 20
  }
}|
+--------+----------------------------------------------------------------------------------------+
```

Setting only max age
```
scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=8h )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+------------------------------------------------------------------------------------------------------------------+
|key     |value                                                                                                             |
+--------+------------------------------------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 8,
    "granularity": "HOUR",
    "versions": 0
  }
}|
+--------+------------------------------------------------------------------------------------------------------------------+
```

Also tested negative cases (invalid numbers, past maximums defined in
#259)
e.g.

```
scala> spark.sql("ALTER TABLE openhouse.db.tb SET POLICY ( HISTORY MAX_AGE=1h VERSIONS=2 )").show
org.apache.iceberg.exceptions.BadRequestException: 400 , {"status":"BAD_REQUEST","error":"Bad Request","message":" : History for the table [LocalHadoopCluster.db.tb] max age must be between 1 to 3 days","stacktrace":null,"cause":"Not Available"}


```

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

---------

Co-authored-by: Stas Pak <[email protected]>
Copy link
Collaborator

@autumnust autumnust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@Will-Lo Will-Lo merged commit 92bce12 into linkedin:main Dec 20, 2024
1 check passed
Will-Lo added a commit that referenced this pull request Jan 6, 2025
## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

[[Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly
discuss the summary of the changes made in this
pull request in 2-3 lines.

Following up from #262 and
#259
this PR adds support for snapshot expiration table maintenance job to
use the history policies defined.

Most notably snapshot expiration will follow the settings of `maxAge`,
`granularity`, and `versions` as follows:
1. If maxAge is provided, it will expire snapshots older than maxAge in
granularity timeunit.
2. If versions is provided, it will retain the last versions snapshots
regardless of their age.
3. If both are provided, it will prioritize maxAge; only retain up to
versions number of snapshots younger than the maxAge. This is done by
pruning the snapshots older than maxAge, and then running a second
expiration to keeping N versions after that.

Note: If versions are defined and there are less than N versions in the
history, then there were not enough commits (within that timespan if
defined). Snapshot expiration will always keep at least 1 version.

The default behavior of snapshot expiration job will remain the same,
keep snapshots within the last 3 days.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants