[Spark] Test type widening compatibility with other Delta features #3053

Merged
merged 3 commits into from
May 23, 2024

Conversation

johanl-db
Collaborator

@johanl-db johanl-db commented May 6, 2024

Description

Additional tests covering type widening and:

  • Reading CDF
  • Column mapping
  • Time travel
  • RESTORE
  • CLONE

How was this patch tested?

Test only

@johanl-db johanl-db force-pushed the more-type-widening-tests branch 3 times, most recently from cd9c3e6 to d5ed864 Compare May 7, 2024 09:02
@johanl-db johanl-db self-assigned this May 8, 2024
Collaborator

@tomvanbussel tomvanbussel left a comment

Added a few suggestions for additional tests. Do we already have tests for constraints and generated columns? AFAIK we disallow type changes on generated columns, but constraints might be interesting, since it's possible to create a constraint after changing the type.

@johanl-db
Collaborator Author

> Added a few suggestions for additional tests. Do we already have tests for constraints and generated columns? AFAIK we disallow type changes on generated columns, but constraints might be interesting, since it's possible to create a constraint after changing the type.

Tests for generated columns and constraints already exist; they were added as part of #2881.

I added a test here to cover constraints + RESTORE, which is an interesting case:

  1. Change column a type from byte to int
  2. Add a CHECK constraint on column a.
  3. RESTORE table to before the type change.

The constraint was added while column a had type int, and after RESTORE the column has type byte again. This still works, because CHECK constraints are part of the table metadata, so the constraint is removed by RESTORE along with the type change.
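
To make the sequence above concrete, here is a minimal toy model in Python. It is not Delta's implementation or API: `ToyTable`, `Metadata`, and the constraint name `a_small` are hypothetical, and the only assumption it encodes is the one stated above, namely that the schema and CHECK constraints live together in the table metadata and RESTORE brings back the metadata of an earlier version.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Toy stand-in for table metadata: column types plus CHECK constraints."""
    schema: dict
    constraints: dict = field(default_factory=dict)


class ToyTable:
    """Hypothetical table that keeps one metadata snapshot per version."""

    def __init__(self, schema):
        self.versions = [Metadata(dict(schema))]

    @property
    def current(self):
        return self.versions[-1]

    def change_type(self, column, new_type):
        # A type change commits a new version with an updated schema.
        schema = {**self.current.schema, column: new_type}
        self.versions.append(Metadata(schema, dict(self.current.constraints)))

    def add_constraint(self, name, expr):
        # Adding a CHECK constraint also commits a new metadata version.
        constraints = {**self.current.constraints, name: expr}
        self.versions.append(Metadata(dict(self.current.schema), constraints))

    def restore(self, version):
        # RESTORE commits a new version that reuses the old metadata as a whole.
        old = self.versions[version]
        self.versions.append(Metadata(dict(old.schema), dict(old.constraints)))


table = ToyTable({"a": "byte"})               # version 0
table.change_type("a", "int")                 # version 1
table.add_constraint("a_small", "a < 1000")   # version 2
table.restore(0)                              # version 3

print(table.current.schema)
print(table.current.constraints)
```

In this model the restored version has column a back at type byte and no constraints, so a constraint defined against the int column never has to be evaluated against byte data.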

Collaborator

@tomvanbussel tomvanbussel left a comment

LGTM!

@tdas tdas merged commit 039a29a into delta-io:master May 23, 2024
10 checks passed
longvu-db pushed a commit to longvu-db/delta that referenced this pull request May 28, 2024
…elta-io#3053)
longvu-db pushed a commit to longvu-db/delta that referenced this pull request May 30, 2024
…elta-io#3053)
vkorukanti pushed a commit that referenced this pull request Jun 5, 2024
…iles to rewrite (#3155)

## What changes were proposed in this pull request?
The initial approach to identify files that contain a type that differs
from the table schema and that must be rewritten before dropping the
type widening table feature is convoluted and turns out to be more
brittle than intended.

This change switches instead to directly reading the file schema from
the Parquet footer and rewriting all files that have a mismatching type.
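
As a rough sketch of the footer-based selection described above (not the code from this change): assume each file's footer schema has already been read into a plain dict of column name to type name. The function `files_to_rewrite`, the file names, and the type names are hypothetical.

```python
def files_to_rewrite(table_schema, file_schemas):
    """Return the files whose footer schema has a column type that
    differs from the type recorded in the table schema."""
    stale = []
    for path, file_schema in file_schemas.items():
        mismatch = any(
            column in table_schema and table_schema[column] != file_type
            for column, file_type in file_schema.items()
        )
        if mismatch:
            stale.append(path)
    return sorted(stale)


table_schema = {"a": "int", "b": "string"}

# Hypothetical footer schemas; in practice these would come from the
# Parquet footer of each data file.
footers = {
    "part-0.parquet": {"a": "byte", "b": "string"},  # written before widening
    "part-1.parquet": {"a": "int", "b": "string"},   # written after widening
    "part-2.parquet": {"a": "int"},                  # column b not present
}

print(files_to_rewrite(table_schema, footers))  # ['part-0.parquet']
```

The decision depends only on what each file actually contains, so it does not rely on any per-file bookkeeping in the table metadata.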

### Additional Context
In the initial approach, files are identified using their default row
commit version (a part of the row tracking feature) and matched against
type changes previously applied to the table and recorded in the table
metadata: any file written before the latest type change should use a
different type and must be rewritten.

This requires multiple pieces of information to be accurately tracked:
- Default row commit versions must be correctly assigned to all files.
E.g. files that are copied over without modification must never be
assigned a new default row commit version. On the other hand, default
row commit versions are preserved across CLONE but these versions don't
match anything in the new cloned table.
- Type change history must be reliably recorded and preserved across
schema changes, e.g. column mapping.

Any bug will likely lead to files not being correctly rewritten before
removing the table feature, potentially leaving the table in an
unreadable state.
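
For illustration only, the version-based selection described in this section can be sketched as follows. The function name, file names, and version numbers are hypothetical, and the strict `<` comparison is an assumption about what "written before the latest type change" means.

```python
def files_to_rewrite_by_version(default_row_commit_versions, type_change_versions):
    """Select files whose default row commit version is older than the
    latest recorded type change."""
    if not type_change_versions:
        return []
    latest_change = max(type_change_versions)
    return sorted(
        path
        for path, version in default_row_commit_versions.items()
        if version < latest_change
    )


# Source table: the type change happened at version 5.
source_files = {"f1": 3, "f2": 7}
print(files_to_rewrite_by_version(source_files, [5]))  # ['f1']

# Hypothetical CLONE scenario: the files keep versions 3 and 7 from the
# source table, but the clone applies its own type change at version 1.
# Both files predate that change, yet neither is selected.
cloned_files = {"f1": 3, "f2": 7}
print(files_to_rewrite_by_version(cloned_files, [1]))  # []
```

The second call shows the kind of mismatch described in the first bullet: versions carried over from the source table are compared against a history they do not belong to, so files that still hold the old type can be skipped.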



## How was this patch tested?
Tests added in previous PR to cover CLONE and RESTORE:
#3053
Tests added and updated in this PR to cover rewriting files with
different column types when removing the table feature.