Skip to content

Column name mapping support #192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

Tishj
Copy link
Collaborator

@Tishj Tishj commented Apr 17, 2025

This PR implements #147

@Tishj Tishj requested a review from Tmonster April 24, 2025 07:02
Copy link
Collaborator

@Tmonster Tmonster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thanks!
Couple of questions,

Is there no way to get the tests working with our current test infra?

Can we also check renaming of columns to different (previously valid names?) and that the contents of the tables is consistent? Something like the following tests should be good

check column drop also drops values

create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 drop column b;
alter table t1 add column b;
# in test
select count(*) from t1 where b is null

check column rename to other (previously valid) column name also preserves correct values.

create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 rename column a to c;
# in test
select c from t1;
# verify selecting a is an error
select a from t1;
create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 drop column b;
alter table t1 rename column a to b;
# in test
# verify a is the range column
# verify selecting a results in error
select a from t1;

@@ -0,0 +1 @@
{"location":"data/column_mapping/warehouse/default.db/my_table","table-uuid":"199c7f6d-3808-4a95-84db-f3cf06322240","last-updated-ms":1744881705813,"last-column-id":3,"schemas":[{"type":"struct","fields":[{"id":1,"name":"id","type":"long","required":false},{"id":2,"name":"name","type":"string","required":false},{"id":3,"name":"age","type":"long","required":false}],"schema-id":0,"identifier-field-ids":[]}],"current-schema-id":0,"partition-specs":[{"spec-id":0,"fields":[]}],"default-spec-id":0,"last-partition-id":999,"properties":{},"snapshots":[],"snapshot-log":[],"metadata-log":[],"sort-orders":[{"order-id":0,"fields":[]}],"default-sort-order-id":0,"refs":{},"statistics":[],"format-version":2,"last-sequence-number":0}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could you put these files in data/persistent/column_mapping?

@Tishj
Copy link
Collaborator Author

Tishj commented Apr 28, 2025

Awesome thanks! Couple of questions,

Is there no way to get the tests working with our current test infra?

Can we also check renaming of columns to different (previously valid names?) and that the contents of the tables is consistent? Something like the following tests should be good

check column drop also drops values

create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 drop column b;
alter table t1 add column b;
# in test
select count(*) from t1 where b is null

check column rename to other (previously valid) column name also preserves correct values.

create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 rename column a to c;
# in test
select c from t1;
# verify selecting a is an error
select a from t1;
create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 drop column b;
alter table t1 rename column a to b;
# in test
# verify a is the range column
# verify selecting a results in error
select a from t1;

Thanks for the suggestion, I've crafted a test for one of these, but I don't know how relevant these scenarios are for this feature.

All this PR does is fill in field-ids for parquet files that don't have them.
The drop+re-add scenario creates a new schema, where the new b field will have a new field-id (3).
The name-mapping will assign field-id 2 to the parquet file's b column, so it doesn't serve as data for the new b field, because the field-id doesn't match.

Same goes for the rename from a -> c, this happens at the global level (metadata.json), the field-id doesn't change.

And lastly the drop b, rename a -> b is the same as above, only the name in the global schema changes, the name-mapping is applied at the local level, and field-ids are unchanged.

Tishj added 2 commits April 28, 2025 18:57
…ing for the latest snapshot. Only when we are reading a specific snapshot based on id/timestamp should we use 'schema-id' of the snapshot
…zing its not actually 'metadata', its manifest data..
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants