-
Notifications
You must be signed in to change notification settings - Fork 50
Column name mapping support #192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…rt nested types and should be more efficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome thanks!
Couple of questions,
Is there no way to get the tests working with our current test infra?
Can we also check renaming of columns to different (previously valid names?) and that the contents of the tables is consistent? Something like the following tests should be good
check column drop also drops values
create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 drop column b;
alter table t1 add column b;
# in test
select count(*) from t1 where b is null
check column rename to other (previously valid) column name also preserves correct values.
create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 rename column a to c;
# in test
select c from t1;
# verify selecting a is an error
select a from t1;
create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 drop column b;
alter table t1 rename column a to b;
# in test
# verify a is the range column
# verify selecting a results in error
select a from t1;
@@ -0,0 +1 @@ | |||
{"location":"data/column_mapping/warehouse/default.db/my_table","table-uuid":"199c7f6d-3808-4a95-84db-f3cf06322240","last-updated-ms":1744881705813,"last-column-id":3,"schemas":[{"type":"struct","fields":[{"id":1,"name":"id","type":"long","required":false},{"id":2,"name":"name","type":"string","required":false},{"id":3,"name":"age","type":"long","required":false}],"schema-id":0,"identifier-field-ids":[]}],"current-schema-id":0,"partition-specs":[{"spec-id":0,"fields":[]}],"default-spec-id":0,"last-partition-id":999,"properties":{},"snapshots":[],"snapshot-log":[],"metadata-log":[],"sort-orders":[{"order-id":0,"fields":[]}],"default-sort-order-id":0,"refs":{},"statistics":[],"format-version":2,"last-sequence-number":0} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you put these files in data/persistent/column_mapping?
heres where you can find some tests related to name-mapping |
…t-schema-id' of the metadata.json
Thanks for the suggestion, I've crafted a test for one of these, but I don't know how relevant these scenarios are for this feature. All this PR does is fill in field-ids for parquet files that don't have them. Same goes for the rename from a -> c, this happens at the global level ( And lastly the drop b, rename a -> b is the same as above, only the name in the global schema changes, the name-mapping is applied at the local level, and field-ids are unchanged. |
…ing for the latest snapshot. Only when we are reading a specific snapshot based on id/timestamp should we use 'schema-id' of the snapshot
…zing its not actually 'metadata', its manifest data..
This PR implements #147