Column name mapping support #192

Tishj · 2025-04-17T10:31:10Z

This PR implements #147

Tmonster

Awesome, thanks!
A couple of questions:

Is there no way to get the tests working with our current test infra?

Can we also check renaming columns to different (previously valid) names, and that the contents of the tables are consistent? Something like the following tests should be good.

Check that a column drop also drops the values:

create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 drop column b;
alter table t1 add column b double;
# in test: the re-added column b should be null for every row
select count(*) from t1 where b is null;

Check that renaming a column to another (previously valid) column name also preserves the correct values:

create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 rename column a to c;
# in test
select c from t1;
# verify selecting a is an error
select a from t1;

create table t1 as select range a, random()*1000 b from range(10000);
alter table t1 drop column b;
alter table t1 rename column a to b;
# in test
# verify b is the range column
select b from t1;
# verify selecting a results in an error
select a from t1;

Tmonster · 2025-04-26T01:46:05Z

...ehouse/default.db/my_table/metadata/00000-aa288ee6-3a21-4880-a1a4-91333f273075.metadata.json

@@ -0,0 +1 @@
+{"location":"data/column_mapping/warehouse/default.db/my_table","table-uuid":"199c7f6d-3808-4a95-84db-f3cf06322240","last-updated-ms":1744881705813,"last-column-id":3,"schemas":[{"type":"struct","fields":[{"id":1,"name":"id","type":"long","required":false},{"id":2,"name":"name","type":"string","required":false},{"id":3,"name":"age","type":"long","required":false}],"schema-id":0,"identifier-field-ids":[]}],"current-schema-id":0,"partition-specs":[{"spec-id":0,"fields":[]}],"default-spec-id":0,"last-partition-id":999,"properties":{},"snapshots":[],"snapshot-log":[],"metadata-log":[],"sort-orders":[{"order-id":0,"fields":[]}],"default-sort-order-id":0,"refs":{},"statistics":[],"format-version":2,"last-sequence-number":0}


nit: could you put these files in data/persistent/column_mapping?

kevinjqliu · 2025-04-26T17:37:05Z

Here's where you can find some tests related to name-mapping:
https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/mapping/TestMappingUpdates.java
https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestSchemaAndMappingUpdate.java

Tishj · 2025-04-28T13:19:57Z

Thanks for the suggestion. I've crafted a test for one of these, but I'm not sure how relevant these scenarios are to this feature.

All this PR does is fill in field-ids for Parquet files that don't have them.
The drop + re-add scenario creates a new schema, in which the new b field gets a new field-id (3).
The name-mapping assigns field-id 2 to the Parquet file's b column, so that column doesn't serve as data for the new b field, because the field-ids don't match.

The same goes for the rename from a -> c: it happens at the global level (metadata.json), and the field-id doesn't change.

Lastly, drop b followed by rename a -> b works the same way: only the name in the global schema changes, the name-mapping is applied at the local (data file) level, and the field-ids are unchanged.
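
To make the three scenarios concrete, here is a minimal Python sketch of the reasoning above. It is not the extension's code: `apply_name_mapping` and `resolve` are hypothetical helpers, nested `fields` in the mapping are ignored, and the name-mapping is assumed to stay as it was when the file was written.

```python
# Hypothetical sketch of field-id resolution; names are illustrative only.

def apply_name_mapping(file_columns, name_mapping):
    """Assign field-ids to data-file columns that lack them, matching by name."""
    by_name = {
        name: entry["field-id"]
        for entry in name_mapping
        for name in entry["names"]
    }
    # field-id -> column name inside the data file
    return {by_name[col]: col for col in file_columns if col in by_name}


def resolve(schema_fields, file_columns, name_mapping):
    """For each table column, return the file column with the same field-id,
    or None if the file has no data for it."""
    id_to_file_col = apply_name_mapping(file_columns, name_mapping)
    return {f["name"]: id_to_file_col.get(f["id"]) for f in schema_fields}


# A Parquet file written without field-ids, containing columns a and b.
file_columns = ["a", "b"]
name_mapping = [
    {"field-id": 1, "names": ["a"]},
    {"field-id": 2, "names": ["b"]},
]

# Drop b, then add b again: the new b gets field-id 3, so the old data is not used.
drop_readd = resolve(
    [{"id": 1, "name": "a"}, {"id": 3, "name": "b"}], file_columns, name_mapping
)

# Rename a -> c: field-id 1 is unchanged, so c still reads the file's a column.
rename = resolve(
    [{"id": 1, "name": "c"}, {"id": 2, "name": "b"}], file_columns, name_mapping
)

# Drop b, rename a -> b: the table's b (field-id 1) reads the file's a column.
drop_rename = resolve([{"id": 1, "name": "b"}], file_columns, name_mapping)
```

In this sketch the table schema is only ever matched to file columns through field-ids, which is why none of the renames or the re-add can pick up the wrong data.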

Tishj added 6 commits April 9, 2025 13:38

Merge branch 'iceberg_metadata_struct' into column_name_mapping_support

30f6729

add parsing for 'schema.name-mapping.default' from properties

4dafb83

add initial support for column mapping

53a9dd8

Merge remote-tracking branch 'upstream/main' into column_name_mapping_support

8a899bc

add test for column mapping

1c3faa7

improve the method of parsing the field mappings, this can also support nested types and should be more efficient

a8a16d4

Tishj requested a review from Tmonster April 24, 2025 07:02

Tmonster reviewed Apr 26, 2025

Tishj added 3 commits April 28, 2025 14:34

add test with name mapping

727bdbf

fix a bug, use the schema-id from the snapshot instead of the 'current-schema-id' of the metadata.json

607f8a3

whoops, schema should be set from the metadata if no snapshot is available

5e2c5df

Tishj added 2 commits April 28, 2025 18:57

wait actually, using 'current-schema-id' is correct, when we are looking for the latest snapshot. Only when we are reading a specific snapshot based on id/timestamp should we use 'schema-id' of the snapshot

c37f654

thought about adding 'properties' to 'iceberg_metadata', before realizing its not actually 'metadata', its manifest data..

31a0071
