Derive name mapping from table schema when iceberg.id is missing #24795

weijiii · 2025-01-24T23:15:56Z

Description

Currently, when iceberg.id is present on any file column in an ORC data file, including nested columns, the name mapping from the table metadata is ignored. If the ORC data file is malformed and carries no iceberg.id at all, querying the table incorrectly yields NULL for every column read from that file. A name mapping can instead be derived from the table schema. With this change, if iceberg.id is present in the file columns, the name mapping has no effect. If iceberg.id is missing from all file columns, the name mapping from the table properties is used to set the missing iceberg.id attributes. If no name mapping is configured in the table properties, one derived from the table schema is used as a fallback.
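
The precedence described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, a name mapping is modeled as a flat column-name-to-ID map, and the table schema is passed in already in that form so that it doubles as the schema-derived mapping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: names and types are illustrative, not Trino or Iceberg APIs.
class NameMappingFallbackSketch
{
    // Returns the field ID to use for each file column (empty when unresolved).
    static Map<String, Optional<Integer>> resolveFieldIds(
            Map<String, Optional<Integer>> fileColumnIds,     // iceberg.id per file column, if present
            Optional<Map<String, Integer>> configuredMapping, // name mapping from table properties, if set
            Map<String, Integer> tableSchema)                 // assumed: column name -> field ID
    {
        // Case 1: the file carries at least one iceberg.id, so no mapping is applied.
        boolean anyIdPresent = fileColumnIds.values().stream().anyMatch(Optional::isPresent);
        if (anyIdPresent) {
            return fileColumnIds;
        }

        // Case 2: prefer the configured mapping.
        // Case 3: otherwise fall back to the mapping derived from the table schema.
        Map<String, Integer> mapping = configuredMapping.orElse(tableSchema);

        Map<String, Optional<Integer>> resolved = new LinkedHashMap<>();
        for (String columnName : fileColumnIds.keySet()) {
            resolved.put(columnName, Optional.ofNullable(mapping.get(columnName)));
        }
        return resolved;
    }

    public static void main(String[] args)
    {
        Map<String, Integer> tableSchema = Map.of("id", 1, "name", 2);

        // A malformed file: no column carries an iceberg.id attribute.
        Map<String, Optional<Integer>> malformedFile = new LinkedHashMap<>();
        malformedFile.put("id", Optional.empty());
        malformedFile.put("name", Optional.empty());

        System.out.println(resolveFieldIds(malformedFile, Optional.empty(), tableSchema));
    }
}
```

In this sketch, a column that the chosen mapping does not know stays unresolved, which corresponds to the column being read as NULL.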

Additional context and related issues

Iceberg does something similar for Avro tables when a name mapping is not present [1].

Release notes

(v) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

xkrogen · 2025-01-24T23:59:36Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/util/OrcMetrics.java

@@ -87,7 +87,7 @@ public static Metrics fileMetrics(TrinoInputFile file, MetricsConfig metricsConf
            Footer footer = reader.get().getFooter();

            // use name mapping to compute missing Iceberg field IDs
-            Optional<NameMapping> nameMapping = Optional.of(MappingUtil.create(schema));
+            NameMapping nameMapping = MappingUtil.create(schema);


So if I'm understanding correctly, we have two different codepaths that eventually land at fileColumnsByIcebergId: one from IcebergPageSourceProvider, and one here in OrcMetrics. Prior to this PR, the latter would use MappingUtil to derive the name mapping from the schema, but the former would only use the table properties. And in this PR, we are standardizing so that both codepaths use MappingUtil. Do I have that right?

I'm curious, doesn't this OrcMetrics codepath also need to handle the case where the name mapping is set as a JSON in the table properties? Like why aren't these codepaths completely unified in how they derive a NameMapping from the combination of ORC files / table metadata / table schema?

weijiii · 2025-01-26

IIUC, OrcMetrics::fileMetrics is only used when migrating a table to Iceberg, so neither the Iceberg table metadata nor the table properties exist yet. Column IDs are expected to be missing from the ORC files, so we use the name mapping derived from the table schema to fill the gap.

xkrogen · 2025-01-27

Makes sense, thanks for clarifying!
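
For readers following the exchange above, here is a rough sketch of what deriving a name mapping from a table schema amounts to, including nested columns. It is not code from this PR and it does not use the Iceberg MappingUtil API: the Field class, the deriveMapping method, and the flattened dotted-path representation are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a simplified stand-in for schema-derived name mapping.
class SchemaDerivedMappingSketch
{
    // Hypothetical schema field: an ID, a name, and any nested child fields.
    static final class Field
    {
        final int id;
        final String name;
        final List<Field> children;

        Field(int id, String name, List<Field> children)
        {
            this.id = id;
            this.name = name;
            this.children = children;
        }
    }

    // Walks the schema and records each column path (dot-separated) with its field ID.
    static Map<String, Integer> deriveMapping(List<Field> fields)
    {
        Map<String, Integer> mapping = new LinkedHashMap<>();
        collect("", fields, mapping);
        return mapping;
    }

    private static void collect(String prefix, List<Field> fields, Map<String, Integer> mapping)
    {
        for (Field field : fields) {
            String path = prefix.isEmpty() ? field.name : prefix + "." + field.name;
            mapping.put(path, field.id);
            // Nested columns also need IDs, so recurse into the children.
            collect(path, field.children, mapping);
        }
    }
}
```

During migration the source files have no IDs to contradict such a mapping, which is why a schema-derived mapping is enough on that codepath.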

ebyhr

Please add tests.

wmoustafa · 2025-01-29T19:30:05Z

Thanks @weijiii for this PR and for adding the additional context from Iceberg. +1 to this change after adding the tests.

cla-bot bot added the cla-signed label Jan 24, 2025

github-actions bot added the iceberg Iceberg connector label Jan 24, 2025

weijiii force-pushed the iceberg-orc-infer-name-mapping branch from ef0d3db to d5f34c2 Compare January 24, 2025 23:39

xkrogen reviewed Jan 24, 2025

ebyhr reviewed Jan 25, 2025

Derive name mapping from schema when ORC iceberg.id is missing

d90decc

weijiii force-pushed the iceberg-orc-infer-name-mapping branch from d5f34c2 to d90decc Compare January 26, 2025 00:52
