Derive name mapping from table schema when iceberg.id is missing #24795
base: master
ef0d3db to d5f34c2
@@ -87,7 +87,7 @@ public static Metrics fileMetrics(TrinoInputFile file, MetricsConfig metricsConf
 Footer footer = reader.get().getFooter();

 // use name mapping to compute missing Iceberg field IDs
-Optional<NameMapping> nameMapping = Optional.of(MappingUtil.create(schema));
+NameMapping nameMapping = MappingUtil.create(schema);
So if I'm understanding correctly, we have two different codepaths that eventually land at fileColumnsByIcebergId. One from IcebergPageSourceProvider, and one here in OrcMetrics. Prior to this PR, the latter would use MappingUtil to derive the schema, but the former would only use the table properties. And in this PR, we are standardizing so that both codepaths use MappingUtil. Do I have that right?
I'm curious, doesn't this OrcMetrics codepath also need to handle the case where the name mapping is set as a JSON in the table properties? Like why aren't these codepaths completely unified in how they derive a NameMapping from the combination of ORC files / table metadata / table schema?
IIUC OrcMetrics::fileMetrics is only used when migrating a table to Iceberg, so neither the Iceberg table metadata nor the table properties have been created yet. Column IDs are expected to be missing in the ORC files, so we use the name mapping derived from the table schema to fill the gap.
Makes sense, thanks for clarifying!
Please add tests.
d5f34c2 to d90decc
Thanks @weijiii for this PR and for adding the additional context from Iceberg. +1 to this change after adding the tests.
Description
If iceberg.id is present for any file columns, including nested ones, in the ORC data files, the name mapping from table metadata would be ignored. If the ORC data file is malformed such that no iceberg.id is present, querying the table would incorrectly yield NULL for all columns from the data files. Name mapping can be derived from the table schema. If iceberg.id is present in the file columns, name mapping would not affect anything. If iceberg.id is missing in all file columns, the name mapping from table properties should be used to set the missing iceberg.id attributes. If name mapping is not configured in the table properties, one that is derived from the table schema would be used as an alternative.
Additional context and related issues
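The fallback order described above can be illustrated with a small, self-contained sketch. This is not the Trino implementation: the class NameMappingFallbackSketch, the method columnsByIcebergId, and the use of plain maps in place of ORC column metadata and Iceberg's NameMapping are all assumptions made for illustration, and nested columns are not modeled.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; names and types are not taken from Trino or Iceberg.
final class NameMappingFallbackSketch
{
    private NameMappingFallbackSketch() {}

    /**
     * Resolves field ID -> file column name.
     *
     * @param fileColumns column names found in the data file
     * @param idsInFile iceberg.id attributes found in the file (empty if none are present)
     * @param mappingFromTableProperties name -> field ID mapping configured on the table, if any
     * @param schemaFieldIds name -> field ID mapping derived from the table schema
     */
    static Map<Integer, String> columnsByIcebergId(
            List<String> fileColumns,
            Map<String, Integer> idsInFile,
            Optional<Map<String, Integer>> mappingFromTableProperties,
            Map<String, Integer> schemaFieldIds)
    {
        Map<Integer, String> result = new LinkedHashMap<>();

        // Case 1: the file carries IDs, so any name mapping is ignored.
        if (!idsInFile.isEmpty()) {
            for (String column : fileColumns) {
                Integer id = idsInFile.get(column);
                if (id != null) {
                    result.put(id, column);
                }
            }
            return result;
        }

        // Case 2: no IDs in the file. Prefer the configured mapping,
        // otherwise fall back to the one derived from the table schema.
        Map<String, Integer> mapping = mappingFromTableProperties.orElse(schemaFieldIds);
        for (String column : fileColumns) {
            Integer id = mapping.get(column);
            if (id != null) {
                result.put(id, column);
            }
        }
        return result;
    }
}
```

With this precedence, a file without any iceberg.id attributes still resolves its columns through the schema-derived mapping instead of ending up with no matched columns, which is the situation that would surface as NULL values at query time.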
Release notes
(v) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: