Core, Data: File Format API interfaces#12774
Conversation
liurenjie1024
left a comment
There was a problem hiding this comment.
Thanks @pvary for this pr, left some comments, genearlly looks great!
data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/AppenderBuilder.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/PositionDeleteWriterBuilder.java
Outdated
Show resolved
Hide resolved
liurenjie1024
left a comment
There was a problem hiding this comment.
Thanks @pvary for this pr, LGTM!
data/src/main/java/org/apache/iceberg/data/AppenderBuilder.java
Outdated
Show resolved
Hide resolved
liurenjie1024
left a comment
There was a problem hiding this comment.
LGTM, just some nits!
data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java
Outdated
Show resolved
Hide resolved
stevenzwu
left a comment
There was a problem hiding this comment.
left some initial comments on the interfaces. will still need to take a look at the other bigger PR to understand more on the work as a whole.
data/src/main/java/org/apache/iceberg/data/ObjectModelRegistry.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/FileWriteBuilderBase.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/PositionDeleteWriteBuilder.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/ContentFileWriteBuilderImpl.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/ContentFileWriteBuilder.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/ContentFileWriteBuilderImpl.java
Outdated
Show resolved
Hide resolved
data/src/main/java/org/apache/iceberg/data/FileAccessFactoryRegistry.java
Outdated
Show resolved
Hide resolved
|
@huaxingao, @pvary Could you take a look from a comet prospective? I know you have some custom code that would be using this as well |
Originally I thought that the |
…ppender<D> build()' instead '<D> FileAppender<D> build()'
646d9f7 to
d450ec6
Compare
…ed by the FormatModel implementations
core/src/main/java/org/apache/iceberg/formats/CommonWriteBuilder.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/deletes/PositionDeleteWriter.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/formats/FormatModelRegistry.java
Outdated
Show resolved
Hide resolved
|
Merged the changes from #12298. The remaining open questions are highlighted:
Based on @rdblue’s comment on the other PR, the change is “about ready to go in.” However, as it has changed somewhat compared to earlier iterations, previous reviewers may want to take a final look. Thanks for everyone spending time on this! |
core/src/main/java/org/apache/iceberg/formats/BaseFormatModel.java
Outdated
Show resolved
Hide resolved
| * Creates a writer for the given schemas. | ||
| * | ||
| * @param icebergSchema the Iceberg schema defining the table structure | ||
| * @param fileSchema the file format specific target schema for the output files |
There was a problem hiding this comment.
Not a blocker, but I want to note it somewhere: the file schema will be converted from the Iceberg Schema and will directly match in structure, names, and equivalent types. We should probably document this and make sure that all of the format builders follow that rule.
Similarly, we should also think about guarantees and/or requirements of the engine schema. I think this is dependent on the engine, though. If an engine builds its integration so that names must match or positions must match, that's up to the engine.
There was a problem hiding this comment.
Do you think we should add it here too, or it is enough as we describe it at the writer and the reader?
FileWriterBuilder.engineSchema
* <p>The engine schema must be aligned with the Iceberg schema, but may include representation
* details that Iceberg considers equivalent.
ReadBuilder.engineProjection
* <p>When provided, this schema should be consistent with the requested Iceberg projection, while
* allowing representation differences. Examples include:
There was a problem hiding this comment.
I think this should state how the file schema is derived because it is always a direct translation from the Iceberg schema and the structure and names match.
For the engine schema, I think mentioning that it is engine-specific is the right thing to do.
core/src/main/java/org/apache/iceberg/formats/BaseFormatModel.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/formats/BaseFormatModel.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/formats/FileWriterBuilder.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/formats/FileWriterBuilder.java
Outdated
Show resolved
Hide resolved
| * <p>Some data types require additional type information from the engine schema that cannot be | ||
| * fully expressed by the Iceberg schema or the data itself. For example, a variant type may use a | ||
| * shredded representation that relies on engine-specific metadata to map back to the Iceberg | ||
| * schema. |
There was a problem hiding this comment.
I think a simple example (tinyint / int) would help as well.
There was a problem hiding this comment.
Added a sentence about the ints
| * schema. | ||
| * | ||
| * <p>The engine schema must be aligned with the Iceberg schema, but may include representation | ||
| * details that Iceberg considers equivalent. |
There was a problem hiding this comment.
This is a good way to state the requirement.
There was a problem hiding this comment.
Is it enough here, or shall we add it somewhere else too?
There was a problem hiding this comment.
I think it's fine here. Engine schema is really a contract between the engine and its registered object model.
| * | ||
| * <p>This registry provides access to {@link ReadBuilder}s for data consumption and {@link | ||
| * FileWriterBuilder}s for writing various types of Iceberg content files. The appropriate builder | ||
| * is selected based on {@link FileFormat} and object model name. |
| * <p>{@link FormatModel} objects are registered through {@link #register(FormatModel)} and used for | ||
| * creating readers and writers. Read builders are returned directly from the factory. Write | ||
| * builders may be wrapped in specialized content file writer implementations depending on the | ||
| * requested builder type. |
There was a problem hiding this comment.
Not sure that this part about the write builders being wrapped needs to be mentioned here. What about "Readers and writers are created using builders from the static factory methods of this class."
|
I think this is ready and from our regular syncs as well as the discussion on the dev list, I don't think that there are any remaining blockers so I'll go ahead and merge it. That will unblock the file format and engine integrations, which can all happen in parallel after this core work is done. Thanks @pvary! |
|
Opened the follow-up PRs:
I would appreciate reviews on them. Thanks, |
The interface part of the changes from #12298
Interfaces which have to be implemented by the File Formats:
ReadBuilder- Builder for reading data from data filesAppenderBuilder- Builder for writing data to data filesObjectModel- Providing ReadBuilders, and AppenderBuilders for the specific data file format and object model pairInterfaces which are used by the actual readers/writers:
AppenderBuilder- Builder for writing a fileDataWriterBuilder- Builder for generating a data filePositionDeleteWriterBuilder- Builder for generating a position delete fileEqualityDeleteWriterBuilder- Builder for generating an equality delete fileReadBuilderhere - the file format reader builder is reusedImplementation classes tying them together:
WriterBuilderclass which implements the AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder interfaces using the AppenderBuilder provided by the File Format itselfObjectModelRegistrywhich stores the availableObjectModelsand users could request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) from.