-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(DLT-PLUS-105): factor table format filesystem destinations to less bound modules #2419
base: devel
Are you sure you want to change the base?
feat(DLT-PLUS-105): factor table format filesystem destinations to less bound modules #2419
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
|
def _iceberg_table(self) -> "pyiceberg.table.Table": # type: ignore[name-defined] # noqa: F821 | ||
from dlt.common.libs.pyiceberg import get_catalog | ||
|
||
catalog = get_catalog( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our goal is to be able to plug any catalog here, not just use ephemeral one. this is actually hardcoded.
so the idea was to add a method:
def get_catalog():
from dlt.common.libs.pyiceberg import get_catalog
return get_catalog(
client=self._job_client,
table_name=self.load_table_name,
schema=self.arrow_dataset.schema,
partition_columns=self._partition_columns,
)
that you can override in dlt+ to plug any other catalog
LOAD_TABLE_WRITE_DISPOSITION_KEY: Final = "write_disposition" | ||
|
||
|
||
class HasSourceFilePaths(Protocol): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the simplification here. It generates a lot of duplicated code in concrete classes. This is probably a good abstraction but why not place this in a base class? I know you argued against it but here I have 5 protocols instead of single base class where all methods works together.
I'm saying this is wrong, I just need some clarifications
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The format specific classes only depend on RunnableJob
and the definitions in the module's common.py
file. Because Delta, Iceberg, and other table formats like Hudi are more dissimilar beyond their shared use of filesystems as an abstraction for a storage layer, I would anticipate that their implementations deviate more as features are added to dlt for them and the maintenance of a common parent that is less general than RunnableJob
becomes cumbersome.
Description
Ahead of changes to allow more support for Apache Iceberg catalogs, refactor the table format filesystem destinations (Apache Iceberg and LFAI Delta currently) to be less coupled with bare (having no format specification) filesystem implementation details. The behavior of the filesystem destination has not been modified in anyway.
Additional Context
This change is in part to demonstrate further simplifications that could be made throughout the project by preferring structural sub typing as described in PEP 544 to compose more generic module level definitions. Today, the project makes use of multiple inheritance with a large amount of the parent class overridden by child implementations and interstitial layers of mixin classes which bind implementations excessively making changes difficult.
The example of note in this change is the elimination of the
TableFormatFilesystemLoadJob
which inherits fromFilesystemLoadJob
only to override almost all of the implementation ofFilesystemLoadJob
which itself inherits fromRunnableLoadJob
. This multiple inheritance makes it difficult to understand what the few stubs attached toIcebergFilesystemLoadJob
do in context and difficult to identify the right layer at which to extend implementation in support of new features. Instead this change sacrifices a small amount of boilerplate added by eliminating theTableFormatFilesystemLoadJob
for the gain of far fewer nested definitions, protocol definitions which describe the role that a load job or filesystem client instance plays in, for example, the construction of a remote path, and makes clear where to extend a class's implementation to enable new features.After this change has been merged, it will be followed by refactoring specific to the Iceberg load job in furtherance of the dlt+ ticket for broader Iceberg catalog support.