Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(DLT-PLUS-105): factor table format filesystem destinations to less bound modules #2419

Open
wants to merge 1 commit into
base: devel
Choose a base branch
from

Conversation

jfahne
Copy link
Collaborator

@jfahne jfahne commented Mar 18, 2025

Description

Ahead of changes to allow more support for Apache Iceberg catalogs, refactor the table format filesystem destinations (Apache Iceberg and LFAI Delta currently) to be less coupled with bare (having no format specification) filesystem implementation details. The behavior of the filesystem destination has not been modified in anyway.

Additional Context

This change is in part to demonstrate further simplifications that could be made throughout the project by preferring structural sub typing as described in PEP 544 to compose more generic module level definitions. Today, the project makes use of multiple inheritance with a large amount of the parent class overridden by child implementations and interstitial layers of mixin classes which bind implementations excessively making changes difficult.

The example of note in this change is the elimination of the TableFormatFilesystemLoadJob which inherits from FilesystemLoadJob only to override almost all of the implementation of FilesystemLoadJob which itself inherits from RunnableLoadJob. This multiple inheritance makes it difficult to understand what the few stubs attached to IcebergFilesystemLoadJob do in context and difficult to identify the right layer at which to extend implementation in support of new features. Instead this change sacrifices a small amount of boilerplate added by eliminating the TableFormatFilesystemLoadJob for the gain of far fewer nested definitions, protocol definitions which describe the role that a load job or filesystem client instance plays in, for example, the construction of a remote path, and makes clear where to extend a class's implementation to enable new features.

After this change has been merged, it will be followed by refactoring specific to the Iceberg load job in furtherance of the dlt+ ticket for broader Iceberg catalog support.

Copy link

netlify bot commented Mar 18, 2025

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 3d3caef
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/67d983473b8a9f0008eb98bf

@rudolfix rudolfix added the ci from fork run ci workflows on a pr even if they are from a fork label Mar 18, 2025
def _iceberg_table(self) -> "pyiceberg.table.Table": # type: ignore[name-defined] # noqa: F821
from dlt.common.libs.pyiceberg import get_catalog

catalog = get_catalog(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our goal is to be able to plug any catalog here, not just use ephemeral one. this is actually hardcoded.

so the idea was to add a method:

def get_catalog():
   from dlt.common.libs.pyiceberg import get_catalog

    return get_catalog(
            client=self._job_client,
            table_name=self.load_table_name,
            schema=self.arrow_dataset.schema,
            partition_columns=self._partition_columns,
        )

that you can override in dlt+ to plug any other catalog

LOAD_TABLE_WRITE_DISPOSITION_KEY: Final = "write_disposition"


class HasSourceFilePaths(Protocol):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the simplification here. It generates a lot of duplicated code in concrete classes. This is probably a good abstraction but why not place this in a base class? I know you argued against it but here I have 5 protocols instead of single base class where all methods works together.
I'm saying this is wrong, I just need some clarifications

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The format specific classes only depend on RunnableJob and the definitions in the module's common.py file. Because Delta, Iceberg, and other table formats like Hudi are more dissimilar beyond their shared use of filesystems as an abstraction for a storage layer, I would anticipate that their implementations deviate more as features are added to dlt for them and the maintenance of a common parent that is less general than RunnableJob becomes cumbersome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci from fork run ci workflows on a pr even if they are from a fork
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants