Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply hints for nested tables #2165

Open
wants to merge 31 commits into
base: devel
Choose a base branch
from
Open

Conversation

steinitzu
Copy link
Contributor

Description

Draft of nested table hints implementation:

apply_hints(path=['a', 'b', 'c'], columns=...)

Is working so far but there are some bugs and tests needed.

Related Issues

Additional Context

Copy link

netlify bot commented Dec 19, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 35131f6
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/67d8a213b92065000888a3e0
😎 Deploy Preview https://deploy-preview-2165--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

zilto added 2 commits January 28, 2025 18:52
Adding this type annotation fixed 69 failing tests. The missing Optional
impacted the dlt.common.validation.validate_dict().validate_prop()
functions to parse the RESTAPIConfig object
@zilto zilto marked this pull request as ready for review January 29, 2025 01:42
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my suggestion how to deal with naming convention. Docs requirements are in the ticket.

@zilto zilto force-pushed the define-hints-nested-tables branch from bef7a3f to e51bd25 Compare January 29, 2025 19:01
@zilto
Copy link
Collaborator

zilto commented Jan 29, 2025

The current implementation adds the tables to the schema (as tested), but it doesn't affect how the data is loaded.

For example, the hints will appear in

pipeline.default_schema.tables.keys()
# ignoring the dlt tables
# 'nested_data', 'override_child_outer1', 'override_child_outer1_innerfoo','nested_data__outer1', 'nested_data__outer1__innerfoo'

Whereas the normalizer row counts show no ingested data for the tables

pipeline.last_trace.last_normalize_info.row_counts
# 'nested_data': 2, 'nested_data__outer1': 2, 'nested_data__outer1__innerfoo': 2

I believe changes need to be made to Extractor._write_to_dynamic_table() and _get_dynamic_table_name() to push data to the right table. (Extractor._write_to_static_table() should rely on the explicitly provided table name).

The extractor would need to hold some mapping, but it could be more appropriate to move the logic to dlt.common.normalizers.json.helpers or to a Schema method?

@rudolfix
Copy link
Collaborator

Relational normalizer follows its logic of creating nested tables and column names. it comes only from the data. there's no mechanism to rename those, except the root table name which the user must set.

dlt is data first, not schema first. it is counterintuitive if you chose to start your work with schema, not data.

I assume that in example you are giving, you used a custom table name for nested table. If this is not the case ping me on slack. maybe there's a bug somewhere

in the ticket above, there's a note:

You still may allow users to specify table_name on the nested hint. If you do so, you'll need to modify the normalizer so it maps paths to those names. IMO this is for another ticket and bigger overhaul of the schema
prevent following to be set on nested table:
parent_table_name: TTableHintTemplate[str] = None,
incremental: TIncrementalConfig = None,

so I'd say we block setting table name on nested hints (also parent name and incremental do make sense)

@zilto
Copy link
Collaborator

zilto commented Jan 30, 2025

We need a good example (in Examples). my take is: let's go for something advanced:
here's a test test_merge_on_keys_in_schema: this test models the nested json into a set of tables with natural primary keys (not dlt generated ones). it uses some tricks to modify the nested tables. now we should be able to use our nested hints to just decorate resource properly. you are also free to change this test first and run it!
the goal of the example is to show how you can custom model your nested tables.

Working on the requested documentation (#1647), I encounter many edge cases of the current implementation. My understanding is that our goal is for these two snippets to be equivalent

Current test

From tests/load/pipeline/test_merge_disposition.py::test_merge_on_keys_in_schema

@dlt.source(schema=schema)
def ethereum(slice_: slice = None):
    @dlt.resource(
        table_name="blocks",
        write_disposition={"disposition": "merge", "strategy": merge_strategy},
    )
    def data():
        with open(
            "tests/normalize/cases/ethereum.blocks.9c1d9b504ea240a482b007788d5cd61c_2.json",
            "r",
            encoding="utf-8",
        ) as f:
            yield json.load(f) if slice_ is None else json.load(f)[slice_]

    # also modify the child tables (not nested)
    schema_ = dlt.current.source_schema()
    blocks__transactions = schema_.tables["blocks__transactions"]
    blocks__transactions["write_disposition"] = "merge"
    blocks__transactions["x-merge-strategy"] = merge_strategy  # type: ignore[typeddict-unknown-key]
    blocks__transactions["table_format"] = destination_config.table_format

    blocks__transactions__logs = schema_.tables["blocks__transactions__logs"]
    blocks__transactions__logs["write_disposition"] = "merge"
    blocks__transactions__logs["x-merge-strategy"] = merge_strategy  # type: ignore[typeddict-unknown-key]
    blocks__transactions__logs["table_format"] = destination_config.table_format

    return data

Target API

@dlt.source(schema=schema)
def ethereum(slice_: slice = None):
    @dlt.resource(
        table_name="blocks",
        write_disposition={"disposition": "merge", "strategy": "delete-insert"}
    )
    def data():
        with open(
            "./ethereum.blocks.9c1d9b504ea240a482b007788d5cd61c_2.json",
            "r",
            encoding="utf-8",
        ) as f:
            yield json.load(f) if slice_ is None else json.load(f)[slice_]

      
    data.apply_hints(
        path=["transactions"],
        write_disposition="merge",
        additional_table_hints={"x-merge-strategy": "delete-insert"},
        table_format=None,
    )

    data.apply_hints(
        path=["transactions", "logs"],
        write_disposition="merge",
        additional_table_hints={"x-merge-strategy": "delete-insert"},
        table_format=None,
    )
    
    return data

Problem

Currently, inspecting the source via ethereum().schema gives different results. The Current Test shows the hints on the blocks__transactions and the blocks__transactions__logs tables. It is not the case for the Target API.

When in the lifecycle should the hints be assigned? After .apply_hints()? After source creation? After first pipeline run?

Setting parents

  File "/home/tjean/projects/dlthub/dlt/dlt/extract/extract.py", line 353, in _extract_single_source
    extractors[item_format].write_items(
  File "/home/tjean/projects/dlthub/dlt/dlt/extract/extractors.py", line 140, in write_items
    self._write_to_static_table(resource, table_name, items, meta)
  File "/home/tjean/projects/dlthub/dlt/dlt/extract/extractors.py", line 223, in _write_to_static_table
    items = self._compute_and_update_table(resource, table_name, items, meta)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tjean/projects/dlthub/dlt/dlt/extract/extractors.py", line 263, in _compute_and_update_table
    diff_table = utils.diff_table(self.schema.name, existing_table, computed_table)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tjean/projects/dlthub/dlt/dlt/common/schema/utils.py", line 461, in diff_table
    ensure_compatible_tables(schema_name, tab_a, tab_b, ensure_columns=False)
  File "/home/tjean/projects/dlthub/dlt/dlt/common/schema/utils.py", line 529, in ensure_compatible_tables
    raise TablePropertiesConflictException(
dlt.common.schema.exceptions.TablePropertiesConflictException: In schema: ethereum: Cannot merge partial tables into table `blocks__transactions` due to property `parent` with different values: "None" != "blocks"

This PR explicitly sets parent when .apply_hints(path=...) is used. This differs from default behavior. How should we address this?

odd bug

When debugging the call to ensure_compatible_tables(), the values magically change before and after the comparison operator (see comments).

def ensure_compatible_tables(
    schema_name: str, tab_a: TTableSchema, tab_b: TPartialTableSchema, ensure_columns: bool = True
) -> None:
    """Ensures that `tab_a` and `tab_b` can be merged without conflicts. Conflicts are detected when

    - tables have different names
    - nested tables have different parents
    - tables have any column with incompatible types

    Note: all the identifiers must be already normalized
    """
    if tab_a["name"] != tab_b["name"]:
        raise TablePropertiesConflictException(
            schema_name, tab_a["name"], "name", tab_a["name"], tab_b["name"]
        )
    table_name = tab_a["name"]
    # check if table properties can be merged
    # breakpoint()
    # tab_a.get("parent"): None
    # tab_b.get("parent"): None 
    if tab_a.get("parent") != tab_b.get("parent"):
        # breakpoint()
        # tab_a.get("parent"): None
        # tab_b.get("parent"): "blocks"
        raise TablePropertiesConflictException(
            schema_name, table_name, "parent", tab_a.get("parent"), tab_b.get("parent")
        )

    if not ensure_columns:
        return

@rudolfix
Copy link
Collaborator

Working on the requested documentation (#1647), I encounter many edge cases of the current implementation. My understanding is that our goal is for these two snippets to be equivalent

right!

When in the lifecycle should the hints be assigned? After .apply_hints()? After source creation? After first pipeline run?

  1. the schema property of source is the input schema that got passed in decorator. no hints are applied to it
  2. "Current test" does a hack and modifies it directly. That's why you see it, but it is not how it should work
  3. To get a schema with applied hints you must call source.discover_schema(). You'll get a schema clone with all hints applied to it. Note: now I see that compute_table_chain is not used there. please fix it if you want to see all your tables in schema clone.
  4. Extract phase uses source input schema, applies all hints to it and then merges changes into existing pipeline schema (from the previous runs). During extract - this all happens on clones, which get comitted to storage when extract method exists. TLDR;> you can get pipeline.default_schema after extract or full run and you should see all the changes
  5. please commit code that raises this weird bug. I'll take look. The "current test" is passing now.

this is pretty deep PR. we'll take time to iron this out. I still have some issues with user interface (ie. @dlt.resource should accept a list of nested resources so users do not need to do many apply_hints). we'll cover it at the end

@rudolfix
Copy link
Collaborator

@zilto I've fixed a few things in the normalizer and I've made a few changes, not sure that is enough for you to take over and finish it. pls take a look and decide. what happened:

  • relational normalizer is able to break nested tables chain if it sees a potentially nested table that does not define a parent hint
  • this creates several "root" tables coming from a single resource
  • I keep nested hints on the ResourceHints (but not on a ResourceNesterHints - so it is not recursive) so we can use table variants and dynamic nested hints
  • biggest change when creating tables from nested hints: "resource" and "parent" hints are added conditionally. we break the nested chain if nested table specifies primary/merge key
  • I also disabled creating intermediate nested tables... you can restore it if you want
  • now the ethereum tests are passing: transactions and logs tables are now top level (root) tables and they are properly merged - because the primary key is defined on them

@rudolfix rudolfix assigned rudolfix and unassigned zilto Mar 9, 2025
@rudolfix rudolfix self-requested a review March 9, 2025 18:14
@rudolfix rudolfix force-pushed the define-hints-nested-tables branch from d268006 to 87f8a4e Compare March 10, 2025 18:33
… table name from meta was not applied in compute_table_schema of hints
@rudolfix rudolfix force-pushed the define-hints-nested-tables branch from 87f8a4e to 61c8872 Compare March 10, 2025 20:33
@rudolfix rudolfix force-pushed the define-hints-nested-tables branch from 633cad8 to ec8d2bc Compare March 16, 2025 22:43
@rudolfix rudolfix removed their request for review March 17, 2025 22:28
Copy link
Collaborator

@zilto zilto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed all files. I believe there are some typos and dev comments to remove.

I'm very excited about this feature. Awesome work!

diff_table = computed_table
computed_tables = self._compute_tables(resource, items, meta)
# overwrite root table name (if coming from meta)
# TODO: remove below, remove root_table_name from arguments
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the commented out code is ready to be removed?

)
# this is a new table so allow evolve once
if schema_contract["columns"] != "evolve" and self.schema.is_new_table(table_name):
computed_table["x-normalizer"] = {"evolve-columns-once": True}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sidenote: Reading through the codebase "evolve-columns-once": True is set after the table sees data for the first time. Maybe the name "columns-evolved-once" would better communicate that.

@@ -748,38 +748,43 @@ def test_table_name_meta_normalized() -> None:
def test_row_id_is_primary_key() -> None:
# TODO: if there's a column with row_id hint and primary_key, it should get propagated
pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove pass and TODO comment

assert table_schemas == expected_table_schemas

@dlt.transformer(nested_hints=nested_hints)
def campaign_detials(campaing):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo detials -> details in several places

]

table_schemas = campaigns().compute_nested_table_schemas("campagins", Schema("default").naming)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo campagins -> campaigns in several places


# take only the first block. the first block does not have uncles so this table should not be created and merged
info = p.run(
ethereum(slice(1)),
**destination_config.run_kwargs,
)
assert_load_info(info)
print(p.default_schema.to_pretty_yaml())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove print()



def test_nested_hints_write_disposition_append_merge() -> None:
# we can mix replace and append write disposition in nested tables
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

design comment: do we want to allow this behavior? My main concern is backward compatibility once people hack around this feature. Maybe pipeline.run() can log a warning WriteDispositionMismatch or UndefinedWriteDisposition with message that says behavior might change in the future.

I would expect the default/intended behavior to be that the nested table always follows the root table behavior

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify schema modification of child tables
3 participants