-
Notifications
You must be signed in to change notification settings - Fork 237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apply hints for nested tables #2165
base: devel
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Adding this type annotation fixed 69 failing tests. The missing Optional impacted the dlt.common.validation.validate_dict().validate_prop() functions to parse the RESTAPIConfig object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my suggestion how to deal with naming convention. Docs requirements are in the ticket.
bef7a3f
to
e51bd25
Compare
The current implementation adds the tables to the schema (as tested), but it doesn't affect how the data is loaded. For example, the hints will appear in pipeline.default_schema.tables.keys()
# ignoring the dlt tables
# 'nested_data', 'override_child_outer1', 'override_child_outer1_innerfoo','nested_data__outer1', 'nested_data__outer1__innerfoo' Whereas the normalizer row counts show no ingested data for the tables pipeline.last_trace.last_normalize_info.row_counts
# 'nested_data': 2, 'nested_data__outer1': 2, 'nested_data__outer1__innerfoo': 2 I believe changes need to be made to The extractor would need to hold some mapping, but it could be more appropriate to move the logic to |
Relational normalizer follows its logic of creating nested tables and column names. it comes only from the data. there's no mechanism to rename those, except the root table name which the user must set.
I assume that in example you are giving, you used a custom table name for nested table. If this is not the case ping me on slack. maybe there's a bug somewhere in the ticket above, there's a note:
so I'd say we block setting table name on nested hints (also parent name and incremental do make sense) |
Working on the requested documentation (#1647), I encounter many edge cases of the current implementation. My understanding is that our goal is for these two snippets to be equivalent Current testFrom @dlt.source(schema=schema)
def ethereum(slice_: slice = None):
@dlt.resource(
table_name="blocks",
write_disposition={"disposition": "merge", "strategy": merge_strategy},
)
def data():
with open(
"tests/normalize/cases/ethereum.blocks.9c1d9b504ea240a482b007788d5cd61c_2.json",
"r",
encoding="utf-8",
) as f:
yield json.load(f) if slice_ is None else json.load(f)[slice_]
# also modify the child tables (not nested)
schema_ = dlt.current.source_schema()
blocks__transactions = schema_.tables["blocks__transactions"]
blocks__transactions["write_disposition"] = "merge"
blocks__transactions["x-merge-strategy"] = merge_strategy # type: ignore[typeddict-unknown-key]
blocks__transactions["table_format"] = destination_config.table_format
blocks__transactions__logs = schema_.tables["blocks__transactions__logs"]
blocks__transactions__logs["write_disposition"] = "merge"
blocks__transactions__logs["x-merge-strategy"] = merge_strategy # type: ignore[typeddict-unknown-key]
blocks__transactions__logs["table_format"] = destination_config.table_format
return data Target API@dlt.source(schema=schema)
def ethereum(slice_: slice = None):
@dlt.resource(
table_name="blocks",
write_disposition={"disposition": "merge", "strategy": "delete-insert"}
)
def data():
with open(
"./ethereum.blocks.9c1d9b504ea240a482b007788d5cd61c_2.json",
"r",
encoding="utf-8",
) as f:
yield json.load(f) if slice_ is None else json.load(f)[slice_]
data.apply_hints(
path=["transactions"],
write_disposition="merge",
additional_table_hints={"x-merge-strategy": "delete-insert"},
table_format=None,
)
data.apply_hints(
path=["transactions", "logs"],
write_disposition="merge",
additional_table_hints={"x-merge-strategy": "delete-insert"},
table_format=None,
)
return data ProblemCurrently, inspecting the source via When in the lifecycle should the hints be assigned? After Setting parents
This PR explicitly sets odd bugWhen debugging the call to def ensure_compatible_tables(
schema_name: str, tab_a: TTableSchema, tab_b: TPartialTableSchema, ensure_columns: bool = True
) -> None:
"""Ensures that `tab_a` and `tab_b` can be merged without conflicts. Conflicts are detected when
- tables have different names
- nested tables have different parents
- tables have any column with incompatible types
Note: all the identifiers must be already normalized
"""
if tab_a["name"] != tab_b["name"]:
raise TablePropertiesConflictException(
schema_name, tab_a["name"], "name", tab_a["name"], tab_b["name"]
)
table_name = tab_a["name"]
# check if table properties can be merged
# breakpoint()
# tab_a.get("parent"): None
# tab_b.get("parent"): None
if tab_a.get("parent") != tab_b.get("parent"):
# breakpoint()
# tab_a.get("parent"): None
# tab_b.get("parent"): "blocks"
raise TablePropertiesConflictException(
schema_name, table_name, "parent", tab_a.get("parent"), tab_b.get("parent")
)
if not ensure_columns:
return |
right!
this is pretty deep PR. we'll take time to iron this out. I still have some issues with user interface (ie. @dlt.resource should accept a list of nested resources so users do not need to do many |
@zilto I've fixed a few things in the normalizer and I've made a few changes, not sure that is enough for you to take over and finish it. pls take a look and decide. what happened:
|
d268006
to
87f8a4e
Compare
… table name from meta was not applied in compute_table_schema of hints
87f8a4e
to
61c8872
Compare
633cad8
to
ec8d2bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed all files. I believe there are some typos and dev comments to remove.
I'm very excited about this feature. Awesome work!
diff_table = computed_table | ||
computed_tables = self._compute_tables(resource, items, meta) | ||
# overwrite root table name (if coming from meta) | ||
# TODO: remove below, remove root_table_name from arguments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the commented out code is ready to be removed?
) | ||
# this is a new table so allow evolve once | ||
if schema_contract["columns"] != "evolve" and self.schema.is_new_table(table_name): | ||
computed_table["x-normalizer"] = {"evolve-columns-once": True} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sidenote: Reading through the codebase "evolve-columns-once": True
is set after the table sees data for the first time. Maybe the name "columns-evolved-once"
would better communicate that.
@@ -748,38 +748,43 @@ def test_table_name_meta_normalized() -> None: | |||
def test_row_id_is_primary_key() -> None: | |||
# TODO: if there's a column with row_id hint and primary_key, it should get propagated | |||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove pass
and TODO
comment
assert table_schemas == expected_table_schemas | ||
|
||
@dlt.transformer(nested_hints=nested_hints) | ||
def campaign_detials(campaing): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo detials -> details
in several places
] | ||
|
||
table_schemas = campaigns().compute_nested_table_schemas("campagins", Schema("default").naming) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo campagins -> campaigns
in several places
|
||
# take only the first block. the first block does not have uncles so this table should not be created and merged | ||
info = p.run( | ||
ethereum(slice(1)), | ||
**destination_config.run_kwargs, | ||
) | ||
assert_load_info(info) | ||
print(p.default_schema.to_pretty_yaml()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove print()
|
||
|
||
def test_nested_hints_write_disposition_append_merge() -> None: | ||
# we can mix replace and append write disposition in nested tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
design comment: do we want to allow this behavior? My main concern is backward compatibility once people hack around this feature. Maybe pipeline.run()
can log a warning WriteDispositionMismatch
or UndefinedWriteDisposition
with message that says behavior might change in the future.
I would expect the default/intended behavior to be that the nested table always follows the root table behavior
Description
Draft of nested table hints implementation:
Is working so far but there are some bugs and tests needed.
Related Issues
Additional Context