"model" item_format support takeover #2423
base: devel
Conversation
    ),
    ids=lambda x: x.name,
)
def test_multiple_statements_per_resource(destination_config: DestinationTestConfiguration) -> None:
@sh-rp This test fails for postgres because it's strict about enforcing unique constraints:
dlt.destinations.exceptions.DatabaseTerminalException: duplicate key value violates unique constraint "copied_table__dlt_id_key"
DETAIL: Key (_dlt_id)=(L4WTgeyuBgTi/A) already exists.
On that note, I'm not sure what the purpose of this test is 👀: the comment implies that we're yielding two models to create two new tables, yet only one new table is created, namely copied_table.
This is really good! But there are a few conceptual problems we must address:
- In the simplest case the model is just a string, and we use the destination dialect to parse it. This is fine.
- But there will be cases where one dialect is used to write transformations and a different one to execute them, i.e. stuff is created with duckdb and executed on snowflake. In that case we lose that information.
My take: convert
- ModelStr into a named tuple (i.e. SqlModel) of (dialect, model), where dialect is optional
- allow creating it from a sqlglot expression, which will populate both elements
- the file writer should store both dialect and model. My take:
  first line: `dialect: name` (optional)
  next lines: the actual model
  This way we can store models with and without a dialect.
The file writer should do basic sanity checks:
- parse the model using the dialect
- if no dialect is found, use the destination dialect (you should have caps there)
- if there are no caps, the ANSI dialect is used
- after parsing you should have a list of columns, which you must compare with the list of columns in the dlt schema. Raise on any mismatch. The transformation layer must generate correct data here (with that, the databricks problem should be detected early).
In the model job: parse the model file and use the correct dialect to load it; then, if a different dialect was used, transpile it to the destination dialect. A sketch of this proposal follows below.
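A minimal sketch of the proposal above, assuming sqlglot for parsing; the names `SqlModel`, `write_model_file`, and `validate_columns` are illustrative, not part of this PR:

```python
from typing import List, NamedTuple, Optional

import sqlglot


class SqlModel(NamedTuple):
    # dialect is optional: it may be unknown when the model is a plain string
    dialect: Optional[str]
    model: str

    @classmethod
    def from_expression(cls, expr: "sqlglot.exp.Expression", dialect: str) -> "SqlModel":
        # creating from a sqlglot expression populates both elements
        return cls(dialect=dialect, model=expr.sql(dialect=dialect))


def write_model_file(path: str, m: SqlModel) -> None:
    # first line: "dialect: <name>" (optional); next lines: the actual model
    with open(path, "w", encoding="utf-8") as f:
        if m.dialect:
            f.write(f"dialect: {m.dialect}\n")
        f.write(m.model)


def validate_columns(
    m: SqlModel, destination_dialect: Optional[str], schema_columns: List[str]
) -> None:
    # fall back to the destination dialect; sqlglot's generic dialect if both are None
    parsed = sqlglot.parse_one(m.model, read=m.dialect or destination_dialect)
    selected = parsed.named_selects  # output column names of the SELECT
    if set(selected) != set(schema_columns):
        raise ValueError(f"column mismatch: {selected} vs {schema_columns}")
```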
Now a few other things:
- We should be able to enable Athena! The dialect we need is Trino (models are queries, then we do DML; for that, Trino is used).
- Athena actually has more engines; Hive is used for DDL. For that we'd need a separate setting in caps. Just ignore it for now.
I am not convinced that anyone would be creating these statements on one database type and sending them to another. You are building these sql strings with ibis expressions on a dlt dataset, and that dataset is the one you load to in the end, so the dialects will match. If we really want to keep the option to load any model file on any destination, I would suggest requiring that model files be written in the generic sqlglot dialect; that way the ModelJob, which knows where it is being executed, can transpile them on the fly (sketched below).

About Athena: that would be cool, but for me it would be a follow-up PR.

PS: Detecting matching columns is not always possible; for example, with a star select the columns are not known without checking the schema of the destination. Also, this is really a job for the transformation layer and is kind of implemented there already.
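For reference, a one-line sketch of that on-the-fly transpilation with sqlglot (when `read` is omitted, the generic dialect is assumed; "snowflake" stands in for the actual destination):

```python
import sqlglot

# parse the model in the generic dialect and render it for the destination
statements = sqlglot.transpile("SELECT id, name FROM source_table", write="snowflake")
```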
Disclaimer:
This PR is a takeover of this PR.
Description
This PR introduces a new item format, "model": a SQL SELECT statement whose result is inserted into a given table. For now, all default SQL destinations support it except Athena and Dremio.
It is possible to mark an item with a HintsMeta to indicate the "model" item format, which means the yielded string will be interpreted as a valid SELECT statement for the targeted destination. During extraction, each statement is stored in its own job.
A resource emitting a model query would look like the example below, given an input dataset. Columns need to be supplied so dlt can create or update the table:
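A hedged sketch of such a resource; the marking mechanism shown here (`dlt.mark.with_hints` with `dlt.mark.make_hints`) and the table and column names are illustrative assumptions, not necessarily the exact API introduced by this PR:

```python
import dlt


@dlt.resource(name="copied_table")
def copied_table():
    # yield a SELECT statement as a "model" item; columns are declared
    # up front so dlt can create or update the target table
    yield dlt.mark.with_hints(
        "SELECT id, name FROM source_table",  # assumed input table
        dlt.mark.make_hints(
            columns=[
                {"name": "id", "data_type": "bigint"},
                {"name": "name", "data_type": "text"},
            ]
        ),
    )
```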
Related to #2420, #2399.