
"model" item_format support takeover #2423

Open
wants to merge 11 commits into
base: devel
Choose a base branch
from
Open

Conversation

anuunchin (Contributor) commented Mar 19, 2025

Disclaimer:

This is a takeover of an earlier PR.

Description

This PR introduces a new item format, "model": a SQL SELECT statement whose result is inserted into a given table. For now, all default SQL destinations support it, except Athena and Dremio.

An item can be marked with a HintsMeta indicating the "model" item format, which means the yielded string is interpreted as a SELECT statement valid for the targeted destination. During extraction, each statement is stored in its own job.

A resource emitting a model query looks like the following, given an input dataset. Columns need to be supplied so that dlt can create or update the table:

@dlt.resource()
def copied_table() -> Any:
    # build a SELECT statement from an existing dataset
    query = dataset["example_table"].limit(5).query()
    # mark the yielded string as a "model" item and supply column hints
    yield dlt.mark.with_hints(
        query, hints=make_hints(columns={...}), data_item_format="model"
    )

Related to #2420, #2399.

netlify bot commented Mar 19, 2025

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: fbd7f9e
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/67e2d6bbbd9d47000864d55b

@anuunchin anuunchin changed the title "model" item_format support "model" item_format support takeover Mar 19, 2025
@anuunchin anuunchin marked this pull request as draft March 19, 2025 18:57
@anuunchin anuunchin marked this pull request as ready for review March 21, 2025 12:46
@anuunchin anuunchin requested a review from sh-rp March 21, 2025 12:50
This was referenced Mar 24, 2025
@sh-rp sh-rp assigned sh-rp and anuunchin and unassigned sh-rp Mar 24, 2025
@anuunchin anuunchin force-pushed the feat/2366-sql-jobs-3 branch from bcff4eb to f2e648f on March 25, 2025 10:24
Inline comment on the following test code:

),
ids=lambda x: x.name,
)
def test_multiple_statements_per_resource(destination_config: DestinationTestConfiguration) -> None:

anuunchin (Contributor, Author) commented:

@sh-rp This test fails for postgres because it's strict about enforcing unique constraints:

dlt.destinations.exceptions.DatabaseTerminalException: duplicate key value violates unique constraint "copied_table__dlt_id_key"
DETAIL:  Key (_dlt_id)=(L4WTgeyuBgTi/A) already exists.

On that note, I'm not sure what the purpose of this test is 👀, since the comment implies that we're yielding two models to create two new tables, even though only one new table is created, namely copied_table.

@anuunchin anuunchin requested a review from rudolfix April 1, 2025 13:29

rudolfix (Collaborator) left a comment

This is really good! But there are a few conceptual problems we must address:

  • in the simplest case the model is just a string, and we use the destination dialect to parse it. This is fine
  • but there will be cases where one dialect is used to write transformations and a different one to execute them, i.e. the model is created with duckdb and executed on snowflake. In that case we lose that information.

My take: convert

  • ModelStr into a named tuple (i.e. SqlModel) of (dialect, model), where dialect is optional
  • allow creating it from a sqlglot expression, which will populate both elements
  • the file writer should store both dialect and model. My take:
    first line: the dialect name (optional)
    next lines: the actual model
    this way we can store models both with and without a dialect (a rough sketch follows below).
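
A minimal sketch of what that could look like, assuming sqlglot is available; SqlModel, to_model_file and from_model_file are illustrative names here, not existing dlt API:

from typing import NamedTuple, Optional

import sqlglot.expressions as exp


class SqlModel(NamedTuple):
    query: str
    dialect: Optional[str] = None  # dialect is optional


def model_from_sqlglot(expression: exp.Expression, dialect: str) -> SqlModel:
    # creating the model from a sqlglot expression populates both elements
    return SqlModel(expression.sql(dialect=dialect), dialect)


def to_model_file(model: SqlModel) -> str:
    # first line: the (optional) dialect, next lines: the actual model
    return f"dialect: {model.dialect or ''}\n" + model.query


def from_model_file(content: str) -> SqlModel:
    header, _, query = content.partition("\n")
    dialect = header.removeprefix("dialect:").strip() or None
    return SqlModel(query, dialect)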

The file writer should do basic sanity checks:

  • parse the model using its dialect
  • if no dialect is found, use the destination dialect (you should have caps there)
  • if there are no caps, use the ANSI dialect
  • after parsing you should have a list of columns, which you must compare with the list of columns in the dlt schema. Raise on any mismatch. The transformation layer must generate correct data here (with that, the databricks problem should be detected early)

In the model job: parse the model file and use the correct dialect to load it; then, if a different dialect was used, transpile it to the destination dialect (see the sketch below).
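
A rough illustration of those checks and the transpile step; validate_model and to_destination_sql are hypothetical helpers, and the column check assumes the parsed statement is a plain SELECT without a star select:

from typing import Optional, Set

import sqlglot


def validate_model(
    model_sql: str,
    dialect: Optional[str],
    destination_dialect: Optional[str],
    schema_columns: Set[str],
) -> None:
    # fall back to the destination dialect, then to sqlglot's default dialect
    read_dialect = dialect or destination_dialect
    parsed = sqlglot.parse_one(model_sql, read=read_dialect)
    # output column names of the SELECT, compared against the dlt schema
    selected = set(parsed.named_selects)
    if selected != schema_columns:
        raise ValueError(f"model columns do not match dlt schema: {selected ^ schema_columns}")


def to_destination_sql(model_sql: str, dialect: Optional[str], destination_dialect: str) -> str:
    # in the model job: if a different dialect was stored, transpile to the destination dialect
    if dialect and dialect != destination_dialect:
        return sqlglot.transpile(model_sql, read=dialect, write=destination_dialect)[0]
    return model_sql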

Now a few other things:

  • we should be able to enable Athena! The dialect we need is Trino (models are queries, and then we do DML, for which Trino is used)
  • Athena actually has more engines: Hive is used for DDL. For that we'd need a separate setting in caps; just ignore it for now

sh-rp (Collaborator) commented Apr 1, 2025

I am not convinced that anyone would be creating these statements on one database type and sending them to another. You build these SQL strings with ibis expressions on a dlt dataset, and that dataset is the one you load to in the end, so the dialects will match. If we really want to keep the option to load any model file on any destination, I would suggest requiring that model files be written in the generic sqlglot dialect; that way the ModelJob, which knows where it is being executed, can transpile them on the fly (see the sketch below).
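
For illustration, a minimal sketch of that on-the-fly transpilation, with "snowflake" standing in for whichever destination the job runs against:

import sqlglot

# model file content, required to be in the generic sqlglot dialect
generic_sql = 'SELECT "id", "value" FROM "example_table" LIMIT 5'
# the ModelJob knows its destination, so it picks the write dialect at load time
destination_sql = sqlglot.transpile(generic_sql, write="snowflake")[0]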

About Athena: that would be cool, but for me this would be a follow-up PR.

Another PS: detecting matching columns is not always possible; for example, with a star select (SELECT *) the columns are not known without checking the schema of the destination. Also, this is really a job for the transformation layer and is more or less implemented there already.
