
"model" item_format support takeover #2423

Open
wants to merge 11 commits into
base: devel
Choose a base branch
from
Open

Conversation

anuunchin (Contributor) commented Mar 19, 2025

Disclaimer:

This is a takeover of an earlier PR.

Description

This PR introduces a new item format, "model": a SQL SELECT statement whose result is inserted into a given table. For now, all default SQL destinations support it, except Athena and Dremio.

An item can be marked with a HintsMeta indicating the "model" item format, which means the yielded string is interpreted as a SELECT statement valid for the targeted destination. During extraction, each statement is stored in its own job.

A resource emitting a model query looks like the following, given an input dataset. Columns need to be supplied so that dlt can create or update the table:

@dlt.resource()
def copied_table() -> Any:
    # build a SELECT statement from an existing dataset
    query = dataset["example_table"].limit(5).query()
    # mark the yielded string as a "model" item and supply column hints
    yield dlt.mark.with_hints(
        query, hints=make_hints(columns={...}), data_item_format="model"
    )

Related to #2420, #2399.

netlify bot commented Mar 19, 2025

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: fbd7f9e
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/67e2d6bbbd9d47000864d55b

@anuunchin anuunchin changed the title "model" item_format support "model" item_format support takeover Mar 19, 2025
@anuunchin anuunchin marked this pull request as draft March 19, 2025 18:57
@anuunchin anuunchin marked this pull request as ready for review March 21, 2025 12:46
@anuunchin anuunchin requested a review from sh-rp March 21, 2025 12:50
This was referenced Mar 24, 2025
@sh-rp sh-rp assigned sh-rp and anuunchin and unassigned sh-rp Mar 24, 2025
@anuunchin anuunchin force-pushed the feat/2366-sql-jobs-3 branch from bcff4eb to f2e648f on March 25, 2025 10:24
Inline comment on the following test code:

),
ids=lambda x: x.name,
)
def test_multiple_statements_per_resource(destination_config: DestinationTestConfiguration) -> None:

anuunchin (Contributor, Author) commented:

@sh-rp This test fails for postgres because it's strict about enforcing unique constraints:

dlt.destinations.exceptions.DatabaseTerminalException: duplicate key value violates unique constraint "copied_table__dlt_id_key"
DETAIL:  Key (_dlt_id)=(L4WTgeyuBgTi/A) already exists.

On that note, I'm not sure what the purpose of this test is 👀, since the comment implies that we're yielding two models to create two new tables, even though only one new table is created, namely copied_table.

@anuunchin anuunchin requested a review from rudolfix April 1, 2025 13:29

rudolfix (Collaborator) left a comment

This is really good! But there are a few conceptual problems we must address:

  • in the simplest case the model is just a string, and we use the destination dialect to parse it. This is fine
  • but there will be cases where one dialect is used to write transformations and a different one to execute them, i.e. the model is created with duckdb and executed on snowflake. In that case we lose that information.

My take: convert

  • ModelStr into a named tuple (i.e. SqlModel) of (dialect, model), where dialect is optional
  • allow creating it from a sqlglot expression, which will populate both elements
  • the file writer should store both dialect and model. My take:
    first line: the dialect name (optional)
    next lines: the actual model
    this way we can store models both with and without a dialect (a rough sketch follows below).
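
A minimal sketch of what that could look like, assuming sqlglot is available; SqlModel, to_model_file and from_model_file are illustrative names here, not existing dlt API:

from typing import NamedTuple, Optional

import sqlglot.expressions as exp


class SqlModel(NamedTuple):
    query: str
    dialect: Optional[str] = None  # dialect is optional


def model_from_sqlglot(expression: exp.Expression, dialect: str) -> SqlModel:
    # creating the model from a sqlglot expression populates both elements
    return SqlModel(expression.sql(dialect=dialect), dialect)


def to_model_file(model: SqlModel) -> str:
    # first line: the (optional) dialect, next lines: the actual model
    return f"dialect: {model.dialect or ''}\n" + model.query


def from_model_file(content: str) -> SqlModel:
    header, _, query = content.partition("\n")
    dialect = header.removeprefix("dialect:").strip() or None
    return SqlModel(query, dialect)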

The file writer should do basic sanity checks:

  • parse the model using its dialect
  • if no dialect is found, use the destination dialect (you should have caps there)
  • if there are no caps, use the ANSI dialect
  • after parsing you should have a list of columns, which you must compare with the list of columns in the dlt schema. Raise on any mismatch. The transformation layer must generate correct data here (with that, the databricks problem should be detected early)

In the model job: parse the model file and use the correct dialect to load it; then, if a different dialect was used, transpile it to the destination dialect (see the sketch below).
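
A rough illustration of those checks and the transpile step; validate_model and to_destination_sql are hypothetical helpers, and the column check assumes the parsed statement is a plain SELECT without a star select:

from typing import Optional, Set

import sqlglot


def validate_model(
    model_sql: str,
    dialect: Optional[str],
    destination_dialect: Optional[str],
    schema_columns: Set[str],
) -> None:
    # fall back to the destination dialect, then to sqlglot's default dialect
    read_dialect = dialect or destination_dialect
    parsed = sqlglot.parse_one(model_sql, read=read_dialect)
    # output column names of the SELECT, compared against the dlt schema
    selected = set(parsed.named_selects)
    if selected != schema_columns:
        raise ValueError(f"model columns do not match dlt schema: {selected ^ schema_columns}")


def to_destination_sql(model_sql: str, dialect: Optional[str], destination_dialect: str) -> str:
    # in the model job: if a different dialect was stored, transpile to the destination dialect
    if dialect and dialect != destination_dialect:
        return sqlglot.transpile(model_sql, read=dialect, write=destination_dialect)[0]
    return model_sql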

Now a few other things:

  • we should be able to enable Athena! The dialect we need is Trino (models are queries, and then we do DML, for which Trino is used)
  • Athena actually has more engines: Hive is used for DDL. For that we'd need a separate setting in caps; just ignore it for now

sh-rp (Collaborator) commented Apr 1, 2025

I am not convinced that anyone would be creating these statements on one database type and sending them to another. You build these SQL strings with ibis expressions on a dlt dataset, and that dataset is the one you load to in the end, so the dialects will match. If we really want to keep the option to load any model file on any destination, I would suggest requiring that model files be written in the generic sqlglot dialect; that way the ModelJob, which knows where it is being executed, can transpile them on the fly (see the sketch below).
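
For illustration, a minimal sketch of that on-the-fly transpilation, with "snowflake" standing in for whichever destination the job runs against:

import sqlglot

# model file content, required to be in the generic sqlglot dialect
generic_sql = 'SELECT "id", "value" FROM "example_table" LIMIT 5'
# the ModelJob knows its destination, so it picks the write dialect at load time
destination_sql = sqlglot.transpile(generic_sql, write="snowflake")[0]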

About Athena: that would be cool, but for me this would be a follow-up PR.

Another PS: detecting matching columns is not always possible; for example, with a star select (SELECT *) the columns are not known without checking the schema of the destination. Also, this is really a job for the transformation layer and is more or less implemented there already.
