"model" item_format support #2404

Draft · wants to merge 5 commits into base: devel

Conversation

Collaborator
@sh-rp sh-rp commented Mar 13, 2025

Description

This PR introduces a new item format "model": a SQL SELECT statement whose result will be inserted into a given table. For now only the duckdb destination supports this; further destinations will follow once @anuunchin finishes her research on this.

An item can be marked with a HintsMeta to indicate the "model" item format, which means the yielded string will be interpreted as a valid SELECT statement for the targeted destination. During extraction, each statement is stored in its own job.

A resource emitting a model query would look like this, given an input dataset. Columns need to be supplied so dlt can create / update the table:

@dlt.resource()
def copied_table() -> Any:
    # `dataset` is the input dataset mentioned above
    query = dataset["example_table"].limit(5).query()
    yield dlt.mark.with_hints(
        query, hints=make_hints(columns={...}), data_item_format="model"
    )
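
For context, a run of this resource against the duckdb destination (the only one supporting "model" jobs in this PR) could look like the sketch below; the pipeline and dataset names are made up for illustration:

import dlt

pipeline = dlt.pipeline(
    pipeline_name="copy_tables", destination="duckdb", dataset_name="copies"
)
info = pipeline.run(copied_table())
print(info)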

netlify bot commented Mar 13, 2025

Deploy Preview for dlt-hub-docs canceled.
Latest commit c7a9adf · deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/67d7f2c7a45bc7000893c322

@sh-rp sh-rp force-pushed the feat/2366-sql-jobs-2 branch 2 times, most recently from 33dbe36 to f2e10c5 Compare March 13, 2025 12:46
is_binary_format=False,
supports_schema_changes="True",
supports_compression=False,
# NOTE: we create a new model file for each sql row
Collaborator Author

I think it makes sense to have one model per file

Collaborator

indeed you can yield many models per table in a single run. isn't it an error in the code that generates models?

Contributor

@rudolfix I don't quite understand what this convo is about 👀. If rephrased, is it the same as:

@rudolfix: One can yield many models (many insert statements) per table in a single run, so it's wrong to have one model (insert statement) per table
@sh-rp: No, it does make sense to have one model (insert statement) per table

@@ -129,7 +129,7 @@ class duckdb(Destination[DuckDbClientConfiguration, "DuckDbClient"]):
def _raw_capabilities(self) -> DestinationCapabilitiesContext:
caps = DestinationCapabilitiesContext()
caps.preferred_loader_file_format = "insert_values"
caps.supported_loader_file_formats = ["insert_values", "parquet", "jsonl"]
caps.supported_loader_file_formats = ["insert_values", "parquet", "jsonl", "model"]
Collaborator Author

For this first iteration we only support duckdb; the transformations can check the capabilities to figure out where pure SQL may be used.
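
A minimal sketch of such a capabilities check (the pipeline variable is assumed to exist; only supported_loader_file_formats comes from this diff):

caps = pipeline.destination.capabilities()
if "model" in caps.supported_loader_file_formats:
    # the destination can execute pure SQL model jobs
    ...
else:
    # fall back to extracting and loading the data itself
    ...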

@@ -68,11 +69,17 @@ class TResourceHints(TResourceHintsBase, total=False):


class HintsMeta:
__slots__ = ("hints", "create_table_variant")
__slots__ = ("hints", "create_table_variant", "data_item_format")
Collaborator Author
@sh-rp sh-rp Mar 13, 2025

I opted to add a slot on the default HintsMeta, I could also create a subclass similar to the file import, but for me this solution somehow makes more sense, since it is kind of a hint, but does not need to go into the schema.

Collaborator

I think you do not need to modify HintsMeta. Just create a container class in which you'll yield models, i.e.

class ModelStr(str):
    pass

and yield it. Are you 100% sure we do not need any model properties to be stored in the file? Or do we plan to pass those via TableSchema? I.e. look here:
https://docs.getdbt.com/reference/model-configs (materialization)
https://docs.getdbt.com/docs/build/incremental-strategy (this is our write disposition + primary key) heh cool
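
A rough sketch of this suggestion (ModelStr and the isinstance check are illustrative, not existing dlt API):

class ModelStr(str):
    """A plain string subclass that marks the value as a SQL model."""


def get_data_item_format(items: TDataItems) -> TDataItemFormat:
    # the model format is detected from the item's type alone, no meta needed
    if isinstance(items, ModelStr):
        return "model"
    ...  # existing object / arrow detection continues here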

Collaborator Author

I have done it this way now.

@sh-rp sh-rp force-pushed the feat/2366-sql-jobs-2 branch from f2e10c5 to 1daad9b Compare March 13, 2025 13:54
@sh-rp sh-rp force-pushed the feat/2366-sql-jobs-2 branch from 5d59485 to 83cc002 Compare March 13, 2025 14:47
@@ -60,14 +62,21 @@
pandas = None


def get_data_item_format(items: TDataItems) -> TDataItemFormat:
def get_data_item_format(items: TDataItems, meta: Any = None) -> TDataItemFormat:
Collaborator Author

Each row that has a special item type needs to have the meta set, similar to how the ImportFileMeta works. We could also store it so it works more like the other hints, but I think this should be ok. It's mostly going to be used from the transformations anyway.

Collaborator

as mentioned, just check the item type

@sh-rp sh-rp changed the title [tmp] model itemformat support "model" item_format support Mar 13, 2025
Collaborator
@rudolfix rudolfix left a comment

looks cool! but IMO can be simplified

"""
sql_client = self._job_client.sql_client
name = sql_client.make_qualified_table_name(self._load_table["name"])
return f"INSERT INTO {name} {select_statement};"
Collaborator

INSERT is just one of the options. Other options are MERGE and DELETE .. INSERT, which (guess what) work like our merge jobs.

I think you need to take PreparedTableSchema here and generate INSERT/MERGE code depending on write disposition (maybe not now, maybe in the future).

Also I assume that model jobs for a table chain will be loaded as a single statement (preferably within a transaction).

We could probably replace our sql merge jobs with it.
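
A hedged sketch of what such dispatching could look like (function and parameter names are illustrative, not the actual job implementation):

def render_model_sql(
    qualified_name: str, select_statement: str, write_disposition: str, primary_key: str
) -> str:
    if write_disposition in ("append", "replace"):
        return f"INSERT INTO {qualified_name} {select_statement};"
    if write_disposition == "merge":
        # delete rows that will be replaced, then insert the fresh ones
        return (
            f"DELETE FROM {qualified_name} WHERE {primary_key} IN "
            f"(SELECT {primary_key} FROM ({select_statement}) AS src);\n"
            f"INSERT INTO {qualified_name} {select_statement};"
        )
    raise ValueError(f"unsupported write disposition: {write_disposition}")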

Collaborator Author

Right now these inserts are done on the dataset that the jobs are initially loaded to. So if you have a sql transformation with write_disposition merge, it will insert the rows into the staging table and then create a regular merge job, i.e. another sql job that gets executed at the end. At least I think it should do this; I still have to write tests. I'd like to keep this PR as simple as possible for now and then migrate to more efficient jobs later.

Collaborator

heh right! but I see a way to skip the staging dataset...

Collaborator Author

Yes of course, but IMHO out of scope for now, because I am actually trying to get the transformations built in time. Unless you insist, then I'll do it :)

Collaborator Author

I think it could be a cool job for @anuunchin to see whether we can make nice merge jobs here.

path = self._get_data_item_path_template(load_id, schema_name, table_name)
writer = BufferedDataWriter(self.writer_spec, path)
writer = BufferedDataWriter(self.writer_spec, path, **kwargs)
Collaborator

why? aren't those signatures explicit?

Collaborator

OK I get it now, but I think you could just take file_max_items from writer_spec in BufferedDataWriter. You do not need to pass any kwargs.

Collaborator Author

The problem is that if I set an explicit None as an argument, it will not take this value from the BufferedDataWriterConfiguration. It seems the config values in constructors are only injected for arguments that do not have an explicit value set during instantiation, which makes sense to me. This is also true for a None value, so I need to call BufferedDataWriter.__init__ either with this argument or without it, but never with None.
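
A minimal sketch of the pattern being described (only file_max_items is shown; the surrounding names follow the diff above):

writer_kwargs = {}
if file_max_items is not None:
    # only forward the argument when it has a value; omitting it lets the injected
    # BufferedDataWriterConfiguration supply the default, which an explicit None would override
    writer_kwargs["file_max_items"] = file_max_items
writer = BufferedDataWriter(self.writer_spec, path, **writer_kwargs)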

@@ -81,7 +86,7 @@ def with_hints(
Create `TResourceHints` with `make_hints`.
Setting `table_name` will dispatch the `item` to a specified table, like `with_table_name`
"""
return DataItemWithMeta(HintsMeta(hints, create_table_variant), item)
return DataItemWithMeta(HintsMeta(hints or {}, create_table_variant, data_item_format), item)
Collaborator

why default {} value?

Collaborator Author

I removed it, it was a leftover.

@@ -279,6 +311,9 @@ def create_load_job(
if SqlLoadJob.is_sql_job(file_path):
# create sql load job
return SqlLoadJob(file_path)
if ModelLoadJob.is_model_job(file_path):
Collaborator

hmmmm maybe we should register jobs via plugin as well?

Collaborator Author
@sh-rp sh-rp commented Mar 14, 2025

"Are you 100% sure we do not need any model properties to be stored in the file? or we plan to pass those via TableSchema?"
I'm not. Conceptually it would be cool if the model files only had the instructions of where to take the data from and the rest were in the schema, so it is like all the other job files. I was planning to add views in the transformations; I would've added a new table format for this here, but in your review you said we don't need them, so now this is not here. For merging and upserting we already have the right hints, which are in the schema, so I don't think anything else is required. But I'm not 100% sure at this point :)

@@ -111,7 +111,7 @@ def _get_items_normalizer(
if item_format == "file":
Collaborator

I think item_format_from_file_extension should return the model item format, not file, for models. Here this is not relevant, but IMO it may be in other places.

Collaborator Author
@sh-rp sh-rp Mar 17, 2025

It does already, doesn't it? Or am I misunderstanding your comment?

        if extension == "typed-jsonl":
            return "object"
        elif extension == "parquet":
            return "arrow"
        elif extension == "model":
            return "model"

@sh-rp sh-rp requested a review from rudolfix March 17, 2025 14:29
@@ -0,0 +1,170 @@
# test the sql insert job loader, works only on duckdb for now
Collaborator Author

I am not quite sure what else to test here. Basically we are just testing whether these sql statements are created properly; all the column types etc. should work since we are on the same destination, and if they don't, there is not much we can do about it really, since we are not touching the data.
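
A rough example of the kind of end-to-end check described here (names are made up; it assumes the copied_table resource from the PR description, which selects exactly 5 rows):

def test_copied_table_row_count() -> None:
    pipeline = dlt.pipeline(
        pipeline_name="model_test", destination="duckdb", dataset_name="copies"
    )
    pipeline.run(copied_table())
    # the generated INSERT .. SELECT should have materialized exactly 5 rows
    with pipeline.sql_client() as client:
        rows = client.execute_sql("SELECT count(*) FROM copied_table")
    assert rows[0][0] == 5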
