transformations tutorial #2401
base: devel
Conversation
```python
company_dataset = company_pipeline.dataset()

# print the employee table without the name column as arrow table
print(company_dataset.employees.drop("name").arrow())
```
@dlt_plus.transform(transformation_type="python", write_disposition="replace") | ||
def rows_per_table(dataset: SupportsReadableDataset): | ||
# this is a speical relation on the dataset which computes row_counts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# this is a speical relation on the dataset which computes row_counts | |
# this is a special relation on the dataset which computes row_counts |
return dataset.row_counts() | ||
|
||
|
||
```python
# we can group transformnations into a regular source
```

Suggested change (typo): `# we can group transformnations into a regular source` → `# we can group transformations into a regular source`
> ## Introductory notes
>
> To work through this tutorial, we will be using multiple local DuckDB instances to transform and sync data between them. In the real world, you will be able to use any destination/dataset that supports ibis expressions. We will also create an example dataset with a couple of tables to work with to demonstrate how you can make use of transformations.
I think it would be cooler to say it more generally first and then specify the instances used in the tutorial, something like:
"we will look at how to sync data between two destinations/datasets. In this tutorial we will simply be using two duckdb instances, but in the real world those could be anything that supports ibis expressions (link to list)"
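For reference, a minimal sketch of the setup that intro describes: two local DuckDB pipelines plus a tiny example table (the table contents below are illustrative, not the tutorial's actual fixture data):

```python
import dlt

# source side: a small "company" dataset backed by a local DuckDB file
company_pipeline = dlt.pipeline(
    pipeline_name="company_pipeline",
    destination="duckdb",
    dataset_name="company_dataset",
)
company_pipeline.run(
    [
        {"id": 1, "name": "Ana", "department_id": 10},
        {"id": 2, "name": "Bob", "department_id": 20},
    ],
    table_name="employees",
)

# target side: a second, independent DuckDB instance acting as the "warehouse"
warehouse_pipeline = dlt.pipeline(
    pipeline_name="warehouse_pipeline",
    destination="duckdb",
    dataset_name="warehouse_dataset",
)
```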
> ## A first few transformations
>
> Transformations work very similarly to regular dlt.resources, which they are under the hood, and can be grouped into a regular `dlt.source`. Transformations take a populated dataset as input to build transformations from this source and move the computed data into the same or another dataset. If data stays within the same physical destination, you can execute the transformation as a SQL statement and no data needs to be moved to the machine where the dlt pipeline is running. For all other cases, data will be extracted the same way as with a regular dlt pipeline.
this sentence is a bit circular: "Transformations take a populated dataset as input to build transformations from this source" - not sure how to make it better though
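For what it's worth, a rough sketch of what the quoted paragraph seems to describe: a transformation defined like a resource and grouped into a regular `dlt.source`. The import path of `SupportsReadableDataset` and the exact way the transform is wired into the source are assumptions on my side:

```python
import dlt
import dlt_plus
from dlt.common.destination.dataset import SupportsReadableDataset  # import path assumed

@dlt_plus.transform(transformation_type="python", write_disposition="replace")
def rows_per_table(dataset: SupportsReadableDataset):
    # returns a relation; dlt renders/executes it and loads the result as a table
    return dataset.row_counts()

@dlt.source
def company_stats(dataset: SupportsReadableDataset):
    # group one or more transformations into a regular dlt source
    yield rows_per_table(dataset)
```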
> Let's move some data from the company dataset into a fictional warehouse dataset. Let's assume we want the employee master data to be available in the warehouse dataset. We want to make sure not to include employee names in the warehouse dataset for privacy reasons.
suggestion: a bit more polished:
"Let's transfer some data from the company dataset into a fictional warehouse dataset. Let's assume we want the employee master data to be included in the warehouse, but the employee names should be excluded to ensure privacy."
> We have a straightforward company database with live employee master data containing names and payroll payments, as well as two tables for offices and departments. As always, you can inspect the contents of the dataset by running `dlt pipeline company_pipeline show` to open the dlt streamlit app.
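Putting the quoted pieces together, a sketch of the employees-without-names transformation described above, reusing the `drop("name")` relation and the python transformation decorator already quoted in this PR; how the result is loaded into the warehouse pipeline is my assumption:

```python
@dlt_plus.transform(transformation_type="python")
def employees(dataset: SupportsReadableDataset):
    # exclude the sensitive name column before anything leaves the company dataset
    return dataset.employees.drop("name")

# presumably run against the warehouse pipeline with the company dataset as input,
# e.g. something along the lines of:
# warehouse_pipeline.run(employees(company_pipeline.dataset()))
```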
> ## A first few transformations
first few or few first?
> ### Create aggregated tables in the same dataset
>
> You can run a transformation against the same dataset that the source data comes from. This might be useful if you need some pre-computed stats (similar to a materialized view), or you need to do some cleanup after a load on the same dataset. Let's use the same transformation as above plus another one in a group. Since the source and destination datasets are the same, we can use SQL transformations, which are much faster.
I think the concepts are quite clear, but I think it would be even nicer if there was a bit more of a didactic flow from one example to the next (explaining why we redo the payment_per_employee table and highlighting when data is moving between places, or not).
You're already saying it, but only in a more general way. What I am thinking would be along the lines of this:
"so let's say instead of having the transformation that creates the payment_per_employee stat write from our company dataset directly into the warehouse, let's define it to work on the source dataset directly. That way we can use sql transformations which are much faster. This might be useful if we need some pre-computed stats..."
```python
# we can supply a column hint for the static_column, which will work for sql transformations
@dlt_plus.transform(transformation_type="sql", lineage_mode="best_effort", column_hints={"static_column": {"type": "double"}})
def mutated_transformation(dataset: SupportsReadableDataset):
    return dataset.employees.mutate(static_column=1234)
```
hm
should be columns=
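Picking up the same-dataset SQL idea from the section quoted above, a sketch of an aggregation defined as a `sql` transformation so it executes inside the company dataset's database without moving data. The table and column names `payroll`, `employee_id` and `payment` are made up for illustration, and imports are as in the sketch further up:

```python
@dlt_plus.transform(transformation_type="sql", write_disposition="replace")
def payments_per_employee(dataset: SupportsReadableDataset):
    payroll = dataset.payroll
    # an ibis-style aggregation; because source and destination datasets are the
    # same, this can be rendered to SQL and run in place
    return payroll.group_by("employee_id").aggregate(total_payment=payroll.payment.sum())
```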
> ## A star schema example
praise: that's a helpful example, neatly packaged.
> Creating a sync source from any dataset without additional parameters will copy all rows of all tables and retain the full lineage of the data. Lineage is set to `strict` mode. If you want to copy just a few tables, you can do so by passing a list of tables to the sync source:
>
> ```python
> # copy only the employees and payroll tables
> ```
learning question: what's the nicest way to actually reset the destination to verify the effect of those syncs? After the first one, the offices table is already there, so I am not really seeing a difference in pipeline state.
I've tried `pipeline.run(refresh="drop_data")`, but that doesn't actually affect the data at the destination:
> * `drop_data` - Wipe all data and resource state for all resources being processed. Schema is not modified.

(see line 690 in 74689ac)
So I've just gone to manually deleting the star_pipeline directory in my ~/.dlt/pipelines folder and the .duckdb in the repo...
on syncs - how would I manage separate incremental strategies per table? maybe it's a good idea to show the alternative too
wondering why the python version here is 3.9, although the project level python is >= 3.10 (in general, not with regards to this PR 👀)
> You can now inspect the schema of the warehouse and see that the name column of the aggregated table is also marked with our hint: `uv run dlt pipeline warehouse_pipeline schema`. dlt+ uses the same mechanism to derive precision and scale information for columns.
> dlt+ uses the same mechanism to derive precision and scale information for columns.

It's not clear to me what that means. Are numbers (e.g. in financial applications) affected by precision changes? What does scale mean here?
> You can control the lineage mode with the `lineage_mode` parameter on the dlt.transform decorator, which can be set to `strict`, `best_effort`, or `disabled`. The default is `best_effort`, which will try to trace the lineage for each column and set the derived hints but will not fail if this is not possible. SQL-based transformations that introduce non-traceable columns will require you to supply column type hints for any columns that could not be traced. For Python-based transformations, the column type can be derived from the incoming data. In `strict` mode, any transformation that introduces non-traceable columns will fail with a useful error. This mode is recommended if you must ensure which columns may contain sensitive data and need to prevent them from being unmarked. You can also disable lineage for a specific transformation by setting the `lineage_mode` to `disabled`. In this mode, the full schema will be derived from the incoming data and no additional hints will be added to the resulting dataset.
this section is a bit long and would profit from being turned into some kind of a list or table or the like.
+1
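For reference, a small sketch of how the modes from the quoted paragraph would be selected per transformation (decorator arguments as quoted elsewhere in this diff, imports as in the sketch further up):

```python
# strict: fail with a descriptive error if any output column cannot be traced
@dlt_plus.transform(transformation_type="python", lineage_mode="strict")
def employees_public(dataset: SupportsReadableDataset):
    return dataset.employees.drop("name")

# disabled: derive the schema purely from the incoming data, no extra hints
@dlt_plus.transform(transformation_type="python", lineage_mode="disabled")
def employees_raw(dataset: SupportsReadableDataset):
    return dataset.employees
```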
> - dlt+ is set up according to the [installation guide](installation.md)
> - ibis-framework is installed in your environment
> - you're familiar with the [core concepts of dlt](../../reference/explainers/how-dlt-works.md)
> - you're familiar with our read access based on [ibis expressions](../../general-usage/dataset-access/dataset#modifying-queries-with-ibis-expressions)
question: why 'read' access? That confuses me here... Is that something other than using ibis expressions to access the dataset? Or is it a different way of saying this: "methods like limit and select won't work. Include any filtering or column selection directly in your SQL query."?
After reading the link, I still felt like I didn't check that box.
+1 I'm confused by this - there's a section later as well which refers to this documentation page about all the ibis methods, and I don't feel like the documentation page linked to covers that properly.
general feedback

+ I do like the feature a lot. It's powerful and succinct.
+ The tutorial arrived swiftly and in a neat packaging (just kidding)
+ It was easy to do though, and the transformations interface is pretty clear.
+ It's great that there are many hints about stuff that will be supported/come out in the future (although sometimes it's in a tip, sometimes in the text, sometimes a bulleted list). Especially the star-schema template sounds like a really cool idea!
+ The sql vs python section is pretty clear to me!
+ Doing the homework was fun! It's nice to accomplish that much with so few lines of code.
- Tying into that, I always think the better the code snippets that illustrate individual concepts tie into one another, the better, kind of like hooking into the power of storytelling (if u get what i mean). (Counterexample: syncing with just a few tables makes no visible change, because the table is already there.) It's not obvious how to do that, but I'd also volunteer to do it.
- I am missing a general introductory definition of what transformations are.
- I found no bugs, everything just worked, except when I tried to pipe a dlt_plus.transform into a dlt.transformer.
On homework: I started doing the homework by defining two pipelines for secure and public data with dlt.yml. And then I was confused as to how datasets are defined, so I went to the source code and was able to do it. Then, for some reason the cache was created as a duckdb database even though I seemed to have set the destination to filesystem - but I'm pretty sure it was because of some leftover metadata from previous tries. However, I wasn't 100% sure how to manually clean up everything and it just kept persisting. Then, I couldn't continue with the sync step because I realized the idea (presumably) was just to use python scripts, not necessarily the project yml combined with ci. After this realization, the homework seemed very clear.
Full disclosure: I did not do the homework, only the tutorial 😄
I left a lot of specific comments but some overall feedback:
- I think the fact that the entire example is moving (and modifying ofc) data from one duckdb dataset to another duckdb dataset makes it feel very much like a toy example. So when we publish this to the public maybe an example of actually moving it to a warehouse or filesystem or something would be cool
- I mentioned a few times that I'm not sure whether we're planning on releasing this tutorial to the general public. If we do, then I think we need to rewrite it a bit because there's a few confusing parts and sometimes too much unstructured text. It's not quite as easy to read as the other hackathons we've done. Ofc Solutions Eng is happy to help with that!
- Naming of `@dlt_plus.transform` - I would consider making it `@dlt.transform`, even if it's easily confusable with `@dlt.transformer`.
Also a general question - why would you do this instead of running transformations with dbt? Excuse my ignorance I am not a data engineer 🤠
```python
# adds a custom hint to the name column, we will use this to test lineage later
employees.apply_hints(columns={"name": {"x-pii": True}})  # type: ignore
```
we should explain what this hint does, whether you can just write anything in the hint etc. Either here or in the lineage section later
```python
print(company_dataset.employees.drop("name").arrow())

# create the above select as a transformation in the warehouse dataset
@dlt_plus.transform(transformation_type="python")
```
why do we call this @dlt_plus.transform and not @dlt.transform? I know it's only possible with the dlt+ package but it would be smoother to just call it `@dlt.transform`. Though I do think we're running the risk of confusing people with the open source `@dlt.transformer`.
> What is happening here?
>
> * We have defined one new transformation function with the `dlt.transform` decorator. This function does not yield rows but returns a single object. This can be a SQL query as a string, but preferably it is a ReadableIbisRelation object that can be converted into a SQL query by dlt. You can use all ibis expressions detailed in the dlt oss docs. Note that you do not need to execute the query by running `.arrow()` or `.iter_df()` or something similar on it.
This is the part I'm talking about with the IBIS dlt oss docs - I feel like that page doesn't explain much. Either way there should be a link to a page here.
> ## Lineage
>
> dlt+ transformations contain a lineage engine that can trace the origin of columns resulting from transformations. You may have noticed that we added a custom hint to the name column in the employees table at the beginning of the page. This hint is a custom hint that we decided to add to all columns containing very sensitive data. Ideally, we would like to know which columns in a result are derived from columns containing sensitive data. dlt+ lineage will do just that for you. Let's run an aggregated join query into our warehouse again, but this time we will not drop the name column:
Anuun already mentioned this, and I'm also confused about the lineage. You say here that it shows you the origin of the columns, i.e. where they came from, but then in the example all it seems to show is that the new column also has the same hint? What if multiple columns had that hint, then it wouldn't be clear where it came from right?
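To make the lineage scenario concrete, a minimal sketch of what is being described: the `x-pii` hint set on the source column at the start of the tutorial, and a transformation that keeps the `name` column so the hint should reappear on the warehouse table. The wiring is assumed from the quoted snippets:

```python
# custom hint on the source column (as quoted earlier in the diff)
employees.apply_hints(columns={"name": {"x-pii": True}})

@dlt_plus.transform(transformation_type="python")
def employees_with_names(dataset: SupportsReadableDataset):
    # the name column is kept this time; with lineage enabled, the x-pii hint
    # should be carried over to the resulting column in the warehouse schema
    return dataset.employees
```

Inspecting the warehouse schema afterwards (`uv run dlt pipeline warehouse_pipeline schema`) should then show the hint on the derived column, which is roughly the example output the comment above is asking for.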
> You can now inspect the schema of the warehouse and see that the name column of the aggregated table is also marked with our hint: `uv run dlt pipeline warehouse_pipeline schema`. dlt+ uses the same mechanism to derive precision and scale information for columns.
I think at this stage it would be helpful to show example output of the command so you can explain the lineage better.
+1
> ### Untraceable columns
Not sure I would necessarily include this in a tutorial (if we're planning on publishing it to the public)
> ## SQL vs Python transformations
Tbh I don't fully understand when you would use one over the other apart from speed. Maybe the entire tutorial should contain examples for when one makes sense over the other instead of listing reasons down here (I know there's an example above for an SQL transformation but it's just a brief mention like "you could use that here").
> ## Sync Source
>
> As you have observed above, there are cases where you essentially would like to copy a table from one dataset to another but do not really need to build any SQL expressions, because you do not want to filter or join. Let's say you simply want to copy the full company dataset into the warehouse dataset without any kind of data cleanup - you can very easily do this with the sync source. Under the hood, the sync source is a regular dlt source with a list of `dlt.transform` objects that run this sync.
I think if we publish this as a public tutorial we should from the beginning describe the recommended methods for doing certain things. So use the `sync_source` function from the beginning of the tutorial.
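Purely as illustration of what "use `sync_source` from the beginning" might look like - the function name comes from this thread, but the import location and signature below are guesses, not the actual dlt+ API:

```python
from dlt_plus import sync_source  # assumed import location

# copy the full company dataset into the warehouse; per the quoted paragraph this
# is, under the hood, a regular source containing one transform per table
warehouse_pipeline.run(sync_source(company_pipeline.dataset()))
```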
I also need to disclose that I've only done the tutorial and didn't have time to do the homework 😅 What I liked:
What could be better:
I was not able to run the tutorial, I tried on my local and on codespaces. Error: `AttributeError: module 'dlt_plus' has no attribute 'transform'`
I read the docs and the interface looks good but I can see nothing about inspecting lineage and how that looks.
additional feedback after running
- I also care about non-pii lineage for troubleshooting
> * The resulting table name is derived from the function name but may be set with the `table_name` parameter, like the `dlt.resource`.
>
> * For now, we always need to set a transformation type. This can be `python` or `sql`. `sql` transformations will be executed directly as SQL on the dataset and are only available for transformations where the source_dataset and destination_dataset are the same or on the same physical destination. More on this below.
or on the same physical destination. -> are the same or colocated and accessible from each other.
> * Will also work if not all resulting columns can be traced to origin columns (unless specified otherwise, more on lineage below)
>
> SQL transformations
> * Can only be run if the source and destination dataset are the same or on the same physical destination. Currently there is no proper check, so using SQL in a place where you should not will result in an obscure error
are the same or colocated and accessible from each other.
Description
A first tutorial for our dlt+ Python-based transformations.