
transformations tutorial #2401

Draft · wants to merge 2 commits into base: devel

Conversation

@sh-rp (Collaborator) commented Mar 12, 2025

Description

A first tutorial for our dlt+ Python-based transformations.

Copy link

netlify bot commented Mar 12, 2025

Deploy Preview for dlt-hub-docs ready!

Latest commit: ed5cb92
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/67d813def617900008d166c6
Deploy Preview: https://deploy-preview-2401--dlt-hub-docs.netlify.app

@sh-rp force-pushed the docs/transformations_tutorial branch from 0a3df73 to e323d4a on March 12, 2025 16:08
company_dataset = company_pipeline.dataset()

# print the employee table without the name column as arrow table
print(company_dataset.employees.drop("name").arrow())
Collaborator:

I don't like the arrow table output. df() looks much nicer imo

Suggested change
print(company_dataset.employees.drop("name").arrow())
print(company_dataset.employees.drop("name").df())

image


@dlt_plus.transform(transformation_type="python", write_disposition="replace")
def rows_per_table(dataset: SupportsReadableDataset):
# this is a speical relation on the dataset which computes row_counts
Collaborator:

Suggested change
# this is a speical relation on the dataset which computes row_counts
# this is a special relation on the dataset which computes row_counts

return dataset.row_counts()


# we can group transformnations into a regular source
Collaborator:

Suggested change
# we can group transformnations into a regular source
# we can group transformations into a regular source
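To make the grouping mentioned in the quoted comment concrete, here is a hedged sketch; the `reporting` source name and the exact way transformations are bundled into a `dlt.source` are assumptions, not confirmed API:

```python
import dlt
import dlt_plus


@dlt_plus.transform(transformation_type="python", write_disposition="replace")
def rows_per_table(dataset):
    # row_counts() is the special relation mentioned in the quoted diff
    return dataset.row_counts()


@dlt.source
def reporting(dataset):
    # since transformations behave like resources under the hood,
    # grouping them is assumed to work like grouping regular resources
    return (rows_per_table(dataset),)
```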


## Introductory notes

To work through this tutorial, we will be using multiple local DuckDB instances to transform and sync data between them. In the real world, you will be able to use any destination/dataset that supports ibis expressions. We will also create an example dataset with a couple of tables to work with to demonstrate how you can make use of transformations.
Collaborator:

I think it would be cooler to say it more general first and then specify the instances of the tutorial, something like:

"we will look at how to sync data between two destinations/dataset. In this tutorial we will simply be using two duckdb instances, but in the real world those could be anything that supports ibis expressions (link to list)"


## A first few transformations

Transformations work very similarly to regular dlt.resources, which they are under the hood, and can be grouped into a regular `dlt.source`. Transformations take a populated dataset as input to build transformations from this source and move the computed data into the same or another dataset. If data stays within the same physical destination, you can execute the transformation as a SQL statement and no data needs to be moved to the machine where the dlt pipeline is running. For all other cases, data will be extracted the same way as with a regular dlt pipeline.
Collaborator:

this sentence is a bit circular:

Transformations take a populated dataset as input to build transformations from this source
not sure how to make it better though
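For readers following this thread, a minimal sketch of the flow the quoted paragraph describes, assuming a transformation can be passed to `pipeline.run()` like any resource; the pipeline and function names are illustrative:

```python
import dlt
import dlt_plus


@dlt_plus.transform(transformation_type="python")
def employees_no_name(dataset):
    # drop the sensitive name column before the data leaves the company dataset
    return dataset.employees.drop("name")


company_dataset = dlt.pipeline("company_pipeline", destination="duckdb").dataset()
warehouse_pipeline = dlt.pipeline("warehouse_pipeline", destination="duckdb", dataset_name="warehouse")

# data is extracted like in a regular dlt pipeline because source and destination differ
warehouse_pipeline.run(employees_no_name(company_dataset))
```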


Transformations work very similarly to regular dlt.resources, which they are under the hood, and can be grouped into a regular `dlt.source`. Transformations take a populated dataset as input to build transformations from this source and move the computed data into the same or another dataset. If data stays within the same physical destination, you can execute the transformation as a SQL statement and no data needs to be moved to the machine where the dlt pipeline is running. For all other cases, data will be extracted the same way as with a regular dlt pipeline.

Let's move some data from the company dataset into a fictional warehouse dataset. Let's assume we want the employee master data to be available in the warehouse dataset. We want to make sure not to include employee names in the warehouse dataset for privacy reasons.
Collaborator:

suggestion: a bit more polished:

Suggested change
Let's move some data from the company dataset into a fictional warehouse dataset. Let's assume we want the employee master data to be available in the warehouse dataset. We want to make sure not to include employee names in the warehouse dataset for privacy reasons.
"Let's transfer some data from the company dataset into a fictional warehouse dataset. Let's assume we want the employee master data to be included in the warehouse, but the employee names should be excluded to ensure privacy."```

We have a straightforward company database with live employee master data containing names and payroll payments, as well as two tables for offices and departments. As always, you can inspect the contents of the dataset by running `dlt pipeline company_pipeline show` to open the dlt streamlit app.
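A rough sketch of what such an example dataset could look like in plain dlt; the table contents and column names here are illustrative assumptions, not the tutorial's actual fixtures:

```python
import dlt


@dlt.resource
def employees():
    yield [{"id": 1, "name": "Alice", "office_id": 1, "department_id": 1}]


@dlt.resource
def payroll():
    yield [{"employee_id": 1, "amount": 5000.0, "month": "2025-01"}]


@dlt.resource
def offices():
    yield [{"id": 1, "city": "Berlin"}]


@dlt.resource
def departments():
    yield [{"id": 1, "name": "Engineering"}]


company_pipeline = dlt.pipeline("company_pipeline", destination="duckdb", dataset_name="company")
company_pipeline.run([employees(), payroll(), offices(), departments()])
```

After loading, `dlt pipeline company_pipeline show` opens the streamlit app as described above.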


## A first few transformations
Collaborator:

first few or few first?

### Create aggregated tables in the same dataset

You can run a transformation against the same dataset that the source data comes from. This might be useful if you need some pre-computed stats (similar to a materialized view),
or you need to do some cleanup after a load on the same dataset. Let's use the same transformation as above plus another one in a group. Since the source and destination datasets are the same, we can use SQL transformations, which are much faster.
Collaborator:

I think the concepts are quite clear, but I think it would be even nicer if there was a bit more of a didactic flow from one example to the next (explaining why we redo the payment_per_employee table and highlighting when data is moving between places (or not)).

You're already saying it, but in a more general way only. What I am thinking would be along the lines of this:

"so let's say instead of having the transformation that creates the payment_per_employee stat write from our company dataset directly into the warehouse, let's define it to work on the source dataset directly. That way we can use sql transformations, which are much faster. This might be useful if we need some pre-computed stats..."

# we can supply a column hint for the static_column, which will work for sql transformations
@dlt_plus.transform(transformation_type="sql", lineage_mode="best_effort", column_hints={"static_column": {"type": "double"}})
def mutated_transformation(dataset: SupportsReadableDataset):
return dataset.employees.mutate(static_column=1234)
Collaborator:

this one fails for me

image

Collaborator (Author):

hm

Collaborator (Author):

should be `columns=`
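Based on the author's note, a corrected sketch of the quoted snippet with `columns=` in place of `column_hints=`; everything else is kept as quoted:

```python
import dlt_plus


# we can supply a column hint for the static_column, which will work for sql transformations
@dlt_plus.transform(transformation_type="sql", lineage_mode="best_effort", columns={"static_column": {"type": "double"}})
def mutated_transformation(dataset):
    return dataset.employees.mutate(static_column=1234)
```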

:::


## A star schema example
Collaborator:

praise: that's a helpful example, neatly packaged.

Creating a sync source from any dataset without additional parameters will copy all rows of all tables and retain the full lineage of the data. Lineage is set to `strict` mode. If you want to copy just a few tables, you can do so by passing a list of tables to the sync source:

```python
# copy only the employees and payroll tables
Collaborator:

learning question: what's the nicest way to actually reset the destination to verify the effect of those syncs? After the first one, the offices table is already there, so I am not really seeing a difference in pipeline state.
I've tried `pipeline.run(refresh="drop_data")`, but that doesn't actually affect the data at the destination (see here):

* `drop_data` - Wipe all data and resource state for all resources being processed. Schema is not modified.

so I've just gone to manually deleting the star_pipeline directory in my ~/.dlt/pipelines folder and the .duckdb in the repo...

Contributor:

on syncs - how would I manage separate incremental strategies per table? maybe it's a good idea to show the alternative too
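To ground the quoted sync-source paragraph, a hedged sketch; the import path and the `tables` parameter name are assumptions rather than confirmed API:

```python
import dlt
from dlt_plus.transformations import sync_source  # hypothetical import path

company_dataset = dlt.pipeline("company_pipeline", destination="duckdb").dataset()
warehouse_pipeline = dlt.pipeline("warehouse_pipeline", destination="duckdb", dataset_name="warehouse")

# copy all rows of all tables, retaining full lineage (strict mode, per the quoted text)
warehouse_pipeline.run(sync_source(company_dataset))

# copy only the employees and payroll tables (parameter name assumed)
warehouse_pipeline.run(sync_source(company_dataset, tables=["employees", "payroll"]))
```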

@anuunchin (Contributor) commented Mar 17, 2025

wondering why the python version here is 3.9, although the project level python is >= 3.10 (in general, not with regards to this PR 👀)

[tool.mypy]
python_version="3.9"


```

You can now inspect the schema of the warehouse and see that the name column of the aggregated table is also marked with our hint: `uv run dlt pipeline warehouse_pipeline schema`. dlt+ uses the same mechanism to derive precision and scale information for columns.
Collaborator:

. dlt+ uses the same mechanism to derive precision and scale information for columns.

it's not clear to me what that means? Are numbers (e.g. in financial applications) affected by precision changes? What does scale mean here?


You can now inspect the schema of the warehouse and see that the name column of the aggregated table is also marked with our hint: `uv run dlt pipeline warehouse_pipeline schema`. dlt+ uses the same mechanism to derive precision and scale information for columns.

You can control the lineage mode with the `lineage_mode` parameter on the dlt.transform decorator, which can be set to `strict`, `best_effort`, or `disabled`. The default is `best_effort`, which will try to trace the lineage for each column and set the derived hints but will not fail if this is not possible. SQL-based transformations that introduce non-traceable columns will require you to supply column type hints for any columns that could not be traced. For Python-based transformations, the column type can be derived from the incoming data. In `strict` mode, any transformation that introduces non-traceable columns will fail with a useful error. This mode is recommended if you must ensure which columns may contain sensitive data and need to prevent them from being unmarked. You can also disable lineage for a specific transformation by setting the `lineage_mode` to `disabled`. In this mode, the full schema will be derived from the incoming data and no additional hints will be added to the resulting dataset.
Collaborator:

this section is a bit long and would profit from being turned into some kind of a list or table or the like.

Contributor:

+1
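Since the thread asks for a more scannable form, a small sketch showing the three `lineage_mode` values from the quoted paragraph on the decorator; the function bodies are illustrative only:

```python
import dlt_plus


@dlt_plus.transform(transformation_type="sql", lineage_mode="strict")
def strict_example(dataset):
    # strict: fails with an error if any resulting column cannot be traced
    return dataset.employees.drop("name")


@dlt_plus.transform(transformation_type="sql", lineage_mode="best_effort")
def best_effort_example(dataset):
    # best_effort (default): traces what it can; untraceable columns need type hints for sql
    return dataset.employees.drop("name")


@dlt_plus.transform(transformation_type="python", lineage_mode="disabled")
def disabled_example(dataset):
    # disabled: no lineage, the full schema is derived from the incoming data
    return dataset.employees.drop("name")
```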

- dlt+ is set up according to the [installation guide](installation.md)
- ibis-framework is installed in your environment
- you're familiar with the [core concepts of dlt](../../reference/explainers/how-dlt-works.md)
- you're familiar with our read access based on [ibis expressions](../../general-usage/dataset-access/dataset#modifying-queries-with-ibis-expressions)
Collaborator:

question: why 'read' access? That confuses me here... Is that something other than using ibis expressions to access the dataset? Or is it a different way of saying "methods like limit and select won't work. Include any filtering or column selection directly in your SQL query."?
After reading the link, I still felt like I didn't check that box.

Contributor:

+1 I'm confused by this - there's a section later as well which refers to this documentation page about all the ibis methods, and I don't feel like the documentation page linked to covers that properly.

@djudjuu (Collaborator) commented Mar 18, 2025

general feedback

+I do like the feature a lot. It's powerful and succinct. sync_source() feels like magic! I never want to live without it again.

+the tutorial arrived swiftly and in a neat packaging (just kidding)

+It was easy to do though, and the transformations interface is pretty clear.

+It's great that there are many hints about stuff that will be supported/come out in the future (although sometimes it's in a tip, sometimes in the text, sometimes in a bulleted list). Especially the star-schema template sounds like a really cool idea!

+the sql vs python section is pretty clear to me!

+Doing the homework was fun! It's nice to accomplish that much with so few lines of code.

-Tying into that, I always think the better the code snippets that illustrate individual concepts tie into one another, the better, kind of like hooking into the power of storytelling (if you get what I mean). (Counterexample: syncing with just a few tables makes no visible change, because the table is already there.) It's not obvious how to do that, but I'd also volunteer to do it.

-I am missing a general introductory definition of what transformations are (my attempt: operations that modify, enrich or restructure raw data using sql or python expressions, that act as close as possible to your data and are thus very fast; wrapped in dlt_plus.transform they do...). Quite possibly it's obvious to dlt users, but I still think that being more self-contained knowledge-wise (vs. requiring much implicit knowledge) is a good value, potentially for newcomers (if we aim to be the de facto standard) and possibly as well for LLMs.
One easy sentence would be to mention where they go beyond what @dlt.transformers can define.

-I found no bugs, everything just worked, except when I tried to pipe a dlt_plus.transform into a dlt.transformer (when I read that dlt_plus.transform with python is a resource under the hood I just had to try it).
-> I tried to pipe a dlt_plus.transform into a dlt.transformer but failed... maybe I did something wrong, here is a gist with it: https://gist.github.com/djudjuu/428fb5567d3846d5692213ad8ff665a4

@anuunchin (Contributor) commented Mar 19, 2025

  1. Is the way the transformations are defined clear?
    • Yes, it is super clear.
  2. Is it clear what lineage does?
    • I thought this was about from where a specific column stems, not which hints it inherits. Therefore, I was a bit confused.
  3. Is it clear how SQL and Python transformations differ, also from a technical point of view?
    • Yes, it is clear as well.
  4. Is it clear what the sync source does?
    • Yes.
  5. Are there any obvious features missing?
    • I don't really have a meaningful input on this.

On homework:

I started doing the homework by defining two pipelines for secure and public data with dlt.yml. And then I was confused as to how datasets are defined, so I went to the source code and was able to do it. Then, for some reason the cache was created as a duckdb database even though I seemed to have set the destination to filesystem - but I'm pretty sure it was because of some leftover metadata from previous tries. However, I wasn't 100% sure how to manually clean up everything and it just kept persisting. Then, I couldn't continue with the sync step because I realized the idea (presumably) was just to use python scripts, not necessarily the project yml combined with ci. After this realization, the homework seemed very clear.

Might be a good idea to add very explicit examples for the whole dlt+ docs - but then again, I'm not used to yml projects, which explains my obliviousness.

Thought: In the docs, we should avoid defining the same thing for value and key. Say in dlt.yml, we have:

duckdb: duckdb

it's a bit confusing.

@akelad (Contributor) left a comment

Full disclosure: I did not do the homework, only the tutorial 😄

I left a lot of specific comments but some overall feedback:

  • I think the fact that the entire example is moving (and modifying ofc) data from one duckdb dataset to another duckdb dataset makes it feel very much like a toy example. So when we publish this to the public maybe an example of actually moving it to a warehouse or filesystem or something would be cool
  • I mentioned a few times that I'm not sure whether we're planning on releasing this tutorial to the general public. If we do, then I think we need to rewrite it a bit because there's a few confusing parts and sometimes too much unstructured text. It's not quite as easy to read as the other hackathons we've done. Ofc Solutions Eng is happy to help with that!
  • Naming of @dlt_plus.transform - I would consider making it @dlt.transform even if it's easily confusable with @dlt.transformer.

Also a general question - why would you do this instead of running transformations with dbt? Excuse my ignorance I am not a data engineer 🤠

- dlt+ is set up according to the [installation guide](installation.md)
- ibis-framework is installed in your environment
- you're familiar with the [core concepts of dlt](../../reference/explainers/how-dlt-works.md)
- you're familiar with our read access based on [ibis expressions](../../general-usage/dataset-access/dataset#modifying-queries-with-ibis-expressions)
Contributor:

+1 I'm confused by this - there's a section later as well which refers to this documentation page about all the ibis methods, and I don't feel like the documentation page linked to covers that properly.

Comment on lines +78 to +79
# adds a custom hint to the name column, we will use this to test lineage later
employees.apply_hints(columns={"name": {"x-pii": True}}) # type: ignore
Contributor:

we should explain what this hint does, whether you can just write anything in the hint etc. Either here or in the lineage section later
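A small sketch for this thread of how the custom hint could be read back after loading, assuming the `x-pii` key is carried along in the column schema; the `company_pipeline` name is taken from the tutorial:

```python
import dlt

company_pipeline = dlt.pipeline("company_pipeline", destination="duckdb", dataset_name="company")

# assumed layout: the custom "x-..." key is stored next to the regular column hints
name_column = company_pipeline.default_schema.tables["employees"]["columns"]["name"]
print(name_column.get("x-pii"))  # -> True if the hint was applied
```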

print(company_dataset.employees.drop("name").arrow())

# create the above select as a transformation in the warehouse dataset
@dlt_plus.transform(transformation_type="python")
Contributor:

why do we call this @dlt_plus.transform and not @dlt.transform? I know it's only possible with the dlt+ package but it would be smoother to just call it @dlt.transform. Though I do think we're running the risk of confusing people with the open source @dlt.transformer.


What is happening here?

* We have defined one new transformation function with the `dlt.transform` decorator. This function does not yield rows but returns a single object. This can be a SQL query as a string, but preferably it is a ReadableIbisRelation object that can be converted into a SQL query by dlt. You can use all ibis expressions detailed in the dlt oss docs. Note that you do not need to execute the query by running `.arrow()` or `.iter_df()` or something similar on it.
Contributor:

This is the part I'm talking about with the IBIS dlt oss docs - I feel like that page doesn't explain much. Either way there should be a link to a page here.
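To illustrate the quoted bullet, a hedged sketch that returns the relation itself rather than executing it; the join and column names are assumptions:

```python
import dlt_plus


@dlt_plus.transform(transformation_type="python")
def employees_with_offices(dataset):
    employees = dataset.employees
    offices = dataset.offices
    # build an ibis-style expression; dlt converts or extracts it for you,
    # so there is no .arrow(), .df() or .iter_df() call here
    return employees.join(offices, employees.office_id == offices.id)
```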


## Lineage

dlt+ transformations contain a lineage engine that can trace the origin of columns resulting from transformations. You may have noticed that we added a custom hint to the name column in the employees table at the beginning of the page. This hint is a custom hint that we decided to add to all columns containing very sensitive data. Ideally, we would like to know which columns in a result are derived from columns containing sensitive data. dlt+ lineage will do just that for you. Let's run an aggregated join query into our warehouse again, but this time we will not drop the name column:
Contributor:

Anuun already mentioned this, and I'm also confused about the lineage. You say here that it shows you the origin of the columns, i.e. where they came from, but then in the example all it seems to show is that the new column also has the same hint? What if multiple columns had that hint, then it wouldn't be clear where it came from right?


```

You can now inspect the schema of the warehouse and see that the name column of the aggregated table is also marked with our hint: `uv run dlt pipeline warehouse_pipeline schema`. dlt+ uses the same mechanism to derive precision and scale information for columns.
Contributor:

I think at this stage it would be helpful to show example output of the command so you can explain the lineage better.


You can now inspect the schema of the warehouse and see that the name column of the aggregated table is also marked with our hint: `uv run dlt pipeline warehouse_pipeline schema`. dlt+ uses the same mechanism to derive precision and scale information for columns.

You can control the lineage mode with the `lineage_mode` parameter on the dlt.transform decorator, which can be set to `strict`, `best_effort`, or `disabled`. The default is `best_effort`, which will try to trace the lineage for each column and set the derived hints but will not fail if this is not possible. SQL-based transformations that introduce non-traceable columns will require you to supply column type hints for any columns that could not be traced. For Python-based transformations, the column type can be derived from the incoming data. In `strict` mode, any transformation that introduces non-traceable columns will fail with a useful error. This mode is recommended if you must ensure which columns may contain sensitive data and need to prevent them from being unmarked. You can also disable lineage for a specific transformation by setting the `lineage_mode` to `disabled`. In this mode, the full schema will be derived from the incoming data and no additional hints will be added to the resulting dataset.
Contributor:

+1


You can control the lineage mode with the `lineage_mode` parameter on the dlt.transform decorator, which can be set to `strict`, `best_effort`, or `disabled`. The default is `best_effort`, which will try to trace the lineage for each column and set the derived hints but will not fail if this is not possible. SQL-based transformations that introduce non-traceable columns will require you to supply column type hints for any columns that could not be traced. For Python-based transformations, the column type can be derived from the incoming data. In `strict` mode, any transformation that introduces non-traceable columns will fail with a useful error. This mode is recommended if you must ensure which columns may contain sensitive data and need to prevent them from being unmarked. You can also disable lineage for a specific transformation by setting the `lineage_mode` to `disabled`. In this mode, the full schema will be derived from the incoming data and no additional hints will be added to the resulting dataset.

### Untraceable columns
Contributor:

Not sure I would necessarily include this in a tutorial (if we're planning on publishing it to the public)


```

## SQL vs Python transformations
Contributor:

Tbh I don't fully understand when you would use one over the other apart from speed. Maybe the entire tutorial should contain examples for when one makes sense over the other instead of listing reasons down here (I know there's an example above for an SQL transformation but it's just a brief mention like "you could use that here").
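A hedged sketch condensing the quoted rules into the two decorator variants; the `row_counts()` relation is taken from the earlier snippet in this PR:

```python
import dlt_plus


# source and destination dataset are the same (or colocated): "sql" runs the
# statement directly inside the destination, so no data is moved
@dlt_plus.transform(transformation_type="sql")
def row_counts_in_place(dataset):
    return dataset.row_counts()


# destination is elsewhere: "python" extracts the data like a regular pipeline
# and can derive column types from the incoming data
@dlt_plus.transform(transformation_type="python")
def row_counts_moved(dataset):
    return dataset.row_counts()
```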

Comment on lines +385 to +386
## Sync Source
As you have observed above, there are cases where you essentially would like to copy a table from one dataset to another but do not really need to build any SQL expressions, because you do not want to filter or join. Let's say you simply want to copy the full company dataset into the warehouse dataset without any kind of data cleanup - you can very easily do this with the sync source. Under the hood, the sync source is a regular dlt source with a list of `dlt.transform` objects that run this sync.
Contributor:

I think if we publish this as a public tutorial we should from the beginning describe the recommended methods for doing certain things. So use the sync_source function from the beginning of the tutorial

@rahuljo (Contributor) commented Mar 20, 2025

I also need to disclose that I've only done the tutorial and didn't have time to do the homework 😅

What I liked:

  1. I think the employees example is a very good choice for this tutorial, especially the star schema example and the lineage bit.
  2. The sync source is really cool! Very often I first load my data locally into duckdb and then to load it elsewhere I need to re-run the same code again, so the sync_source way of doing things would be pretty handy. Is there a reason why it's under dlt_plus.transform and not dlt_plus.sources?

What could be better:

  1. I was able to follow the tutorial without problems, but I did find myself constantly asking what the advantage of doing this is over any of the other multiple ways to transform data in dlt (dbt, sql_client). Is it because using @dlt_plus.transform lets you do things like specify write dispositions and have your transformations run on cache? I would re-phrase the tutorial so that it doesn't just focus on how to do the transformation but also on the advantage of doing it this way.
  2. +1 on Akela's point of adding output of the code snippets in the documentation to make it more readable. Especially in the star schema example, a visualization of what the output looks like would make it a lot more understandable.

@adrianbr (Contributor) left a comment

I was not able to run the tutorial; I tried on my local machine and on Codespaces. Error: `AttributeError: module 'dlt_plus' has no attribute 'transform'`

I read the docs and the interface looks good, but I can see nothing about inspecting lineage and how that looks.

@adrianbr (Contributor) left a comment

additional feedback after running

  • I also care about non-pii lineage for troubleshooting


* The resulting table name is derived from the function name but may be set with the `table_name` parameter, like the `dlt.resource`.

* For now, we always need to set a transformation type. This can be `python` or `sql`. `sql` transformations will be executed directly as SQL on the dataset and are only available for transformations where the source_dataset and destination_dataset are the same or on the same physical destination. More on this below.
Contributor:

or on the same physical destination. -> are the same or colocated and accessible from each other.

* Will also work if not all resulting columns can be traced to origin columns (unless specified otherwise, more on lineage below)

SQL transformations
* Can only be run if the source and destination dataset are the same or on the same physical destination. Currently there is no proper check, so using SQL in a place where you should not will result in an obscure error
Contributor:

are the same or colocated and accessible from each other.

Creating a sync source from any dataset without additional parameters will copy all rows of all tables and retain the full lineage of the data. Lineage is set to `strict` mode. If you want to copy just a few tables, you can do so by passing a list of tables to the sync source:

```python
# copy only the employees and payroll tables
Contributor:

on syncs - how would I manage separate incremental strategies per table? maybe it's a good idea to show the alternative too
