transformations tutorial #2401
base: devel
Conversation
```python
company_dataset = company_pipeline.dataset()

# print the employee table without the name column as arrow table
print(company_dataset.employees.drop("name").arrow())
```
@dlt_plus.transform(transformation_type="python", write_disposition="replace") | ||
def rows_per_table(dataset: SupportsReadableDataset): | ||
# this is a speical relation on the dataset which computes row_counts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# this is a speical relation on the dataset which computes row_counts | |
# this is a special relation on the dataset which computes row_counts |
return dataset.row_counts() | ||
|
||
|
||
```python
# we can group transformnations into a regular source
```

Suggested change (typo): `# we can group transformnations into a regular source` → `# we can group transformations into a regular source`
> ## Introductory notes
>
> To work through this tutorial, we will be using multiple local DuckDB instances to transform and sync data between them. In the real world, you will be able to use any destination/dataset that supports ibis expressions. We will also create an example dataset with a couple of tables to work with to demonstrate how you can make use of transformations.
I think it would be cooler to say it more generally first and then specify the instances used in the tutorial, something like:
"we will look at how to sync data between two destinations/datasets. In this tutorial we will simply be using two duckdb instances, but in the real world those could be anything that supports ibis expressions (link to list)"
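For reference, a minimal sketch of the setup that intro describes: two local DuckDB pipelines plus a tiny example table (the table contents below are illustrative, not the tutorial's actual fixture data):

```python
import dlt

# source side: a small "company" dataset backed by a local DuckDB file
company_pipeline = dlt.pipeline(
    pipeline_name="company_pipeline",
    destination="duckdb",
    dataset_name="company_dataset",
)
company_pipeline.run(
    [
        {"id": 1, "name": "Ana", "department_id": 10},
        {"id": 2, "name": "Bob", "department_id": 20},
    ],
    table_name="employees",
)

# target side: a second, independent DuckDB instance acting as the "warehouse"
warehouse_pipeline = dlt.pipeline(
    pipeline_name="warehouse_pipeline",
    destination="duckdb",
    dataset_name="warehouse_dataset",
)
```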
> ## A first few transformations
>
> Transformations work very similarly to regular dlt.resources, which they are under the hood, and can be grouped into a regular `dlt.source`. Transformations take a populated dataset as input to build transformations from this source and move the computed data into the same or another dataset. If data stays within the same physical destination, you can execute the transformation as a SQL statement and no data needs to be moved to the machine where the dlt pipeline is running. For all other cases, data will be extracted the same way as with a regular dlt pipeline.
this sentence is a bit circular: "Transformations take a populated dataset as input to build transformations from this source" - not sure how to make it better though
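For what it's worth, a rough sketch of what the quoted paragraph seems to describe: a transformation defined like a resource and grouped into a regular `dlt.source`. The import path of `SupportsReadableDataset` and the exact way the transform is wired into the source are assumptions on my side:

```python
import dlt
import dlt_plus
from dlt.common.destination.dataset import SupportsReadableDataset  # import path assumed

@dlt_plus.transform(transformation_type="python", write_disposition="replace")
def rows_per_table(dataset: SupportsReadableDataset):
    # returns a relation; dlt renders/executes it and loads the result as a table
    return dataset.row_counts()

@dlt.source
def company_stats(dataset: SupportsReadableDataset):
    # group one or more transformations into a regular dlt source
    yield rows_per_table(dataset)
```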
> Let's move some data from the company dataset into a fictional warehouse dataset. Let's assume we want the employee master data to be available in the warehouse dataset. We want to make sure not to include employee names in the warehouse dataset for privacy reasons.
suggestion: a bit more polished:
"Let's transfer some data from the company dataset into a fictional warehouse dataset. Let's assume we want the employee master data to be included in the warehouse, but the employee names should be excluded to ensure privacy."
> We have a straightforward company database with live employee master data containing names and payroll payments, as well as two tables for offices and departments. As always, you can inspect the contents of the dataset by running `dlt pipeline company_pipeline show` to open the dlt streamlit app.
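Putting the quoted pieces together, a sketch of the employees-without-names transformation described above, reusing the `drop("name")` relation and the python transformation decorator already quoted in this PR; how the result is loaded into the warehouse pipeline is my assumption:

```python
@dlt_plus.transform(transformation_type="python")
def employees(dataset: SupportsReadableDataset):
    # exclude the sensitive name column before anything leaves the company dataset
    return dataset.employees.drop("name")

# presumably run against the warehouse pipeline with the company dataset as input,
# e.g. something along the lines of:
# warehouse_pipeline.run(employees(company_pipeline.dataset()))
```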
> ## A first few transformations
first few or few first?
> ### Create aggregated tables in the same dataset
>
> You can run a transformation against the same dataset that the source data comes from. This might be useful if you need some pre-computed stats (similar to a materialized view), or you need to do some cleanup after a load on the same dataset. Let's use the same transformation as above plus another one in a group. Since the source and destination datasets are the same, we can use SQL transformations, which are much faster.
I think the concepts are quite clear, but I think it would be even nicer if there was a bit more of a didactic flow from one example to the next (explaining why we redo the payment_per_employee table and highlighting when data is moving between places, or not).
You're already saying it, but only in a more general way. What I am thinking would be along the lines of this:
"so let's say instead of having the transformation that creates the payment_per_employee stat write from our company dataset directly into the warehouse, let's define it to work on the source dataset directly. That way we can use sql transformations which are much faster. This might be useful if we need some pre-computed stats..."
```python
# we can supply a column hint for the static_column, which will work for sql transformations
@dlt_plus.transform(transformation_type="sql", lineage_mode="best_effort", column_hints={"static_column": {"type": "double"}})
def mutated_transformation(dataset: SupportsReadableDataset):
    return dataset.employees.mutate(static_column=1234)
```
hm
should be columns=
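Picking up the same-dataset SQL idea from the section quoted above, a sketch of an aggregation defined as a `sql` transformation so it executes inside the company dataset's database without moving data. The table and column names `payroll`, `employee_id` and `payment` are made up for illustration, and imports are as in the sketch further up:

```python
@dlt_plus.transform(transformation_type="sql", write_disposition="replace")
def payments_per_employee(dataset: SupportsReadableDataset):
    payroll = dataset.payroll
    # an ibis-style aggregation; because source and destination datasets are the
    # same, this can be rendered to SQL and run in place
    return payroll.group_by("employee_id").aggregate(total_payment=payroll.payment.sum())
```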
> ## A star schema example
praise: that's a helpful example, neatly packaged.
> Creating a sync source from any dataset without additional parameters will copy all rows of all tables and retain the full lineage of the data. Lineage is set to `strict` mode. If you want to copy just a few tables, you can do so by passing a list of tables to the sync source:
>
> ```python
> # copy only the employees and payroll tables
> ```
learning question: what's the nicest way to actually reset the destination to verify the effect of those syncs? After the first one, the offices table is already there, so I am not really seeing a difference in pipeline state.
I've tried `pipeline.run(refresh="drop_data")`, but that doesn't actually affect the data at the destination:
> * `drop_data` - Wipe all data and resource state for all resources being processed. Schema is not modified.

(see line 690 in 74689ac)
So I've just gone to manually deleting the star_pipeline directory in my ~/.dlt/pipelines folder and the .duckdb in the repo...
on syncs - how would I manage separate incremental strategies per table? maybe it's a good idea to show the alternative too
wondering why the python version here is 3.9, although the project level python is >= 3.10 (in general, not with regards to this PR 👀)
> You can now inspect the schema of the warehouse and see that the name column of the aggregated table is also marked with our hint: `uv run dlt pipeline warehouse_pipeline schema`. dlt+ uses the same mechanism to derive precision and scale information for columns.
> dlt+ uses the same mechanism to derive precision and scale information for columns.

It's not clear to me what that means. Are numbers (e.g. in financial applications) affected by precision changes? What does scale mean here?
> You can control the lineage mode with the `lineage_mode` parameter on the dlt.transform decorator, which can be set to `strict`, `best_effort`, or `disabled`. The default is `best_effort`, which will try to trace the lineage for each column and set the derived hints but will not fail if this is not possible. SQL-based transformations that introduce non-traceable columns will require you to supply column type hints for any columns that could not be traced. For Python-based transformations, the column type can be derived from the incoming data. In `strict` mode, any transformation that introduces non-traceable columns will fail with a useful error. This mode is recommended if you must ensure which columns may contain sensitive data and need to prevent them from being unmarked. You can also disable lineage for a specific transformation by setting the `lineage_mode` to `disabled`. In this mode, the full schema will be derived from the incoming data and no additional hints will be added to the resulting dataset.
this section is a bit long and would profit from being turned into some kind of a list or table or the like.
+1
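For reference, a small sketch of how the modes from the quoted paragraph would be selected per transformation (decorator arguments as quoted elsewhere in this diff, imports as in the sketch further up):

```python
# strict: fail with a descriptive error if any output column cannot be traced
@dlt_plus.transform(transformation_type="python", lineage_mode="strict")
def employees_public(dataset: SupportsReadableDataset):
    return dataset.employees.drop("name")

# disabled: derive the schema purely from the incoming data, no extra hints
@dlt_plus.transform(transformation_type="python", lineage_mode="disabled")
def employees_raw(dataset: SupportsReadableDataset):
    return dataset.employees
```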
> - dlt+ is set up according to the [installation guide](installation.md)
> - ibis-framework is installed in your environment
> - you're familiar with the [core concepts of dlt](../../reference/explainers/how-dlt-works.md)
> - you're familiar with our read access based on [ibis expressions](../../general-usage/dataset-access/dataset#modifying-queries-with-ibis-expressions)
question: why 'read' access? That confuses me here... Is that something other than using ibis expressions to access the dataset? Or is it a different way of saying this: "methods like limit and select won't work. Include any filtering or column selection directly in your SQL query."?
After reading the link, I still felt like I didn't check that box.
+1 I'm confused by this - there's a section later as well which refers to this documentation page about all the ibis methods, and I don't feel like the documentation page linked to covers that properly.
general feedback

+ I do like the feature a lot. It's powerful and succinct.
+ The tutorial arrived swiftly and in a neat packaging (just kidding)
+ It was easy to do though, and the transformations interface is pretty clear.
+ It's great that there are many hints about stuff that will be supported/come out in the future (although sometimes it's in a tip, sometimes in the text, sometimes a bulleted list). Especially the star-schema template sounds like a really cool idea!
+ The sql vs python section is pretty clear to me!
+ Doing the homework was fun! It's nice to accomplish that much with so few lines of code.
- Tying into that, I always think the better the code snippets that illustrate individual concepts tie into one another, the better, kind of like hooking into the power of storytelling (if u get what i mean). (Counterexample: syncing with just a few tables makes no visible change, because the table is already there.) It's not obvious how to do that, but I'd also volunteer to do it.
- I am missing a general introductory definition of what transformations are.
- I found no bugs, everything just worked, except when I tried to pipe a dlt_plus.transform into a dlt.transformer.
On homework: I started doing the homework by defining two pipelines for secure and public data with dlt.yml. And then I was confused as to how datasets are defined, so I went to the source code and was able to do it. Then, for some reason the cache was created as a duckdb database even though I seemed to have set the destination to filesystem - but I'm pretty sure it was because of some leftover metadata from previous tries. However, I wasn't 100% sure how to manually clean up everything and it just kept persisting. Then, I couldn't continue with the sync step because I realized the idea (presumably) was just to use python scripts, not necessarily the project yml combined with ci. After this realization, the homework seemed very clear.
Full disclosure: I did not do the homework, only the tutorial 😄
I left a lot of specific comments but some overall feedback:
- I think the fact that the entire example is moving (and modifying ofc) data from one duckdb dataset to another duckdb dataset makes it feel very much like a toy example. So when we publish this to the public maybe an example of actually moving it to a warehouse or filesystem or something would be cool
- I mentioned a few times that I'm not sure whether we're planning on releasing this tutorial to the general public. If we do, then I think we need to rewrite it a bit because there's a few confusing parts and sometimes too much unstructured text. It's not quite as easy to read as the other hackathons we've done. Ofc Solutions Eng is happy to help with that!
- Naming of `@dlt_plus.transform` - I would consider making it `@dlt.transform`, even if it's easily confusable with `@dlt.transformer`.
Also a general question - why would you do this instead of running transformations with dbt? Excuse my ignorance I am not a data engineer 🤠
```python
# adds a custom hint to the name column, we will use this to test lineage later
employees.apply_hints(columns={"name": {"x-pii": True}})  # type: ignore
```
we should explain what this hint does, whether you can just write anything in the hint etc. Either here or in the lineage section later
```python
print(company_dataset.employees.drop("name").arrow())

# create the above select as a transformation in the warehouse dataset
@dlt_plus.transform(transformation_type="python")
```
why do we call this @dlt_plus.transform and not @dlt.transform? I know it's only possible with the dlt+ package but it would be smoother to just call it `@dlt.transform`. Though I do think we're running the risk of confusing people with the open source `@dlt.transformer`.
> What is happening here?
>
> * We have defined one new transformation function with the `dlt.transform` decorator. This function does not yield rows but returns a single object. This can be a SQL query as a string, but preferably it is a ReadableIbisRelation object that can be converted into a SQL query by dlt. You can use all ibis expressions detailed in the dlt oss docs. Note that you do not need to execute the query by running `.arrow()` or `.iter_df()` or something similar on it.
This is the part I'm talking about with the IBIS dlt oss docs - I feel like that page doesn't explain much. Either way there should be a link to a page here.
> ## Lineage
>
> dlt+ transformations contain a lineage engine that can trace the origin of columns resulting from transformations. You may have noticed that we added a custom hint to the name column in the employees table at the beginning of the page. This hint is a custom hint that we decided to add to all columns containing very sensitive data. Ideally, we would like to know which columns in a result are derived from columns containing sensitive data. dlt+ lineage will do just that for you. Let's run an aggregated join query into our warehouse again, but this time we will not drop the name column:
Anuun already mentioned this, and I'm also confused about the lineage. You say here that it shows you the origin of the columns, i.e. where they came from, but then in the example all it seems to show is that the new column also has the same hint? What if multiple columns had that hint, then it wouldn't be clear where it came from right?
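To make the lineage scenario concrete, a minimal sketch of what is being described: the `x-pii` hint set on the source column at the start of the tutorial, and a transformation that keeps the `name` column so the hint should reappear on the warehouse table. The wiring is assumed from the quoted snippets:

```python
# custom hint on the source column (as quoted earlier in the diff)
employees.apply_hints(columns={"name": {"x-pii": True}})

@dlt_plus.transform(transformation_type="python")
def employees_with_names(dataset: SupportsReadableDataset):
    # the name column is kept this time; with lineage enabled, the x-pii hint
    # should be carried over to the resulting column in the warehouse schema
    return dataset.employees
```

Inspecting the warehouse schema afterwards (`uv run dlt pipeline warehouse_pipeline schema`) should then show the hint on the derived column, which is roughly the example output the comment above is asking for.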
> You can now inspect the schema of the warehouse and see that the name column of the aggregated table is also marked with our hint: `uv run dlt pipeline warehouse_pipeline schema`. dlt+ uses the same mechanism to derive precision and scale information for columns.
I think at this stage it would be helpful to show example output of the command so you can explain the lineage better.
+1
> ### Untraceable columns
Not sure I would necessarily include this in a tutorial (if we're planning on publishing it to the public)
> ## SQL vs Python transformations
Tbh I don't fully understand when you would use one over the other apart from speed. Maybe the entire tutorial should contain examples for when one makes sense over the other instead of listing reasons down here (I know there's an example above for an SQL transformation but it's just a brief mention like "you could use that here").
> ## Sync Source
>
> As you have observed above, there are cases where you essentially would like to copy a table from one dataset to another but do not really need to build any SQL expressions, because you do not want to filter or join. Let's say you simply want to copy the full company dataset into the warehouse dataset without any kind of data cleanup - you can very easily do this with the sync source. Under the hood, the sync source is a regular dlt source with a list of `dlt.transform` objects that run this sync.
I think if we publish this as a public tutorial we should from the beginning describe the recommended methods for doing certain things. So use the `sync_source` function from the beginning of the tutorial.
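Purely as illustration of what "use `sync_source` from the beginning" might look like - the function name comes from this thread, but the import location and signature below are guesses, not the actual dlt+ API:

```python
from dlt_plus import sync_source  # assumed import location

# copy the full company dataset into the warehouse; per the quoted paragraph this
# is, under the hood, a regular source containing one transform per table
warehouse_pipeline.run(sync_source(company_pipeline.dataset()))
```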
I also need to disclose that I've only done the tutorial and didn't have time to do the homework 😅 What I liked:
What could be better:
I was not able to run the tutorial, I tried on my local and on codespaces. Error: `AttributeError: module 'dlt_plus' has no attribute 'transform'`
I read the docs and the interface looks good but I can see nothing about inspecting lineage and how that looks.
additional feedback after running
- I also care about non-pii lineage for troubleshooting
> * The resulting table name is derived from the function name but may be set with the `table_name` parameter, like the `dlt.resource`.
>
> * For now, we always need to set a transformation type. This can be `python` or `sql`. `sql` transformations will be executed directly as SQL on the dataset and are only available for transformations where the source_dataset and destination_dataset are the same or on the same physical destination. More on this below.
or on the same physical destination. -> are the same or colocated and accessible from each other.
> * Will also work if not all resulting columns can be traced to origin columns (unless specified otherwise, more on lineage below)
>
> SQL transformations
> * Can only be run if the source and destination dataset are the same or on the same physical destination. Currently there is no proper check, so using SQL in a place where you should not will result in an obscure error
are the same or colocated and accessible from each other.
Description
A first tutorial for our dlt+ Python-based transformations.