How to run frequent ETLs with Dagster #25632
Replies: 4 comments
-
It's quite surprising to require such detailed partitioning. I don't have such detailed partitioning. I record the partition key and materialization time in a database and calculate the freshness of a partition asset by comparing the time interval between the last materialization time and the current time, with some weight calculations. Then, I use a sensor to query SQL and re-materialize old partitions that do not meet the freshness requirements. |
Beta Was this translation helpful? Give feedback.
-
Up! Hoping that there are some ideas floating around. There is an option to "half-solve" it with Asset Observations. For example an Hourly partitioned asset would run with a schedule every 5 minutes. Every 5 minutes it would ETL some data and it would mark the 5 minute chunk as completed (with Asset Observation). The partition itself would materialize when 5 minutes chunks for the whole hour are done. There's more to it code wise but that's the bigger picture. So it's doable. But as I said before this is only half the solution. If there is a downstream ETL asset that needs to be run in every 5 minutes as well then this approach won't work because how would the downstream asset know when upstream asset's 5 minute chunk is done. Yes, there's probably a way to build a rocket ship (sensor) which triggers the downstream but it's not scalable and I'm sure there are other corner cases which will make it extremely complex (e.g rerunning). |
Beta Was this translation helpful? Give feedback.
-
Although I like this (direction), IMHO, I think a good way to give Dagster this capability is to integrate with something like bytewax. It's probably a longer term project and it involves chosing which ETL provider to integrate with. Managing ETL with Dagster, as you've found, is not fun, as it isn't mean for that, but I'm with you on the expectation that if Dagster wants to be a single pane of glass for data observability, it has to also become SPOG for "observability pipelines". |
Beta Was this translation helpful? Give feedback.
-
Up! Hoping to get more traction on this topic. Is this something Dagster is moving towards in the future.? E.g developing perhaps an asset type of some kind to handle frequent ETLs. As far as I know then the integrations are missing the monitoring part as well (which works together with Dagster UI). You could use dlt with Dagster but visually and in code setting up the flows (e.g a chain of 3 assets where every asset is partitioned by 5 minutes) can't be done out of the box. Yes, you could create a custom PartitionsDefinition but that isn't scalable. Iirc optimal number of partitions for an asset is around ~100000? Correct me if I'm wrong. |
Beta Was this translation helpful? Give feedback.
-
Hoping to get some ideas if there's a way to implement frequent ETLs with Dagster's
AutoMaterializePolicy
/AutomationCondition
.What I strive to achieve is really simple. An ETL asset that's partitioned every 5 minutes. That's 288 partitions per day and a bit over 100000 partitions for a year. Might be okey if there's only 1 ETL asset but if you have bunch of them then using partitioned asset is not scalable.
I've seen some example where the ETL asset is unpartitioned and runs every 5 minutes. Totally doable but
Beta Was this translation helpful? Give feedback.
All reactions