Dagster supports backfills, which let you reprocess specific partitions when late data arrives: you can trigger a backfill for just the affected partitions instead of reprocessing the entire dataset. Backfills can be launched manually, or you can use Declarative Automation to automatically rematerialize partitions whenever their upstream dependencies are updated.
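As a rough illustration of the Declarative Automation approach, here is a minimal sketch: the asset names and bodies are placeholders, it assumes a recent Dagster release where `AutomationCondition` is available, and the automation sensor still needs to be enabled in your deployment.

```python
import dagster as dg

daily = dg.DailyPartitionsDefinition(start_date="2024-01-01")

@dg.asset(partitions_def=daily)
def raw_tracker_data(context: dg.AssetExecutionContext):
    # Placeholder: ingest the day's fleet-tracker uploads.
    ...

@dg.asset(
    partitions_def=daily,
    deps=[raw_tracker_data],
    # Eagerly rematerialize a partition of this asset whenever the
    # matching upstream partition is updated (e.g. by a backfill).
    automation_condition=dg.AutomationCondition.eager(),
)
def daily_report(context: dg.AssetExecutionContext):
    # Placeholder: build the day's report from the raw data.
    ...

defs = dg.Definitions(assets=[raw_tracker_data, daily_report])
```

For one-off fixes, you can also relaunch individual partitions from the UI, or with the `dagster asset materialize` CLI by passing `--partition`.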
-
Hey everyone,
I’m working on a data pipeline that processes data from fleet trackers once per day for reporting, using Dagster to orchestrate the workflow. A key challenge is late-arriving data: due to connectivity issues, trackers sometimes upload data hours or even days after it was generated. This means I need to reprocess previous days' data when new data comes in.
Currently, I’m using daily partitions, but I’m trying to figure out the best way to handle this without excessive manual intervention or reprocessing the entire dataset every time late data arrives.
It's my first time working with Dagster. My workflow pulls data for the previous day (based on when the data arrived, not the actual event timestamp), then runs it through a Pandas script that does some processing and writes out a Parquet file for each day. I know the end stage of the pipeline should be a DuckDB file, but I'm not sure whether I should be storing things in an intermediate format to make things easier for my use case.
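Roughly, the daily step looks like this (simplified; `load_uploads_received_on` and `transform` are stand-ins for the real ingestion and Pandas processing code):

```python
import dagster as dg
import pandas as pd

daily = dg.DailyPartitionsDefinition(start_date="2024-01-01")

@dg.asset(partitions_def=daily)
def daily_tracker_parquet(context: dg.AssetExecutionContext) -> None:
    day = context.partition_key  # e.g. "2024-06-01"
    # Stand-in: fetch everything the trackers uploaded on that calendar day.
    df: pd.DataFrame = load_uploads_received_on(day)
    # Stand-in: the Pandas processing step.
    df = transform(df)
    df.to_parquet(f"output/{day}.parquet")
```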
Thanks in advance for any insights!