Dagster supports backfills, which let you reprocess specific partitions when late data arrives: you can trigger a backfill for just the affected partitions instead of reprocessing the entire dataset. Backfills can be launched manually, or you can use Declarative Automation to automatically rematerialize partitions whenever their upstream dependencies are updated.
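As a rough illustration of the Declarative Automation approach, here is a minimal sketch: the asset names and bodies are placeholders, it assumes a recent Dagster release where `AutomationCondition` is available, and the automation sensor still needs to be enabled in your deployment.

```python
import dagster as dg

daily = dg.DailyPartitionsDefinition(start_date="2024-01-01")

@dg.asset(partitions_def=daily)
def raw_tracker_data(context: dg.AssetExecutionContext):
    # Placeholder: ingest the day's fleet-tracker uploads.
    ...

@dg.asset(
    partitions_def=daily,
    deps=[raw_tracker_data],
    # Eagerly rematerialize a partition of this asset whenever the
    # matching upstream partition is updated (e.g. by a backfill).
    automation_condition=dg.AutomationCondition.eager(),
)
def daily_report(context: dg.AssetExecutionContext):
    # Placeholder: build the day's report from the raw data.
    ...

defs = dg.Definitions(assets=[raw_tracker_data, daily_report])
```

For one-off fixes, you can also relaunch individual partitions from the UI, or with the `dagster asset materialize` CLI by passing `--partition`.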
-
Hey everyone,
I’m working on a data pipeline that processes data from fleet trackers once per day for reporting, using Dagster to orchestrate the workflow. A key challenge is late-arriving data: due to connectivity issues, trackers sometimes upload data hours or even days after it was generated. This means I need to reprocess previous days' data when new data comes in.
Currently, I’m using daily partitions, but I’m trying to figure out the best way to handle this without excessive manual intervention or reprocessing the entire dataset every time late data arrives.
It's my first time working with Dagster. My workflow pulls data for the previous day (based on when the data arrived, not the actual event timestamp), then runs it through a Pandas script that does some processing and writes out a Parquet file for each day. I know the end stage of the pipeline should be a DuckDB file, but I'm not sure whether I should be storing things in an intermediate format to make things easier for my use case.
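Roughly, the daily step looks like this (simplified; `load_uploads_received_on` and `transform` are stand-ins for the real ingestion and Pandas processing code):

```python
import dagster as dg
import pandas as pd

daily = dg.DailyPartitionsDefinition(start_date="2024-01-01")

@dg.asset(partitions_def=daily)
def daily_tracker_parquet(context: dg.AssetExecutionContext) -> None:
    day = context.partition_key  # e.g. "2024-06-01"
    # Stand-in: fetch everything the trackers uploaded on that calendar day.
    df: pd.DataFrame = load_uploads_received_on(day)
    # Stand-in: the Pandas processing step.
    df = transform(df)
    df.to_parquet(f"output/{day}.parquet")
```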
Thanks in advance for any insights!