Dagster schedule ticks that fail and retry should indicate in the tick timeline that they retried #23549
Comments
Hey @gibsondan, what is the expected value on the InstigationTick that indicates that it has failed and then succeeded? I'm guessing it's something like Do the tick timestamp and endTimestamp get overwritten during the retry? I guess I'm not sure that our tick model has all the information required to effectively represent both occurrences of the tick execution. Maybe the backend could create a new tick rather than mutating the failed one?
@bengotow some backend change is going to be needed here first, yeah. Here's the relevant Python code: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_scheduler/scheduler.py#L588-L596 - it would need to leave a better trace that this tick previously failed when this happens.
The problem with making a new tick is that with schedules we assume that there's at most one timestamp for a given schedule 'tick' and use that for checkpointing purposes.
Tracking internally as https://linear.app/dagster-labs/issue/FE-508.
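To make the constraint above concrete, here is a minimal sketch. It is not Dagster's scheduler code: `TickStore`, `record`, and `checkpoint` are hypothetical names, and the sketch simply assumes one record per scheduled timestamp with the latest timestamp serving as the checkpoint.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TickStore:
    """Hypothetical store that keeps one tick record per scheduled timestamp."""

    ticks: Dict[float, str] = field(default_factory=dict)  # timestamp -> status

    def record(self, timestamp: float, status: str) -> None:
        # A retry for the same timestamp overwrites the earlier status,
        # so the original failure is no longer visible afterwards.
        self.ticks[timestamp] = status

    def checkpoint(self) -> Optional[float]:
        # The latest recorded timestamp marks where scheduling resumes.
        return max(self.ticks) if self.ticks else None


store = TickStore()
store.record(1000.0, "FAILURE")
store.record(1000.0, "SUCCESS")  # retry of the same scheduled tick
```

Under these assumptions the store ends up holding only the `SUCCESS` record for timestamp `1000.0`, which is the loss of history this issue describes. Adding a second record would require a key other than the timestamp alone.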
@bengotow et al. I finally made some progress on this in #26823: there is a new field available on the frontend that we can use to distinguish between two ticks that are retries of the same 'scheduled execution time'. We just need to be sure to use that field when that is the information we want to display. After that change goes in and we make some additional changes to schedule retry behavior on the backend (in progress in #26824), we'll have two different ticks in the timeline to work with, with different actual timestamps but the same scheduled timestamp.
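As a rough sketch of how a timeline could use such a field, the snippet below groups ticks by scheduled time and orders the attempts within each group. The `Tick` shape and the name `scheduled_timestamp` are placeholders chosen for illustration; the actual field added in #26823 is not named in this thread.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Tick:
    # Hypothetical shape; field names are illustrative, not Dagster's schema.
    scheduled_timestamp: float  # the 'scheduled execution time'
    timestamp: float  # when this attempt actually ran
    status: str


def group_attempts(ticks: Iterable[Tick]) -> Dict[float, List[Tick]]:
    """Group ticks by scheduled time, ordering attempts by actual timestamp."""
    groups: Dict[float, List[Tick]] = defaultdict(list)
    for tick in ticks:
        groups[tick.scheduled_timestamp].append(tick)
    return {key: sorted(group, key=lambda t: t.timestamp) for key, group in groups.items()}


def was_retried(attempts: List[Tick]) -> bool:
    # More than one attempt for the same scheduled time means a retry happened.
    return len(attempts) > 1
```

With a grouping like this, the timeline could render a failed attempt and its successful retry as related entries rather than showing only the final outcome.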
What's the use case?
If a schedule tick fails due to a code server being temporarily unavailable, the tick will retry until the code server is available again. If that retry succeeds, the schedule timeline gives no indication of the original failure.
Furthermore, Dagster+ alerts will fire if this happens, but link to a timeline with no indication that there was a failure.
The feature request here is to indicate in the scheduler timeline that the tick originally failed.
Ideas of implementation
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.