Dagster schedule ticks that fail and retry should indicate in the tick timeline that they retried #23549

Open
gibsondan opened this issue Aug 9, 2024 · 6 comments
Labels
area: UI/UX Related to User Interface and User Experience type: feature-request

Comments

@gibsondan
Member

What's the use case?

If a schedule tick fails due to a code server being temporarily unavailable, the tick will retry until the code server is available again. If that retry succeeds, the schedule timeline gives no indication of the original failure.

Furthermore, Dagster+ alerts will fire if this happens, but link to a timeline with no indication that there was a failure.

The feature request here is to indicate in the scheduler timeline that the tick originally failed.

Ideas of implementation

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@garethbrickman garethbrickman added the area: UI/UX Related to User Interface and User Experience label Aug 9, 2024
@github-project-automation github-project-automation bot moved this to Untriaged in Dagster UI/UX Aug 9, 2024
@salazarm
Contributor

@salazarm salazarm moved this from Untriaged to Needs details in Dagster UI/UX Aug 12, 2024
@salazarm salazarm moved this from Needs details to Tracking internally in Dagster UI/UX Aug 12, 2024
@bengotow
Collaborator

Hey @gibsondan, what is the expected value on the InstigationTick that indicates it failed and then succeeded? I'm guessing it's something like error is not null AND status = success?

Do the tick timestamp and endTimestamp get overwritten during the retry? I guess I'm not sure that our tick model has all the information required to effectively represent both occurrences of the tick execution. Maybe the backend could create a new tick rather than mutating the failed one?
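For illustration, the guessed check could be sketched as below. This is only a sketch under assumptions: the `Tick` class and its `status` and `error` fields are hypothetical stand-ins, not Dagster's actual tick model, and whether a retried tick still carries its error after succeeding is exactly the open question in this comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tick:
    # Hypothetical stand-in for a tick record; not Dagster's actual model.
    timestamp: float
    status: str  # assumed values: "SUCCESS" or "FAILURE"
    error: Optional[str] = None


def failed_then_succeeded(tick: Tick) -> bool:
    """Return True if the tick ended in success but still carries an error."""
    return tick.status == "SUCCESS" and tick.error is not None
```

The check only works if the backend keeps the error on the tick after the retry succeeds.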

@gibsondan
Member Author

@bengotow some backend change is going to be needed here first, yeah. Here's the relevant Python code: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_scheduler/scheduler.py#L588-L596. It would need to leave a better trace that the tick previously failed when this happens.

@gibsondan
Member Author

The problem with making a new tick is that with schedules we assume that there's at most one timestamp for a given schedule 'tick' and use that for checkpointing purposes.
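A toy sketch of that assumption, not the actual scheduler code: the `TickLog` class and its methods are hypothetical, and it simply keys ticks by timestamp and treats the latest timestamp as the checkpoint.

```python
from typing import Dict, Optional


class TickLog:
    """Hypothetical tick log with at most one entry per tick timestamp."""

    def __init__(self) -> None:
        self._status_by_timestamp: Dict[float, str] = {}

    def record(self, timestamp: float, status: str) -> None:
        # A retry reuses the same timestamp, so it overwrites the earlier
        # status and the original failure leaves no trace.
        self._status_by_timestamp[timestamp] = status

    def checkpoint(self) -> Optional[float]:
        # The latest recorded timestamp marks how far the schedule has advanced.
        return max(self._status_by_timestamp, default=None)
```

With one entry per timestamp the checkpoint stays unambiguous, but a failed attempt followed by a successful retry is stored as a single successful tick.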

@hellendag
Member

Tracking internally as https://linear.app/dagster-labs/issue/FE-508.

@gibsondan
Member Author

@bengotow et al., I finally made some progress on this in #26823: there is a new field available on the frontend that we can use to distinguish between two ticks that are retries of the same 'scheduled execution time'. We just need to be sure to use that field when that is the information we want to display.

After that change goes in and we make some additional changes to schedule retry behavior on the backend (in progress in #26824), we'll have two different ticks in the timeline to work with, with different actual timestamps but the same scheduled timestamp.
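As a rough sketch of how a consumer might use that, ticks can be grouped by scheduled timestamp so that retries of the same scheduled execution appear together. The `Tick` class and the field names `scheduled_timestamp`, `timestamp`, and `status` are assumptions for illustration; the actual field added in #26823 may be named differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Tick:
    # Hypothetical field names; not taken from the Dagster schema.
    scheduled_timestamp: float  # when the tick was supposed to run
    timestamp: float            # when this attempt actually ran
    status: str                 # assumed values: "SUCCESS" or "FAILURE"


def group_attempts(ticks: List[Tick]) -> Dict[float, List[Tick]]:
    """Group ticks by scheduled timestamp, ordering attempts by actual time."""
    groups: Dict[float, List[Tick]] = defaultdict(list)
    for tick in ticks:
        groups[tick.scheduled_timestamp].append(tick)
    for attempts in groups.values():
        attempts.sort(key=lambda t: t.timestamp)
    return dict(groups)


def succeeded_after_failure(attempts: List[Tick]) -> bool:
    """True if the last attempt succeeded and an earlier attempt failed."""
    if len(attempts) < 2:
        return False
    earlier, last = attempts[:-1], attempts[-1]
    return last.status == "SUCCESS" and any(t.status == "FAILURE" for t in earlier)
```

A timeline could then mark any group where `succeeded_after_failure` returns True as a tick that originally failed.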

Projects
Status: Tracking internally
5 participants