SUP-3496: Ensure token is returned to bucket when BK jobs are acquired and failed #535

Merged 1 commit on Mar 7, 2025


petetomasik
Contributor

While testing #534, I found a scenario where tryReturnToken() was not called when failJob() was invoked to acquire and fail a job. Each such failure kept its token instead of returning it, so available-tokens eventually dropped to zero while num-in-flight stayed pinned at max-in-flight. At that point no Kubernetes jobs were running, yet the controller could not create any new ones. Restarting the controller was the only way to recover, because the token bucket is initialized (filled) with max-in-flight tokens at Start().
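To illustrate the failure mode, here is a minimal sketch of a token bucket modeled as a buffered channel. It is not the controller's actual code: the `limiter` type, `newLimiter`, `tryTakeToken`, `available`, `simulate`, and the `returnToken` flag are hypothetical names introduced for the example. Only `tryReturnToken`, `failJob`, and the fill-at-start behavior are taken from the description above.

```go
package main

import "fmt"

// limiter is a hypothetical stand-in for the controller's in-flight limiter.
// The token bucket is modeled as a buffered channel of empty structs.
type limiter struct {
	tokens chan struct{}
}

// newLimiter fills the bucket with maxInFlight tokens, mirroring the
// fill that the description says happens at Start().
func newLimiter(maxInFlight int) *limiter {
	l := &limiter{tokens: make(chan struct{}, maxInFlight)}
	for i := 0; i < maxInFlight; i++ {
		l.tokens <- struct{}{}
	}
	return l
}

// tryTakeToken (hypothetical helper) reports whether a token was available.
func (l *limiter) tryTakeToken() bool {
	select {
	case <-l.tokens:
		return true
	default:
		return false
	}
}

// tryReturnToken puts a token back without blocking if the bucket is full.
func (l *limiter) tryReturnToken() bool {
	select {
	case l.tokens <- struct{}{}:
		return true
	default:
		return false
	}
}

// available reports how many tokens are currently in the bucket.
func (l *limiter) available() int { return len(l.tokens) }

// failJob stands in for acquiring and failing a job.
// The returnToken flag (an assumption for this sketch) toggles the fix.
func (l *limiter) failJob(returnToken bool) {
	// ... acquire the job and mark it as failed ...
	if returnToken {
		l.tryReturnToken()
	}
}

// simulate takes one token per job and then fails the job. It returns how
// many jobs obtained a token and how many tokens remain afterwards.
func simulate(maxInFlight, jobs int, returnToken bool) (scheduled, available int) {
	l := newLimiter(maxInFlight)
	for i := 0; i < jobs; i++ {
		if !l.tryTakeToken() {
			continue // bucket empty: this job cannot be scheduled
		}
		scheduled++
		l.failJob(returnToken)
	}
	return scheduled, l.available()
}

func main() {
	s, a := simulate(2, 5, false)
	fmt.Printf("without return: scheduled=%d available=%d\n", s, a)
	s, a = simulate(2, 5, true)
	fmt.Printf("with return:    scheduled=%d available=%d\n", s, a)
}
```

With a bucket of 2 and 5 failing jobs, the run without the token return schedules only 2 jobs and ends with an empty bucket, which matches the stuck state described above. With the return in place, all 5 jobs are scheduled and the bucket ends full.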

Fixes #302

petetomasik requested a review from a team as a code owner on March 6, 2025, 16:07
Contributor

karensawrey left a comment

LGTM!

petetomasik merged commit 71f6512 into main on Mar 7, 2025
1 check passed
petetomasik deleted the SUP-3496-return-token-on-fail-job branch on March 7, 2025, 14:56
Development

Successfully merging this pull request may close these issues.

Controller stops accepting jobs from the cluster queue
2 participants