Controller should return reconciliation error to Rollout object with exponential backoff #3547

agaudreault opened this issue Apr 30, 2024
Labels: enhancement (New feature or request)
Summary

When an error occurs in the controller, the controller retries immediately by adding the workqueue item back to the queue. This causes a reconciliation loop that degrades controller performance and affects other Rollouts. Even though the queue is rate limited, multiple failing Rollouts can still hurt performance.

A Rollout that consistently causes reconcile errors will also appear "stuck", with little feedback surfaced on the Rollout object. Users have to contact the owner of the Rollout controller to investigate the logs and find out why their Rollout is not progressing.

Use Cases

In the controller, an exponential backoff should be applied when a Rollout is in Error. The workqueue already exposes workqueue.NumRequeues(key), which reports how many times the Rollout has failed successively.

On error, the controller should add an event to the failing Rollout object. This would let users know that their Rollout is failing and how many times it has been retried.

To be even more useful, the event should contain the error itself, allowing both users and Rollout operators to see its cause without digging into the controller pod logs. The error should be sanitized so that controller-specific details from the stack trace are not exposed to users. For that, a new ReconciliationError type could be created to wrap the current error.

Finally, it should be possible to surface that error in the Rollout status object after it has been retried for some time, similar to how Pods expose information in containerStatuses. That way, systems like Argo CD can use the status to evaluate the health of a Rollout, and a Rollout in Error could be considered Degraded.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
