Controller should return reconciliation error to Rollout object with exponential backoff #3547

agaudreault opened this issue Apr 30, 2024
Labels: enhancement (New feature or request)
Summary

When an error occurs in the controller, the controller retries immediately by adding the workqueue item back to the queue. This causes a reconciliation loop that degrades controller performance and affects other Rollouts. Even though the queue is rate limited, multiple failing Rollouts can still hurt performance.

A Rollout that consistently causes reconcile errors will also appear "stuck", with little feedback surfaced on the Rollout object. Users have to contact the owner of the Rollout controller to investigate the logs and find out why their Rollout is not progressing.

Use Cases

In the controller, an exponential backoff should be applied when a Rollout is in Error. The workqueue already exposes workqueue.NumRequeues(key), which reports how many times the Rollout has failed successively.

On error, the controller should add an event to the failing Rollout object. This would let users know that their Rollout is failing and how many times it has been retried.

To be even more useful, the event should contain the error itself, allowing both users and Rollout operators to see its cause without digging into the controller pod logs. The error should be sanitized so that controller-specific details from the stack trace are not exposed to users. For that, a new ReconciliationError type could be created to wrap the current error.

Finally, it should be possible to surface that error in the Rollout status object after it has been retried for some time, similar to how Pods expose information in containerStatuses. That way, systems like Argo CD can use the status to evaluate the health of a Rollout, and a Rollout in Error could be considered Degraded.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
