Summary

When a reconcile error occurs, the controller retries immediately by adding the workqueue item back to the queue. This creates a tight reconciliation loop that degrades controller performance and slows reconciliation of other rollouts. Even though the queue is rate limited, several failing rollouts at once can still hurt performance.
A rollout that consistently causes reconcile errors will also appear "stuck", with little feedback surfaced on the Rollout object itself. Users have to contact the operator of the Rollout controller, who must dig through the controller logs to find out why the Rollout is not progressing.
Use Cases
In the controller, an exponential backoff should be applied when a rollout is in an error state. The workqueue already exposes `workqueue.NumRequeues(key)`, which reports how many times the item has failed in succession.
On error, the controller should also emit an event on the failing Rollout object. This would let users see that their Rollout is failing and how many times it has been retried.
To be even more useful, the event should contain the error itself, letting both users and rollout operators see the cause of the failure without digging into the controller pod logs. The error should be sanitized so that controller-specific details such as stack traces are not returned to users. For that, a new `ReconciliationError` type could be created to wrap the current error.
Finally, after the error has been retried for some time, it should be surfaced in the Rollout status object, similar to how Pods expose per-container state in `containerStatuses`. That way, systems like Argo CD could use the status to assess the health of a Rollout, and a Rollout in an error state could be considered Degraded.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.