Describe the bug
I'm still digging into this and collecting evidence. This is actually the second time this has happened, both times with the same service and on the same cluster (the service deploys to other clusters as well, with no issues there yet).
The first time this happened we upgraded argo-rollouts from 1.6.0 to 1.6.4, specifically because the fixes mentioned in that release lined up with our theory at the time.
Version: quay.io/argoproj/argo-rollouts:v1.6.4
Args:
Rollout steps:
The canary should finish in about 10 minutes; however, the analysis run performed 224 queries (1 per minute), for a grand total of roughly 3h45m.
Interestingly, there are two separate rollouts of this service that happened at almost exactly the same time, and they both went into this behavior. The other one, however, had a total of 223 successful runs.
Other things that might be worth mentioning:
The rollout is managed by an HPA which currently scales on CPU and causes a replica-count spike when the new pods come online (see the illustrative HPA sketch below this list).
What you see here is the pod counts for the ReplicaSets of the two rollouts together:
orange and blue are canary/stable for one version of the service
yellow and green are canary/stable for the other version of the service
We are working with the customer to better configure their HPA, but we're unsure whether the constant scaling contributes to the root cause here.
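For context, a minimal sketch of the kind of HPA configuration we're discussing with the customer, assuming an autoscaling/v2 HPA targeting the Rollout and scaling on CPU; the names and numbers are illustrative, not the customer's actual values:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service                      # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: my-service                    # hypothetical Rollout name
  minReplicas: 4
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # dampen the spike when new pods come online
    scaleDown:
      stabilizationWindowSeconds: 300
```
The behavior.scaleUp stabilization window is one knob we're considering to smooth out the replica spike; whether the scaling churn matters for the stuck-rollout behavior is exactly what we're unsure about.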
To Reproduce
Unknown at this time. We handle many deployments a day, most of them with canaries, and the only bad behavior is with this particular service, and only twice now; both times it affected both rollouts.
It's possible that the fix in 1.6.6 that deals with rapid redeploys before the canary finishes could address this as well. Is there perhaps some label-selector mix-up that is causing the controller to be confused?
Expected behavior
I expect a 10-minute canary to complete or fail in roughly 10 minutes.
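For reference, a hypothetical sketch of what we mean by a ~10-minute canary: a Rollout strategy whose pause steps sum to about 10 minutes, with a background AnalysisRun querying once per minute. This is illustrative only, not our actual spec (that is attached above under "Rollout steps"):
```yaml
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate   # hypothetical AnalysisTemplate, 1 query/minute
        startingStep: 1
      steps:
        - setWeight: 20
        - pause: {duration: 2m}
        - setWeight: 40
        - pause: {duration: 3m}
        - setWeight: 80
        - pause: {duration: 5m}
```
With steps like these, the background analysis should produce on the order of 10 measurements before the rollout promotes or aborts, not 224.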
Screenshots
Screenshots above
Version
We've seen this in 1.6.0 and 1.6.4
Logs
Given that this is a 4-hour window, the logs are quite extensive. If there are specific things I can look for in the logs I'd be happy to, but this rollout isn't the only thing happening at the time.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
Update:
We've found more of these using the following Prometheus query: (((sum_over_time(max by (name, source_eks_cluster) (analysis_run_info{phase="Running"})[24h:5m])) * 5)/60) > 1
This basically sums, per analysis run, the number of 5-minute blocks over the last 24 hours in which it reported phase=Running, converts that to hours, and selects runs that spent more than 1 hour Running. This seems more widespread than we previously thought.
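If it helps others watch for this, the same expression could be wrapped in a Prometheus alerting rule; a sketch assuming the standard rule-file format and our custom source_eks_cluster label, with the group name, alert name, and annotations being ours for illustration:
```yaml
groups:
  - name: argo-rollouts-analysis
    rules:
      - alert: AnalysisRunRunningTooLong
        # Count 5m blocks with phase=Running over 24h, convert to hours, flag > 1h.
        expr: |
          (
            sum_over_time(
              max by (name, source_eks_cluster) (analysis_run_info{phase="Running"})[24h:5m]
            ) * 5 / 60
          ) > 1
        labels:
          severity: warning
        annotations:
          summary: 'AnalysisRun {{ $labels.name }} has been Running for over an hour in the last 24h'
```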
We use the notifications engine to tell our pipelines when the rollout is complete, using a when trigger of obj.status.phase == 'Healthy'. So somewhere between the update that sets the phase to Healthy and the point where the controller 'finalizes' the rollout (rewrites the services/VirtualService weights back to 100, shuts down the analysis runs, etc.), it's getting stuck for seemingly no reason.
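For clarity on how we hook into that, a minimal sketch of the trigger in the argo-rollouts-notification-configmap; the trigger and template names here are ours for illustration, not the stock ones:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
data:
  # Fire when the Rollout reports itself Healthy.
  trigger.on-rollout-healthy: |
    - send: [pipeline-webhook-template]   # hypothetical template name
      when: obj.status.phase == 'Healthy'
```
Because the trigger fires on obj.status.phase == 'Healthy', our pipelines consider the rollout done at that point, even though the controller apparently still has finalization work (and the AnalysisRuns) outstanding.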