Canary deployment w/ 5 steps with a total of 10m of waits, takes ~3.73h to complete #3569

Open
SleepyBrett opened this issue May 11, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@SleepyBrett

Checklist:

  • [ ] I've included steps to reproduce the bug.
  • [x] I've included the version of argo rollouts.

Describe the bug

I'm still digging into this and collecting evidence. This is actually the second time this has happened, both times with the same service on the same cluster (the service deploys to other clusters as well, with no issues there yet).

After the first occurrence we upgraded argo-rollouts from 1.6.0 to 1.6.4, specifically because the fixes mentioned in that release lined up with our theory at the time.

Version: quay.io/argoproj/argo-rollouts:v1.6.4

Args:

      - args:
        - --healthzPort=8080
        - --metricsport=8090
        - --leader-elect

Rollout steps:

      steps:
      - setWeight: 15
      - pause:
          duration: 5m
      - setWeight: 50
      - pause:
          duration: 5m

These steps should complete in roughly 10 minutes of waits; however, the background analysis run performed 224 queries (1 per minute), for a grand total of ~3.73h.
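For reference, the background analysis is presumably wired into the canary strategy roughly like this (a minimal sketch; the template name is a placeholder, and the args are copied from the AnalysisRun below):

      strategy:
        canary:
          analysis:
            # background analysis: runs for the full duration of the canary steps
            templates:
            - templateName: response-code-500   # placeholder name
            args:
            - name: canary-service-name
              value: xxx-microservice-canary
            - name: service-name
              value: xxx-microservice
            - name: service-namespace
              value: xxx

The AnalysisRun it produced: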

apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
metadata:
  annotations:
    rollout.argoproj.io/revision: "73"
  creationTimestamp: "2024-05-10T00:00:02Z"
  generation: 227
  labels:
    app.kubernetes.io/instance: xxx
    app.kubernetes.io/name: microservice
    rollout-type: Background
    rollouts-pod-template-hash: 56457dc597
  name: xxx-microservice-56457dc597-73
  namespace: xxx
  ownerReferences:
  - apiVersion: argoproj.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Rollout
    name: xxx-microservice
    uid: 3bc07e9f-ea29-4b95-8c1a-b8aaea0cf069
  resourceVersion: "4829065944"
  uid: f17fe07d-5796-4690-9ca1-e533718caabe
spec:
  args:
  - name: canary-service-name
    value: xxx-microservice-canary
  - name: service-name
    value: xxx-microservice
  - name: service-namespace
    value: xxx
  metrics:
  - failureLimit: 3
    interval: 1m
    name: xxxx-microservice-response-code-500
    provider:
      prometheus:
        authentication:
          sigv4: {}
        query: |
          sum(irate(
            istio_requests_total{destination_service="{{args.canary-service-name}}.{{args.service-namespace}}.svc.cluster.local",kubernetes_namespace="{{args.service-namespace}}",response_code=~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{destination_service="{{args.canary-service-name}}.{{args.service-namespace}}.svc.cluster.local",kubernetes_namespace="{{args.service-namespace}}"}[5m]
          ))
    successCondition: len(result) == 0 || result[0] <= 0.001 || isNaN(result[0])
  terminate: true
status:
  dryRunSummary: {}
  message: Run Terminated
  metricResults:
  - count: 224
    measurements:
    - finishedAt: "2024-05-10T03:34:02Z"
      phase: Successful
      startedAt: "2024-05-10T03:34:02Z"
      value: '[0]'
    - finishedAt: "2024-05-10T03:35:02Z"
      phase: Successful
      startedAt: "2024-05-10T03:35:02Z"
      value: '[0]'
    - finishedAt: "2024-05-10T03:36:02Z"
      phase: Successful
      startedAt: "2024-05-10T03:36:02Z"
      value: '[0]'
    - finishedAt: "2024-05-10T03:37:02Z"
      phase: Successful
      startedAt: "2024-05-10T03:37:02Z"
      value: '[0]'
    - finishedAt: "2024-05-10T03:38:02Z"
      phase: Successful
      startedAt: "2024-05-10T03:38:02Z"
      value: '[0]'
    - finishedAt: "2024-05-10T03:39:02Z"
      phase: Successful
      startedAt: "2024-05-10T03:39:02Z"
      value: '[0]'
    - finishedAt: "2024-05-10T03:40:02Z"
      phase: Successful
      startedAt: "2024-05-10T03:40:02Z"
      value: '[0]'
    - finishedAt: "2024-05-10T03:41:02Z"
      phase: Successful
      startedAt: "2024-05-10T03:41:02Z"
      value: '[0]'
    - finishedAt: "2024-05-10T03:42:02Z"
      phase: Successful
      startedAt: "2024-05-10T03:42:02Z"
      value: '[0]'
    - finishedAt: "2024-05-10T03:43:02Z"
      phase: Successful
      startedAt: "2024-05-10T03:43:02Z"
      value: '[0]'
    metadata:
      ResolvedPrometheusQuery: |
        sum(irate(
          istio_requests_total{destination_service="xxx-microservice-canary.users.svc.cluster.local",kubernetes_namespace="xxx",response_code=~"5.*"}[5m]
        )) /
        sum(irate(
          istio_requests_total{destination_service="xxx-microservice-canary.users.svc.cluster.local",kubernetes_namespace="xxx"}[5m]
        ))
    name: xxx-microservice-response-code-500
    phase: Successful
    successful: 224
  phase: Successful
  runSummary:
    count: 1
    successful: 1
  startedAt: "2024-05-10T00:00:02Z"

Interestingly, two separate rollouts of this service happened at almost exactly the same time, and both went into this behavior. The other one, however, ended with a total of 223 successful measurements.

Other things that might be worth mentioning:
The rollout is managed by an HPA which currently scales on CPU and causes a replica-count spike when the new pods come online.
[screenshot: ReplicaSet pod counts for the two rollouts]
What you see here are the pod counts for the ReplicaSets of the two rollouts together:

  • orange and blue are canary/stable for one version of the service
  • yellow and green are canary/stable for the other version of the service

We are working with the customer to better configure their HPA, but we're unsure whether the constant scaling contributes to the root cause here.
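
For reference, a CPU-based HPA targeting the Rollout looks roughly like this (a minimal sketch; the min/max replicas and utilization target are assumptions, not the customer's actual values):

      apiVersion: autoscaling/v2
      kind: HorizontalPodAutoscaler
      metadata:
        name: xxx-microservice
        namespace: xxx
      spec:
        scaleTargetRef:
          apiVersion: argoproj.io/v1alpha1
          kind: Rollout                  # the HPA scales the Rollout, which then sizes the stable/canary ReplicaSets
          name: xxx-microservice
        minReplicas: 2                   # assumption
        maxReplicas: 20                  # assumption
        metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70     # assumption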

To Reproduce

Unknown at this time. We handle many deployments a day, most with canary steps, and the only bad behavior is with this particular service, and only twice now; both times it affected both rollouts of the service.

It's possible that the fix in 1.6.6 that deals with rapid redeploys before the canary finishes could somehow fix this as well; is there maybe some label-selector shenanigans causing the controller to get confused?

Expected behavior

I expect a canary with 10m of waits to complete or fail in ~10m.

Screenshots

Screenshots above

Version
We've seen this in 1.6.0 and 1.6.4

Logs

Given that this is a 4-hour window, the logs are quite extensive. If there are specific things I should look for in the logs I'd be happy to dig them out, but this rollout isn't the only thing happening at the time.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@SleepyBrett added the bug label May 11, 2024
@SleepyBrett
Author

Update:
We've found more of these using the following Prometheus query:
(((sum_over_time(max by (name, source_eks_cluster) (analysis_run_info{phase="Running"})[24h:5m])) * 5)/60) > 1

Basically this creates a value that is the sum of the 5m blocks over the last 24 hours during which an analysis run was in the Running phase, with some math to convert it into hours, and then filters for runs taking longer than 1 hour. This seems more widespread than we previously thought.
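
Annotating the query above piece by piece (my reading of it):

    # analysis_run_info{phase="Running"}        → 1 for every AnalysisRun currently in the Running phase
    # max by (name, source_eks_cluster) (...)   → collapse to one series per run (per cluster)
    # (...)[24h:5m]                             → subquery: sample that series every 5m over the last 24h
    # sum_over_time(...)                        → number of 5m samples during which the run was Running
    # * 5 / 60                                  → convert the count of 5m samples into hours
    # > 1                                       → keep only runs that were Running for more than 1 hour
    (((sum_over_time(max by (name, source_eks_cluster) (analysis_run_info{phase="Running"})[24h:5m])) * 5) / 60) > 1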

We use the notifications engine to tell our pipelines when the rollout is complete, with a trigger whose when condition is obj.status.phase == 'Healthy'. So somewhere between the update that sets the phase to Healthy and the point where the controller 'finalizes' the rollout (rewrites the Services/VirtualService weights to 100, shuts down the analysis runs, etc.), it's getting stuck for seemingly no reason.
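
For context, that trigger lives in the notifications ConfigMap roughly along these lines (a minimal sketch; the trigger and template names are placeholders, and the when condition is the one quoted above):

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: argo-rollouts-notification-configmap
      data:
        trigger.on-rollout-healthy: |
          - send: [rollout-healthy]               # template name is a placeholder
            when: obj.status.phase == 'Healthy'   # condition quoted from the comment above
        template.rollout-healthy: |
          message: Rollout {{.rollout.metadata.name}} reached phase Healthy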
