werf helm upgrade fails with error processing rollout phase stage: error tracking resources: but normal helm doesn't #6048

felipecrs opened this issue Apr 5, 2024 · 5 comments

Comments

@felipecrs

felipecrs commented Apr 5, 2024

Before proceeding

  • I didn't find a similar issue

Version

1.2.305

How to reproduce

This is a bit difficult to reproduce because it does not happen every time. However, I get this error with werf helm upgrade, whereas helm upgrade --wait never fails in this situation.
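For reference, the comparison is roughly between the following two invocations (a hypothetical minimal reproduction; the actual run goes through helmfile, as the output below shows):

    # intermittently fails with the tracking error shown below:
    werf helm upgrade --install --wait jenkins ./jenkins-5.1.5.tgz

    # never fails in the same situation:
    helm upgrade --install --wait jenkins ./jenkins-5.1.5.tgz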

Result

Upgrading release=jenkins, chart=/home/felipecrs/.cache/helmfile/https_github_com/jenkinsci/helm-charts/releases/download/jenkins-5.1.5/jenkins-5.1.5.tgz

┌ Waiting for resources to become ready
│ sts/jenkins ERROR: RecreatingFailedPod: StatefulSet default/jenkins is recreating failed Pod jenkins-0
│ 1/1 allowed errors occurred for sts/jenkins: continue tracking
│ sts/jenkins ERROR: FailedDelete: delete Pod jenkins-0 in StatefulSet jenkins failed error: pods "jenkins-0" not found
│ Allowed failures count for sts/jenkins exceeded 1 errors: stop tracking immediately!
│ 
│ ┌ Failed resource sts/jenkins service messages
│ │ added
│ │ po/jenkins-0 added
│ │ po/jenkins-0 added
│ │ event: RecreatingFailedPod: StatefulSet default/jenkins is recreating failed Pod jenkins-0
│ │ event: FailedDelete: delete Pod jenkins-0 in StatefulSet jenkins failed error: pods "jenkins-0" not found
│ └ Failed resource sts/jenkins service messages
└ Waiting for resources to become ready (2.01 seconds) FAILED

Error: UPGRADE FAILED: error processing rollout phase stage: error tracking resources: sts/jenkins failed: FailedDelete: delete Pod jenkins-0 in StatefulSet jenkins failed error: pods "jenkins-0" not found

FAILED RELEASES:
NAME      CHART                                                                                        VERSION   DURATION
jenkins   https://github.com/jenkinsci/helm-charts/releases/download/jenkins-5.1.5/jenkins-5.1.5.tgz                   4s

Expected result

I believe it was supposed to keep retrying until the timeout, which is what helm upgrade --wait does.

Additional information

If you need me to build a reproducible environment, please let me know.

@ilya-lesikov
Member

When this happens, can you check whether the pod jenkins-0 has any issues, like a container failing to start or getting killed by a probe? We have our own engine for tracking resource statuses and errors, so we can react to resource failures faster and stop the release or run auto-rollback without waiting for the timeout like Helm does.

By default we ignore the first error per pod, but fail the release if it happens again. This behavior can be configured with annotations:
https://werf.io/documentation/v1.2/reference/deploy_annotations.html#failures-allowed-per-replica
https://werf.io/documentation/v1.2/reference/deploy_annotations.html#fail-mode
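
For illustration only, a minimal sketch of what these annotations could look like on the StatefulSet (the annotation keys are from the werf docs linked above; the values are arbitrary examples, not recommendations):

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: jenkins
      annotations:
        # Tolerate up to 3 tracking errors per replica (default is 1).
        werf.io/failures-allowed-per-replica: "3"
        # Keep waiting until the end of the deploy process instead of
        # failing the release as soon as the error threshold is exceeded.
        werf.io/fail-mode: HopeUntilEndOfDeployProcess
    spec:
      # ... rest of the StatefulSet spec unchanged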

@felipecrs
Author

I just caught the issue again. Then:

k get pods
NAME        READY   STATUS    RESTARTS   AGE
jenkins-0   1/1     Running   0          39s

k describe pod jenkins-0
Name:             jenkins-0
Namespace:        default
Priority:         0
Service Account:  jenkins
Node:             k3d-jenkins-agent-dind-test-server-0/172.28.0.2
Start Time:       Fri, 05 Apr 2024 17:42:25 -0300
Labels:           app.kubernetes.io/component=jenkins-controller
                  app.kubernetes.io/instance=jenkins
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=jenkins
                  controller-revision-hash=jenkins-7dd6558c9
                  statefulset.kubernetes.io/pod-name=jenkins-0
Annotations:      checksum/config: 9cda187dbf3e8e406c63fb2fa0f8b9be1282a524afe08f969aaa13634095fca5
Status:           Running
IP:               10.42.0.14
IPs:
  IP:           10.42.0.14
Controlled By:  StatefulSet/jenkins
Init Containers:
  init:
    Container ID:  containerd://2b16a7b993d311fa92aea7c6bfbee146191391e21a52caf53c10b53453f46d73
    Image:         jenkins-agent-dind-test-registry:5000/jenkins:latest
    Image ID:      jenkins-agent-dind-test-registry:5000/jenkins@sha256:87909327ff3bea4bcf6067f5be6fa7cfa0f714ef5d4fd75b68987dc17e284396
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      /var/jenkins_config/apply_config.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 05 Apr 2024 17:42:26 -0300
      Finished:     Fri, 05 Apr 2024 17:42:26 -0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:        50m
      memory:     256Mi
    Environment:  <none>
    Mounts:
      /var/jenkins_config from jenkins-config (rw)
      /var/jenkins_home from jenkins-home (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m2nfn (ro)
Containers:
  jenkins:
    Container ID:  containerd://8cd20a8929bf0b96285dd9db763367a677d8f14817271408881b3874db382108
    Image:         jenkins-agent-dind-test-registry:5000/jenkins:latest
    Image ID:      jenkins-agent-dind-test-registry:5000/jenkins@sha256:87909327ff3bea4bcf6067f5be6fa7cfa0f714ef5d4fd75b68987dc17e284396
    Ports:         8080/TCP, 50000/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --httpPort=8080
    State:          Running
      Started:      Fri, 05 Apr 2024 17:42:27 -0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:      50m
      memory:   256Mi
    Liveness:   http-get http://:http/login delay=0s timeout=5s period=10s #success=1 #failure=5
    Readiness:  http-get http://:http/login delay=0s timeout=5s period=10s #success=1 #failure=3
    Startup:    http-get http://:http/login delay=0s timeout=5s period=10s #success=1 #failure=12
    Environment:
      SECRETS:                   /run/secrets/additional
      POD_NAME:                  jenkins-0 (v1:metadata.name)
      JAVA_OPTS:                 
      JENKINS_OPTS:              --webroot=/var/jenkins_cache/war 
      JENKINS_SLAVE_AGENT_PORT:  50000
      CASC_JENKINS_CONFIG:       /var/jenkins_home/casc_configs
    Mounts:
      /run/secrets/additional from jenkins-secrets (ro)
      /tmp from tmp-volume (rw)
      /var/jenkins_cache from jenkins-cache (rw)
      /var/jenkins_config from jenkins-config (ro)
      /var/jenkins_home from jenkins-home (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m2nfn (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  jenkins-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      jenkins
    Optional:  false
  jenkins-secrets:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          jenkins
    SecretOptionalName:  <nil>
  jenkins-cache:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  jenkins-home:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  jenkins
    ReadOnly:   false
  sc-config-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  tmp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-m2nfn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  54s   default-scheduler  Successfully assigned default/jenkins-0 to k3d-jenkins-agent-dind-test-server-0
  Normal   Pulling    53s   kubelet            Pulling image "jenkins-agent-dind-test-registry:5000/jenkins:latest"
  Normal   Pulled     53s   kubelet            Successfully pulled image "jenkins-agent-dind-test-registry:5000/jenkins:latest" in 73.017453ms (73.025639ms including waiting)
  Normal   Created    53s   kubelet            Created container init
  Normal   Started    53s   kubelet            Started container init
  Normal   Pulling    52s   kubelet            Pulling image "jenkins-agent-dind-test-registry:5000/jenkins:latest"
  Normal   Pulled     52s   kubelet            Successfully pulled image "jenkins-agent-dind-test-registry:5000/jenkins:latest" in 57.934933ms (57.959749ms including waiting)
  Normal   Created    52s   kubelet            Created container jenkins
  Normal   Started    52s   kubelet            Started container jenkins
  Warning  Unhealthy  44s   kubelet            Startup probe failed: HTTP probe failed with statuscode: 503

@felipecrs
Author

felipecrs commented Apr 5, 2024

Looks like the pod was indeed unhealthy for some time, probably while it was being terminated (it takes a few seconds to terminate, and the healthcheck would fail during that period).

@felipecrs
Author

I just wonder if werf should really be this aggressive by default. My intention is just to replace the helm calls in my pipelines with werf helm, so it would be nicer if I could avoid making changes to the chart templates.

@ilya-lesikov
Member

I see that this single failed startupProbe is not the reason:

  Warning  Unhealthy  44s   kubelet            Startup probe failed: HTTP probe failed with statuscode: 503

But this likely is:

Looks like the pod was indeed unhealthy for some time, probably during the time is was being terminated

since

│ │ event: RecreatingFailedPod: StatefulSet default/jenkins is recreating failed Pod jenkins-0
│ │ event: FailedDelete: delete Pod jenkins-0 in StatefulSet jenkins failed error: pods "jenkins-0" not found

Well, that wasn't our intention; we should probably ignore pod errors while the pod is terminating.

For now, as a workaround, add these annotations to your StatefulSet:
https://werf.io/documentation/v1.2/usage/deploy/tracking.html#disabling-state-tracking-and-ignoring-resource-errors-werf-only
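
In other words, something like this (a sketch based on the linked docs; adjust to taste):

    metadata:
      annotations:
        # Ignore tracking errors for this resource and continue the deploy.
        werf.io/fail-mode: IgnoreAndContinueDeployProcess
        # Don't block the release on this resource becoming ready.
        werf.io/track-termination-mode: NonBlocking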
