Job Analysis Run never fails when the job image can't be pulled (it just succeeds and promotes). #3562

Open
MohammedShetaya opened this issue May 8, 2024 · 0 comments
Labels
bug Something isn't working

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of Argo Rollouts.

Describe the bug

An AnalysisRun that uses the Job metric provider never fails when the job's image can't be pulled; it just succeeds and the rollout is promoted. When the Job's pod can't pull its image, the pod stays in ErrImagePull indefinitely and the Job itself never fails. Argo Rollouts nevertheless waits for some time, treats this as a success, and promotes the canary.
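A possible mitigation, untested here and offered only as a sketch: the Kubernetes Job field `spec.activeDeadlineSeconds` makes the Job controller terminate the Job's pods and mark the Job as Failed (reason `DeadlineExceeded`) once the deadline passes, whether or not a pod ever started. The existing `backoffLimit: 2` presumably does not help here because a pod stuck in ErrImagePull stays Pending and is never counted as a failed attempt. It is an assumption that Argo Rollouts would then record the measurement as failed, and the value `20` is arbitrary, chosen only to be shorter than the roughly 30 s of pauses in the rollout steps.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: acceptance-test
spec:
  metrics:
    - name: acceptance-test
      provider:
        job:
          spec:
            backoffLimit: 2
            completions: 1
            # Assumption: mark the Job as Failed after 20 s even if its pod
            # never starts (e.g. stuck in ErrImagePull). The value is arbitrary.
            activeDeadlineSeconds: 20
            template:
              spec:
                containers:
                  - name: test
                    image: docker.io/library/lol:5 # this image does not exist.
                    command: ["bash"]
                    args: ["-c", "echo 'Hello world! going to exit with 0 (success).' && sleep 10 && exit 0"]
                restartPolicy: Never
```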

To Reproduce

  1. Create a namespace called test-namespace.
  2. Apply the following Kubernetes manifests.
  3. Wait for the pods to come up.
  4. Change the deployment image to nginx:1.19.0 to trigger a rollout.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
  namespace: test-namespace
spec:
  replicas: 0
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.9.8 # update it to 1.19.0 to trigger a rollout, as the first time will only create the pods.
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-stable
  namespace: test-namespace
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-canary
  namespace: test-namespace
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: acceptance-test
spec:
  metrics:
    - name: acceptance-test
      provider:
        job:
          spec:
            backoffLimit: 2
            completions: 1
            template:
              spec:
                containers:
                  - name: test
                    image: docker.io/library/lol:5 # this image does not exist.
                    command: ["bash"]
                    args: ["-c", "echo 'Hello world! going to exit with 0 (success).' && sleep 10 && exit 0"]
                restartPolicy: Never
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: awesome-rollout
  namespace: test-namespace
spec:
  replicas: 2
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-deployment
  strategy:
    canary:
      canaryService: nginx-canary
      stableService: nginx-stable
      analysis:
        templates:
          - templateName: acceptance-test
            clusterScope: true
      steps:
        - setWeight: 20
        - pause: {duration: 10s}
        - setWeight: 40
        - pause: {duration: 10s}
        - setWeight: 80
        - pause: {duration: 10s}
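For step 4, only the container image of `test-deployment` changes. A fragment of the Deployment after the edit, assuming nothing else in the manifest is modified:

```yaml
spec:
  template:
    spec:
      containers:
        - name: nginx
          # Was nginx:1.9.8; changing the pod template triggers the canary rollout.
          image: nginx:1.19.0
```

Equivalently, the same change can be made with `kubectl set image deployment/test-deployment nginx=nginx:1.19.0 -n test-namespace`.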

Expected behavior

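  • The AnalysisRun should fail and the canary should not be promoted when the job's pod never starts. Instead, the run succeeds and the rollout is promoted without any failure being reported, even though the job pod never started and stayed in ErrImagePull. Both the job and the rollout are also deleted from the cluster.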

Screenshots

Screenshot 2024-05-17 at 4:41:52 PM; Screenshot 2024-05-17 at 4:42:09 PM (images not available)

Version
Tested on v1.6.6 and previous versions.
