Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP and Plugin nodes cannot be manually retried #14124

Open
4 tasks done
jswxstw opened this issue Jan 26, 2025 · 6 comments
Open
4 tasks done

HTTP and Plugin nodes cannot be manually retried #14124

jswxstw opened this issue Jan 26, 2025 · 6 comments
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries type/bug type/regression Regression from previous behavior (a specific type of bug)

Comments

@jswxstw
Copy link
Member

jswxstw commented Jan 26, 2025

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

#13734 refactored the retry logic, but HTTP and Plugin nodes cannot be manually retried after this change.

# ./dist/argo get http-template-98l24
Name:                http-template-98l24
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Failed
Message:             child 'http-template-98l24-3189028985' failed
Conditions:          
 PodRunning          False
 Completed           True
Created:             Sun Jan 26 15:02:00 +0800 (2 minutes ago)
Started:             Sun Jan 26 15:02:00 +0800 (2 minutes ago)
Finished:            Sun Jan 26 15:02:04 +0800 (2 minutes ago)
Duration:            4 seconds
Progress:            0/2
Parameters:          
  url:               https://raw.githubusercontent.com/argoproj/argo-workflows/thisisnotahash/pkg/apis/workflow/v1alpha1/generated.swagger.json

STEP                    TEMPLATE  PODNAME  DURATION  MESSAGE
 ✖ http-template-98l24  main                         child 'http-template-98l24-3189028985' failed  
 └─┬─✖ fail1            http                         received non-2xx response code: 404            
   └─✖ fail2            http                         received non-2xx response code: 404 
# ./dist/argo retry http-template-98l24 -p url="https://raw.githubusercontent.com/argoproj/argo-workflows/4e450e250168e6b4d51a126b784e90b11a0162bc/pkg/apis/workflow/v1alpha1/generated.swagger.json"
INFO[2025-01-26T15:04:33.446Z] Workflow to be dehydrated                     Workflow Size=4107
Name:                http-template-98l24
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:          
 PodRunning          False
 Completed           False
Created:             Sun Jan 26 15:02:00 +0800 (2 minutes ago)
Started:             Sun Jan 26 15:04:33 +0800 (now)
Duration:            0 seconds
Progress:            0/2
Parameters:          
  url:               https://raw.githubusercontent.com/argoproj/argo-workflows/4e450e250168e6b4d51a126b784e90b11a0162bc/pkg/apis/workflow/v1alpha1/generated.swagger.json

STEP                    TEMPLATE  PODNAME  DURATION  MESSAGE
 ● http-template-98l24  main                                                              
 └─┬─✖ fail1            http                         received non-2xx response code: 404  
   └─✖ fail2            http                         received non-2xx response code: 404

func shouldRetryFailedType(nodeTyp wfv1.NodeType) bool {
if nodeTyp == wfv1.NodeTypePod || nodeTyp == wfv1.NodeTypeContainer {
return true
}
return false
}

It seems that here missed HTTP and Plugin types.

Version(s)

c92ac36

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: http-template-
spec:
  entrypoint: main
  arguments:
    parameters:
      # good: https://raw.githubusercontent.com/argoproj/argo-workflows/4e450e250168e6b4d51a126b784e90b11a0162bc/pkg/apis/workflow/v1alpha1/generated.swagger.json
      # bad: https://raw.githubusercontent.com/argoproj/argo-workflows/thisisnotahash/pkg/apis/workflow/v1alpha1/generated.swagger.json
      - name: url
        value: "https://raw.githubusercontent.com/argoproj/argo-workflows/thisisnotahash/pkg/apis/workflow/v1alpha1/generated.swagger.json"
  templates:
    - name: main
      steps:
        - - name: fail1
            template: http
          - name: fail2
            template: http

    - name: http
      http:
        url: "{{workflow.parameters.url}}"

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@jswxstw jswxstw added area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries type/bug type/regression Regression from previous behavior (a specific type of bug) labels Jan 26, 2025
@Joibel
Copy link
Member

Joibel commented Feb 3, 2025

Flagging @isubasinghe

@isubasinghe
Copy link
Member

isubasinghe commented Feb 3, 2025

Thanks for the bug report @jswxstw, this was intentional on my part but this can be reverted to the previous logic if required by adding the types you want the ability to retry in the shouldRetryFailedType function. I just didn't know if it made sense for Plugin types to be retried especially, as a result I opted to make it a bit more restrictive.

I cannot remember exactly why now but at the time I remember wanting to restrict the node types you could retry.

@jswxstw
Copy link
Member Author

jswxstw commented Feb 8, 2025

I cannot remember exactly why now but at the time I remember wanting to restrict the node types you could retry.

Failed, Error and Omitted non-group nodes would be deleted and retried in the past. HTTP and Plugin nodes are actually executable and need to be retried.

switch mustFind {
case wfv1.NodeTypePod:
curr, err = resetPod(curr, addToReset, addToDelete)
case wfv1.NodeTypeSteps:
curr, err = resetSteps(curr, addToReset)
case wfv1.NodeTypeStepGroup:
curr, err = resetStepGroup(curr, addToReset)
case wfv1.NodeTypeDAG:
curr, err = resetDAG(curr, addToReset)
case wfv1.NodeTypeTaskGroup:
curr, err = resetTaskGroup(curr, addToReset)

@isubasinghe I have some questions here: mustFind can only be NodeTypePod, resetSteps, resetStepGroup, resetDAG and resetTaskGroup will never be used. What is the purpose of designing them?

@isubasinghe
Copy link
Member

@jswxstw yeah you are right that mustFind only hits NodeTypePod, this was a left over from when it used to seek NodeTypeSteps whenever it encountered NodeTypeStepGroup.

It's basically some code that got leftover from an optimisation I did to seek boundaries instead.

@jswxstw
Copy link
Member Author

jswxstw commented Feb 20, 2025

@isubasinghe Do these codes need to be retained? If not, I can clean them up in the fix PR.

@isubasinghe
Copy link
Member

@jswxstw i think feel free to clean it up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries type/bug type/regression Regression from previous behavior (a specific type of bug)
Projects
None yet
Development

No branches or pull requests

3 participants