Join failure within nested workflows can cause Parent workflow to run indefinitely. #212

Open
namachieli opened this issue Aug 7, 2020 · 0 comments

Summary

In a nested workflow, a join in the child workflow that fails as "unreachable" can cause the parent workflow to run indefinitely, even though the parent workflow reaches an acceptable completion point.

Error Messages

I've seen two different error messages when this scenario occurs:

"message": "UnreachableJoinError: The join task|route \"aggregate|1\" is partially satisfied but unreachable."
"message": "The join task \"aggregate\" is unreachable. A join task is determined to be unreachable if there are nested forks from multi-referenced tasks that join on the said task. This is ambiguous to the workflow engine because it does not know at which level should the join occurs.",

The longest I've seen it run was 69,552 seconds (over 19 hours), at which point I manually canceled it the following day.

Environment details

ST2 Version
st2 --version
st2 3.2.0, on Python 2.7.12
Distro
cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"
Other
  • Installed from one-liner script
  • Using EC2 server, not Docker or other virtualization
  • Running on AMI ami-0e32ec5bc225539f5 in AWS

Reproduction Workflow Examples

I have confirmed that the problem reproduces with the following workflows.

Parent WF

parent_wf.meta.yaml
pack: default
enabled: true
runner_type: orquesta
name: parent_wf
entry_point: workflows/parent_wf.yaml
parent_wf.yaml
version: 1.0
tasks:
  # [483, 337]
  task1:
    action: default.child_wf
    with:
      items: <% ctx(hosts) %>
      concurrency: 3
    next:
      - do:
          - complete
  # [483, 486]
  complete:
    action: core.noop
    join: all
vars:
  - hosts: ["host1","host2","host3"]

Child WF

child_wf.meta.yaml
pack: default
enabled: true
runner_type: orquesta
name: child_wf
entry_point: workflows/child_wf.yaml
child_wf.yaml
version: 1.0
tasks:
  # [489, 163]
  run:
    action: core.noop
    next:
      - do:
          - succeeds
          - fails
  # [348, 313]
  succeeds:
    action: core.local
    input:
      cmd: echo 'success'
    next:
      # #1072c6
      - do:
          - aggregate

  # [666, 311]
  fails:
    action: core.local
    input:
      cmd: echo 'fail'; exit 1
    next:
      # #1072c6
      - do:
          - aggregate

  # [518, 461]
  aggregate:
    action: core.noop
    join: all
    next:
      # #629e47
      - do:
          - continue_wf

  # [518, 593]
  continue_wf:
    action: core.noop

Expected Result

  • The child workflow's join fails because of an upstream action failure
  • The parent workflow sees the failure of the child workflow
  • The parent workflow waits for all child workflows to complete
  • The parent workflow moves on to the complete action
  • The parent workflow enters the succeeded or failed state accordingly

Observed Result

  • The child workflow's join fails because of an upstream action failure
  • The parent workflow sees the failure of the child workflow
  • The parent workflow waits for all child workflows to complete
  • The parent workflow moves on to the complete action
  • The parent workflow stays in the running state until it is canceled manually

Workaround

An acceptable workaround I have found is to make sure that each parallel branch (fork) of the child workflow has a core.noop task before the join, so that a success always happens on that branch. This allows the join to succeed and the workflow to continue gracefully.

With this change, the "Expected Result" is observed.
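
For anyone trying the workaround, here is a minimal sketch of how the child workflow might look with it applied. This is one reading of the description above, not a verified fix: the task names `succeeds_done` and `fails_done` are hypothetical, and the exact placement of the core.noop tasks is an assumption based on the reproduction workflow.

```yaml
version: 1.0
tasks:
  run:
    action: core.noop
    next:
      - do:
          - succeeds
          - fails

  succeeds:
    action: core.local
    input:
      cmd: echo 'success'
    next:
      - do:
          - succeeds_done

  fails:
    action: core.local
    input:
      cmd: echo 'fail'; exit 1
    next:
      - do:
          - fails_done

  # Hypothetical task: a no-op at the end of this branch, before the join.
  succeeds_done:
    action: core.noop
    next:
      - do:
          - aggregate

  # Hypothetical task: a no-op at the end of this branch, before the join.
  fails_done:
    action: core.noop
    next:
      - do:
          - aggregate

  aggregate:
    action: core.noop
    join: all
    next:
      - do:
          - continue_wf

  continue_wf:
    action: core.noop
```

The only change from the reproduction child_wf.yaml is the extra core.noop task on each branch, so that the last task reaching `aggregate` on every branch is one that succeeds.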
