Join failure within nested workflows can cause Parent workflow to run indefinitely. #212

Open
@namachieli

Description

Summary

In a nested workflow, when a join in the child workflow fails as "unreachable", the parent workflow can run indefinitely, even though the parent workflow reaches an acceptable completion point.

Error Messages

I've seen two different error messages when this scenario occurs:

"message": "UnreachableJoinError: The join task|route \"aggregate|1\" is partially satisfied but unreachable."
"message": "The join task \"aggregate\" is unreachable. A join task is determined to be unreachable if there are nested forks from multi-referenced tasks that join on the said task. This is ambiguous to the workflow engine because it does not know at which level should the join occurs.",

The longest I've seen it run was 69,552 seconds (over 19 hours), at which point I manually canceled it the following day.

Environment details

ST2 Version
st2 --version
st2 3.2.0, on Python 2.7.12
Distro
cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"
Other
  • Installed from one-liner script
  • Using EC2 server, not Docker or other virtualization
  • Running on AMI ami-0e32ec5bc225539f5 in AWS

Reproduction Workflow Examples

I have confirmed that the problem occurs with the following reproduction workflows.

Parent WF

parent_wf.meta.yaml
pack: default
enabled: true
runner_type: orquesta
name: parent_wf
entry_point: workflows/parent_wf.yaml
parent_wf.yaml
version: 1.0
tasks:
  # [483, 337]
  task1:
    action: default.child_wf
    with:
      items: <% ctx(hosts) %>
      concurrency: 3
    next:
      - do:
          - complete
  # [483, 486]
  complete:
    action: core.noop
    join: all
vars:
  - hosts: ["host1","host2","host3"]

Child WF

child_wf.meta.yaml
pack: default
enabled: true
runner_type: orquesta
name: child_wf
entry_point: workflows/child_wf.yaml
child_wf.yaml
version: 1.0
tasks:
  # [489, 163]
  run:
    action: core.noop
    next:
      - do:
          - succeeds
          - fails
  # [348, 313]
  succeeds:
    action: core.local
    input:
      cmd: echo 'success'
    next:
      # #1072c6
      - do:
          - aggregate

  # [666, 311]
  fails:
    action: core.local
    input:
      cmd: echo 'fail'; exit 1
    next:
      # #1072c6
      - do:
          - aggregate

  # [518, 461]
  aggregate:
    action: core.noop
    join: all
    next:
      # #629e47
      - do:
          - continue_wf

  # [518, 593]
  continue_wf:
    action: core.noop

Expected Result

  • Child workflow join fails because of an upstream action failure
  • Parent workflow sees the failure of the child workflow
  • Parent workflow waits for all child workflows to complete
  • Parent workflow moves on to the complete action
  • Parent workflow enters the Succeeded/Failed state accordingly

Observed Result

  • Child workflow join fails because of an upstream action failure
  • Parent workflow sees the failure of the child workflow
  • Parent workflow waits for all child workflows to complete
  • Parent workflow moves on to the complete action
  • Parent workflow stays in the running state until canceled manually

Screen Shot 2020-08-07 at 3 39 27 PM

Workaround

An acceptable workaround I have found is to make sure that each parallel branch (fork) of the child workflow runs a core.noop task before it reaches the join. That way every branch always finishes with a success, which allows the join to succeed and the workflow to continue gracefully.

This causes the "Expected Result" to be observed.
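
As a rough sketch of the workaround, the fails branch of child_wf.yaml could be routed through a no-op before the join, and the succeeds branch would get the same treatment. The task name fails_done is made up for illustration, and the fragment assumes the rest of child_wf.yaml stays as shown above. Treat it as an illustration of the idea rather than a verified fix.

```yaml
# Fragment of child_wf.yaml (sketch only; `fails_done` is a hypothetical task name).
tasks:
  fails:
    action: core.local
    input:
      cmd: echo 'fail'; exit 1
    next:
      # Hand off to a no-op instead of going straight to the join.
      - do:
          - fails_done

  # Hypothetical buffer task: core.noop is expected to succeed, so the
  # branch arrives at the join from a successful task.
  fails_done:
    action: core.noop
    next:
      - do:
          - aggregate
```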
