Description
Summary
When using nested workflows, a join that fails as "unreachable" in the child workflow can cause the parent workflow to run indefinitely, even though the parent workflow reaches an acceptable completion point.
Error Messages
I've seen two variants of the error message when this scenario presents itself:

```
"message": "UnreachableJoinError: The join task|route \"aggregate|1\" is partially satisfied but unreachable."

"message": "The join task \"aggregate\" is unreachable. A join task is determined to be unreachable if there are nested forks from multi-referenced tasks that join on the said task. This is ambiguous to the workflow engine because it does not know at which level should the join occurs.",
```
The longest I've seen it run was until I manually canceled it the following day, at 69,552 seconds (over 19 hours).
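In each case the stuck parent execution had to be found and canceled by hand, along these lines (`<execution-id>` is a placeholder):

```shell
# List executions to find the parent workflow stuck in a running state,
# then cancel it by ID.
st2 execution list
st2 execution cancel <execution-id>
```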
Environment details
ST2 Version
```shell
$ st2 --version
st2 3.2.0, on Python 2.7.12
```
Distro
```shell
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"
```
Other
- Installed from the one-liner install script
- Using an EC2 server, not Docker or other virtualization
- Running in AWS on AMI `ami-0e32ec5bc225539f5`
Reproduction Workflow Examples
I have verified with the following reproduction workflows that the problem presents itself.
Parent WF
parent_wf.meta.yaml
```yaml
pack: default
enabled: true
runner_type: orquesta
name: parent_wf
entry_point: workflows/parent_wf.yaml
```
parent_wf.yaml
```yaml
version: 1.0

tasks:
  # [483, 337]
  task1:
    action: default.child_wf
    with:
      items: <% ctx(hosts) %>
      concurrency: 3
    next:
      - do:
          - complete

  # [483, 486]
  complete:
    action: core.noop
    join: all

vars:
  - hosts: ["host1", "host2", "host3"]
```
Child WF
child_wf.meta.yaml
```yaml
pack: default
enabled: true
runner_type: orquesta
name: child_wf
entry_point: workflows/child_wf.yaml
```
child_wf.yaml
```yaml
version: 1.0

tasks:
  # [489, 163]
  run:
    action: core.noop
    next:
      - do:
          - succeeds
          - fails

  # [348, 313]
  succeeds:
    action: core.local
    input:
      cmd: echo 'success'
    next:
      # #1072c6
      - do:
          - aggregate

  # [666, 311]
  fails:
    action: core.local
    input:
      cmd: echo 'fail'; exit 1
    next:
      # #1072c6
      - do:
          - aggregate

  # [518, 461]
  aggregate:
    action: core.noop
    join: all
    next:
      # #629e47
      - do:
          - continue_wf

  # [518, 593]
  continue_wf:
    action: core.noop
```
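For completeness, one way to register and trigger these workflows on a one-liner install (paths assume the standard `/opt/stackstorm/packs` layout from the install script):

```shell
# Copy the action metadata into the default pack, and the workflow
# definitions into its workflows/ directory (per the entry_point values).
cp parent_wf.meta.yaml child_wf.meta.yaml /opt/stackstorm/packs/default/actions/
cp parent_wf.yaml child_wf.yaml /opt/stackstorm/packs/default/actions/workflows/

# Register the new actions and run the parent workflow.
st2ctl reload --register-actions
st2 run default.parent_wf
```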
Expected Result
- Child workflow `join` fails due to upstream action failure
- Parent workflow sees failure of the child workflow
- Parent workflow waits for all child workflows to complete
- Parent workflow moves on to the `complete` action
- Parent workflow enters the Success/Failed state accordingly
Observed Result
- Child workflow `join` fails due to upstream action failure
- Parent workflow sees failure of the child workflow
- Parent workflow waits for all child workflows to complete
- Parent workflow moves on to the `complete` action
- Parent workflow remains in the `running` state until canceled manually
Workaround
An acceptable workaround I have found is to ensure that each parallel silo (fork) of the child workflow ends in a `core.noop` task before being joined, so that a success always happens; this allows the join to succeed and the workflow to continue gracefully. With this change, the "Expected Result" above is observed. A sketch is shown below.
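For illustration, here is a minimal sketch of the workaround applied to the failing branch of the reproduction child workflow. The `fails_noop` task name is my own; any always-succeeding `core.noop` placed between the fork and the join should behave the same way.

```yaml
  fails:
    action: core.local
    input:
      cmd: echo 'fail'; exit 1
    next:
      - when: <% succeeded() %>
        do:
          - aggregate
      # Remediate the failure by routing through an always-succeeding task,
      # so this branch still completes and the join remains reachable.
      - when: <% failed() %>
        do:
          - fails_noop

  # Hypothetical helper task; succeeds unconditionally before the join.
  fails_noop:
    action: core.noop
    next:
      - do:
          - aggregate
```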