Summary
In a nested workflow, a join that fails as "unreachable" in the child workflow can cause the parent workflow to run indefinitely, even though the parent workflow reaches an acceptable completion point.
Error Messages
I have seen two different error messages when this scenario occurs:
"message": "UnreachableJoinError: The join task|route \"aggregate|1\" is partially satisfied but unreachable."
"message": "The join task \"aggregate\" is unreachable. A join task is determined to be unreachable if there are nested forks from multi-referenced tasks that join on the said task. This is ambiguous to the workflow engine because it does not know at which level should the join occurs.",
The longest I have seen a parent workflow keep running was 69,552 seconds (over 19 hours), at which point I canceled it manually the following day.
Expected Result
1. The child workflow's join fails because of an upstream action failure.
2. The parent workflow sees the failure of the child workflow.
3. The parent workflow waits for all child workflows to complete.
4. The parent workflow moves on to the complete action.
5. The parent workflow enters the Success or Failed state accordingly.
Observed Result
1. The child workflow's join fails because of an upstream action failure.
2. The parent workflow sees the failure of the child workflow.
3. The parent workflow waits for all child workflows to complete.
4. The parent workflow moves on to the complete action.
5. The parent workflow stays in the Running state until it is canceled manually.
Workaround
An acceptable workaround I have found is to give each parallel branch (fork) of the child workflow a core.noop task immediately before the join. This guarantees that every branch ends in a success, which allows the join to succeed and the workflow to continue gracefully.
With this in place, the "Expected Result" is observed.
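The workaround can be sketched as a child workflow definition. This is a minimal illustration, not the actual child_wf.yaml from the reproduction. The task names (start, work_a, settle_a, work_b, settle_b), the core.local commands, and the use of `<% completed() %>` to route both success and failure into the core.noop are assumptions made for the example. Only the join task name "aggregate" and the core.noop action come from the report above.

```yaml
version: 1.0

description: Hypothetical child workflow showing a core.noop before the join.

tasks:
  # Fork into two parallel branches.
  start:
    action: core.noop
    next:
      - do:
          - work_a
          - work_b

  # Branch A: the real work, which is allowed to fail.
  work_a:
    action: core.local
    input:
      cmd: "exit 1"
    next:
      # Assumption: route to the noop whether work_a succeeded or failed.
      - when: <% completed() %>
        do: settle_a

  # The noop always succeeds, so branch A always reaches the join.
  settle_a:
    action: core.noop
    next:
      - do: aggregate

  # Branch B: same pattern.
  work_b:
    action: core.local
    input:
      cmd: "echo b"
    next:
      - when: <% completed() %>
        do: settle_b

  settle_b:
    action: core.noop
    next:
      - do: aggregate

  # The join now receives a successful inbound task from every branch.
  aggregate:
    join: all
    action: core.noop
```

One consequence of this pattern is that a failure in work_a no longer fails the child workflow by itself. If the parent needs to know about it, the failure would have to be reported some other way, for example through a published variable.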
Environment details
ST2 Version
Distro
Other
ami-0e32ec5bc225539f5 in AWS
Reproduction Workflow Examples
I have confirmed that the problem occurs with these reproduction workflows.
Parent WF
parent_wf.meta.yaml
parent_wf.yaml
Child WF
child_wf.meta.yaml
child_wf.yaml