Orquesta workflow gets stuck in running state #5029

Closed
igcherkaev opened this issue Aug 27, 2020 · 5 comments · Fixed by StackStorm/orquesta#213

@igcherkaev

SUMMARY

An Orquesta workflow gets stuck in the running state, and with-items behaves incorrectly (or at least differently from Mistral, and nothing documents the difference as intentional). In Mistral, a task with with-items does not transition to the next task until all items are processed, regardless of whether any item fails partway through the loop. In Orquesta, if the first item fails, the engine immediately starts the task defined under the failed() condition. If all items succeed, Orquesta behaves exactly like Mistral: it processes every item first, then transitions to the next task. I think this is a bug, and it in turn leaves the workflow stuck in the running state, never reaching a final state.
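To make the difference concrete, here is a toy Python model of the two behaviors described above. It is not Orquesta or Mistral code; `run_with_items` and its `wait_for_all` flag are hypothetical names made up for illustration, and the model only covers the `concurrency: 1` case.

```python
def run_with_items(exit_codes, wait_for_all):
    """Toy model of a with-items task that runs one item at a time.

    exit_codes: the exit code each item would produce, in order.
    wait_for_all: True models "process all items, then transition";
                  False models "transition as soon as one item fails".
    Returns (task_status, items_processed).
    """
    processed = 0
    failed = False
    for code in exit_codes:
        processed += 1
        if code != 0:
            failed = True
            if not wait_for_all:
                # Behavior reported in this issue: stop at the first failure.
                break
    return ("failed" if failed else "succeeded", processed)


# Items from task_1 in the reproduction workflow: "exit 1", then "exit 2".
print(run_with_items([1, 2], wait_for_all=True))
print(run_with_items([1, 2], wait_for_all=False))
```

With `wait_for_all=True` both items are processed before the task reports failure; with `wait_for_all=False` the task reports failure after the first item, which is the behavior this issue is about.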

STACKSTORM VERSION

Paste the output of st2 --version:

st2 3.2.0, on Python 2.7.5
OS, environment, install method

Post what OS you are running this on, along with any other relevant information.

Steps to reproduce the problem

Meta:

---
pack: "playground"
name: "wf_orquesta_stuck3"
description: "Orquesta workflow gets stuck in running bug, st2 v3.2.0"
runner_type: orquesta
enabled: true
entry_point: "workflows/wf_orquesta_stuck3.yaml"

Workflow:

---
version: '1.0'

tasks:
  init_task:
    action: core.noop
    next:
      - when: <% succeeded() %>
        do:
          - task_1
          - task_2
  task_1:
    with:
      items: i in <% ["1", "2"] %>
      concurrency: 1
    action: core.local
    input:
      cmd: "exit <% item(i) %>"
    next:
      - when: <% succeeded() or failed() %>
        do:
          - run_check_1

  task_2:
    with:
      items: i in <% ["0", "0"] %>
      concurrency: 1
    action: core.local
    input:
      cmd: "exit <% item(i) %>"
    next:
      - when: <% succeeded() or failed() %>
        do:
          - run_check_2

  run_check_1:
    with:
      items: i in <% ["0", "0"] %>
      concurrency: 1
    action: core.local
    input:
      cmd: "exit <% item(i) %>"
    next:
      - when: <% succeeded() %>
        do:
          - all_good
      - when: <% failed() %>
        do:
          - check_failed

  run_check_2:
    with:
      items: i in <% ["0", "0"] %>
      concurrency: 1
    action: core.local
    input:
      cmd: "exit <% item(i) %>"
    next:
      - when: <% succeeded() %>
        do:
          - all_good
      - when: <% failed() %>
        do:
          - check_failed

  all_good:
    join: all
    action: core.noop

  check_failed:
    action: core.noop
    next:
      - do:
          - fail

Expected Results

Both task_1 items run and fail, then both run_check_1 items run, and the workflow completes instead of getting stuck.

Actual Results

image

As can be seen here, right after the first task_1 item failed, the Orquesta engine started run_check_1 without waiting for the remaining item in the list to be processed. For task_2, it did wait for both items before transitioning to run_check_2, which is the expected behavior in both cases. This workflow never completes.

@igcherkaev
Author

The workflow above is a heavily simplified version of one that works fine with Mistral. After converting it to Orquesta, I have not found a workaround so far :(

@igcherkaev
Author

The workflow can be simplified to:

---
version: '1.0'

tasks:
  init_task:
    action: core.noop
    next:
      - when: <% succeeded() %>
        do:
          - task_1

  task_1:
    with:
      items: i in <% ["1", "1"] %>
      concurrency: 1
    action: core.local
    input:
      cmd: "exit <% item(i) %>"
    next:
      - when: <% succeeded() or failed() %>
        do:
          - run_check_1

  run_check_1:
    action: core.local
    input:
      cmd: "exit 0"
    next:
      - when: <% succeeded() %>
        do:
          - all_good
      - when: <% failed() %>
        do:
          - check_failed

  all_good:
    action: core.noop

  check_failed:
    action: core.noop
    next:
      - do:
          - fail

@igcherkaev
Author

Ok, I dug a bit into the sources of orquesta, and found a couple of things:

First:

https://github.com/StackStorm/orquesta/blob/master/orquesta/machines.py#L305:

        events.ACTION_FAILED_TASK_DORMANT_ITEMS_INCOMPLETE: statuses.FAILED,

Changed it to:

        events.ACTION_FAILED_TASK_DORMANT_ITEMS_INCOMPLETE: statuses.RUNNING,

and also added (probably not necessary, because in my case the one above was causing this issue, but I added it just in case, to complete the herd, so to speak):

        events.ACTION_FAILED_TASK_ACTIVE_ITEMS_INCOMPLETE: statuses.RUNNING,

That restored the logic of "process all items first, then transition to next task" based on the condition defined in when.
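A minimal sketch of why that one-line change matters, assuming the task state machine is essentially a lookup from event to next status. The two event names are the ones quoted above; the plain dicts, the string statuses, and `next_status` are simplified stand-ins, not Orquesta's actual data structures.

```python
RUNNING, FAILED = "running", "failed"

# Simplified stand-in for the relevant entry before the change.
ORIGINAL = {
    "ACTION_FAILED_TASK_DORMANT_ITEMS_INCOMPLETE": FAILED,
}

# The same entry after the change, plus the extra one added "just in case".
PATCHED = {
    "ACTION_FAILED_TASK_DORMANT_ITEMS_INCOMPLETE": RUNNING,
    "ACTION_FAILED_TASK_ACTIVE_ITEMS_INCOMPLETE": RUNNING,
}


def next_status(table, event, current=RUNNING):
    """Look up the task status after `event`; unknown events keep the status."""
    return table.get(event, current)


event = "ACTION_FAILED_TASK_DORMANT_ITEMS_INCOMPLETE"
print(next_status(ORIGINAL, event))  # task is marked failed while items remain
print(next_status(PATCHED, event))   # task stays running until items are done
```

In this model, the original table marks the task as failed while items are still incomplete, so the `when` conditions fire early; the patched table keeps it running until the remaining items have been processed.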

Second - and this one is a bit more complicated:

https://github.com/StackStorm/orquesta/blob/master/orquesta/conducting.py#L912-L915

            # Remove task from staging if exists but keep entry
            # if task has items and failed for manual rerun.
            if not (task_spec.has_items() and new_task_status in statuses.ABENDED_STATUSES):
                self.workflow_state.remove_staged_task(task_id, route)

This condition keeps the staged task in place when a with-items task fails after processing all of its items, which leaves the workflow stuck in the running state. When I drop the condition and remove the staged task every time, the bug goes away; however, I don't know the consequences for the "manual rerun" mentioned in the comment.
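The effect of that condition can be sketched with a small model. This is not the conductor's real code: `update_staged`, `workflow_can_finish`, the `always_remove` flag, and the single-entry `ABENDED` set are hypothetical simplifications, and the idea that a leftover staged entry is what blocks completion is taken from the observation above rather than from a full reading of the source.

```python
# Simplified stand-in for statuses.ABENDED_STATUSES (only "failed" modeled).
ABENDED = {"failed"}


def update_staged(staged, task_id, has_items, new_status, always_remove=False):
    """Return the set of staged task ids after task_id reaches new_status.

    Mirrors the quoted condition: a failed with-items task keeps its staged
    entry (for manual rerun) unless always_remove is set.
    """
    keep_for_rerun = has_items and new_status in ABENDED
    if always_remove or not keep_for_rerun:
        return staged - {task_id}
    return staged


def workflow_can_finish(staged):
    # Assumption of this sketch: a leftover staged entry blocks completion.
    return not staged


staged = {"task_1"}
current = update_staged(staged, "task_1", has_items=True, new_status="failed")
patched = update_staged(staged, "task_1", has_items=True, new_status="failed",
                        always_remove=True)
print(workflow_can_finish(current), workflow_can_finish(patched))
```

The sketch also shows the trade-off: always removing the entry unblocks completion, but it discards whatever the kept entry was meant to provide for manual rerun.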

@m4dcoder could you suggest a better fix than removing the condition?

@igcherkaev
Author

By the way, this also fixes #4968

igcherkaev added a commit to igcherkaev/orquesta that referenced this issue Aug 27, 2020
igcherkaev added a commit to igcherkaev/orquesta that referenced this issue Aug 27, 2020
@namachieli

namachieli commented Aug 27, 2020

Possibly related - StackStorm/orquesta#212
Thank you for the troubleshooting on this!

@m4dcoder m4dcoder added this to the 3.3.0 milestone Aug 29, 2020
@m4dcoder m4dcoder self-assigned this Aug 29, 2020
copart-jafloyd pushed a commit to copartit/orquesta that referenced this issue May 5, 2021
StackStorm/st2#5029 (comment)

> That restored the logic of
> "process all items first, then transition to next task"
> based on the condition defined in when.

Originally in:
igcherkaev@2baa66d
copart-jafloyd pushed a commit to copartit/orquesta that referenced this issue Dec 31, 2021
StackStorm/st2#5029 (comment)

> That restored the logic of
> "process all items first, then transition to next task"
> based on the condition defined in when.

Originally in:
igcherkaev@2baa66d