Add try/except on release of work Unit and add force to workunit reaper #15129

Open
wants to merge 2 commits into
base: devel

tanganellilore
Contributor

SUMMARY

If there is a problem between an execution node and AWX, and AWX does not detect that the execution node is unhealthy or unreachable, or the work unit has simply been deleted, the workflow keeps waiting in the running state. (I have not identified the exact use case, but it happened to me on 24.2.0 with an execution node and ansible runner 1.4.3.)
If we then try to cancel the job/workflow via the UI, we get the error below on the awx-task pod and the job is never cancelled or stopped.

2024-04-23T09:11:30.367001675+02:00   File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/worker/task.py", line 103, in perform_work
    result = self.run_callable(body)
             ^^^^^^^^^^^^^^^^^^^^^^^
2024-04-23T09:11:30.367012067+02:00   File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/worker/task.py", line 78, in run_callable
2024-04-23T09:11:30.367015375+02:00     return _call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
2024-04-23T09:11:30.367023213+02:00   File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/system.py", line 687, in awx_receptor_workunit_reaper
    receptor_ctl.simple_command(f"work cancel {job.work_unit_id}")
2024-04-23T09:11:30.367031453+02:00   File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/receptorctl/socket_interface.py", line 83, in simple_command
2024-04-23T09:11:30.367035057+02:00     return self.read_and_parse_json()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-04-23T09:11:30.367042158+02:00   File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/receptorctl/socket_interface.py", line 60, in read_and_parse_json
    raise RuntimeError(text[7:])
RuntimeError: error cancelling remote unit:  unknown work unit wwXpmxdB

In this PR I simply wrap the for loop in a try/except and delegate the release to the workunit reaper, where I use the force-release command instead of a plain release.

I think we need to force the release inside the for loop, because administrative_workunit_reaper performs a lot of checks on the work unit side, which does not make much sense to me since we already filter by ACTIVE_STATES in the UnifiedJob filter.

If this is true, I can change it to add a force-release command on exception, so that we are sure the work units are released when cancel is clicked in the UI.
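
A minimal sketch of that fallback, not the actual diff in this PR. It assumes the control object exposes `simple_command()` (as receptorctl's `ReceptorControl` does, per the traceback above) and that receptor accepts `work release` and `work force-release` next to `work cancel`. The helper name `cancel_and_release` and the `FakeReceptorControl` stand-in are hypothetical and exist only to make the example runnable without a receptor socket.

```python
import logging

logger = logging.getLogger(__name__)


def cancel_and_release(receptor_ctl, work_unit_id):
    """Cancel and release a work unit; force-release it if that fails.

    Hypothetical helper for illustration, not AWX code.
    """
    try:
        receptor_ctl.simple_command(f"work cancel {work_unit_id}")
        receptor_ctl.simple_command(f"work release {work_unit_id}")
        return "released"
    except RuntimeError as exc:
        # e.g. "error cancelling remote unit:  unknown work unit ..."
        logger.warning("cancel/release of %s failed: %s", work_unit_id, exc)

    try:
        receptor_ctl.simple_command(f"work force-release {work_unit_id}")
        return "force-released"
    except RuntimeError as exc:
        logger.error("force-release of %s failed: %s", work_unit_id, exc)
        return "failed"


class FakeReceptorControl:
    """Stand-in that raises on cancel for unknown units (assumption)."""

    def __init__(self, known_units):
        self.known_units = set(known_units)
        self.commands = []

    def simple_command(self, command):
        self.commands.append(command)
        _, op, unit_id = command.split()
        if op == "cancel" and unit_id not in self.known_units:
            raise RuntimeError(
                f"error cancelling remote unit:  unknown work unit {unit_id}"
            )
        return {"unit": unit_id}


if __name__ == "__main__":
    ctl = FakeReceptorControl(known_units=[])
    print(cancel_and_release(ctl, "wwXpmxdB"))
    print(ctl.commands)
```

With this shape, one unknown work unit no longer aborts the loop over the remaining jobs, and the cancel request still ends with the unit being released.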

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • Receptor
AWX VERSION
24.2.0
ADDITIONAL INFORMATION

@tanganellilore
Contributor Author

Hi @fosterseth,
reformatted as per the discussion above.
