awx workflow_job_templates launch --wait command fails with ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) #15104

Open
5 of 11 tasks
akshat87 opened this issue Apr 11, 2024 · 4 comments

Comments

@akshat87

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

awx workflow_job_templates launch --wait command fails with ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

The command should wait for the workflow to complete in Ansible Tower, but instead the remote connection is closed.

AWX version

24.2.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

N/A

Modifications

no

Ansible version

Ansible Automation Platform Controller 4.3.6

Operating system

redhat linux 8

Web browser

No response

Steps to reproduce

awx workflow_job_templates launch --wait command fails with ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Expected results

The command should wait for the workflow to complete in Ansible Tower, but instead the remote connection is closed.

Actual results

awx workflow_job_templates launch --wait command fails with ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Additional information

No response

@fosterseth
Member

Do you see the jobs finish in the UI? How long do these workflows run, and how long did the CLI command wait before returning the RemoteDisconnected error?
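One way to gather the timing data asked about here, without holding a single long-lived connection, is to launch the workflow without --wait and poll the job status separately. The following is a minimal sketch, not the CLI's own implementation. The function names (`wait_for_workflow`, `make_fetcher`), the retry limits, and the use of a bearer token are assumptions for illustration; the endpoint path `/api/v2/workflow_jobs/<id>/` and its `status` field come from the AWX REST API.

```python
import json
import time
import urllib.request
from typing import Callable

# Terminal values of the `status` field on an AWX job.
TERMINAL_STATES = {"successful", "failed", "error", "canceled"}


def make_fetcher(base_url: str, job_id: int, token: str) -> Callable[[], str]:
    """Build a callable that returns the current status of one workflow job.

    base_url, job_id and token are placeholders supplied by the caller.
    """
    url = f"{base_url.rstrip('/')}/api/v2/workflow_jobs/{job_id}/"

    def fetch() -> str:
        req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.load(resp)["status"]

    return fetch


def wait_for_workflow(
    fetch_status: Callable[[], str],
    interval: float = 5.0,
    timeout: float = 3600.0,
    max_consecutive_errors: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state, tolerating dropped connections."""
    deadline = time.monotonic() + timeout
    errors = 0
    while time.monotonic() < deadline:
        try:
            status = fetch_status()
        except OSError:
            # RemoteDisconnected and urllib's URLError are both OSError subclasses.
            errors += 1
            if errors >= max_consecutive_errors:
                raise
        else:
            errors = 0
            if status in TERMINAL_STATES:
                return status
        sleep(interval)
    raise TimeoutError("workflow job did not reach a terminal state before the timeout")
```

Because each poll is a short request, a proxy or ingress idle timeout between the client and the controller is less likely to cut the wait short, and a single dropped connection is retried rather than treated as fatal.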

@XakV

XakV commented Apr 26, 2024

I've seen similar behavior in AWX 23.3.1. The template involves an ansible.builtin.uri call to VMware orchestrator, followed by an ansible.builtin.wait_for_connection task. The job log pauses here:

Using module file /usr/local/lib/python3.11/site-packages/ansible/modules/ping.py
Pipelining is enabled.
<host.fqdn> ESTABLISH SSH CONNECTION FOR USER: $ANSIBLE_REMOTE_USER
<host.fqdn> SSH: EXEC ssh -vvv -o ServerAliveInterval=30 -o ControlMaster=auto -o ControlPersist=60 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="$ANSIBLE_REMOTE_USER"' -o ConnectTimeout=120 -o 'ControlPath="/tmp/ansible-root-%h"' host.fqdn '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
wait_for_connection: attempting ping module test
sending connection check: [b'ssh', b'-vvv', b'-o', b'ServerAliveInterval=30', b'-o', b'ControlMaster=auto', b'-o', b'ControlPersist=60', b'-o', b'StrictHostKeyChecking=no', b'-o', b'UserKnownHostsFile=/dev/null', b'-o', b'StrictHostKeyChecking=no', b'-o', b'KbdInteractiveAuthentication=no', b'-o', b'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', b'-o', b'PasswordAuthentication=no', b'-o', b'User="$ANSIBLE_REMOTE_USER"', b'-o', b'ConnectTimeout=120', b'-o', b'ControlPath="/tmp/ansible-root-%h"', b'-O', b'check', b'host.fqdn']

While the job log in the WebUI hangs here, the awx-task-runner-blah-blah container repeatedly (about every second or so) attempts to make a connection.

You can watch the connection attempts by opening a shell in the task-runner container, identifying the parent ansible process/thread ID, and then inferring the PID/TID of the active children. Essentially, run ls -l /proc, and if you suspect the child PIDs are in the range 200 to 399, run: while true; do cat /proc/[2,3]*/cmdline; done.
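The same inspection can be sketched in Python, which makes the output easier to read because /proc/<pid>/cmdline separates arguments with NUL bytes. This is only an illustration of the approach described above: the function names and the `proc_root` parameter are hypothetical, and the 200 to 399 range is the example range from the comment, not a fixed value.

```python
from pathlib import Path
from typing import Iterator, Tuple


def decode_cmdline(raw: bytes) -> str:
    """Turn the NUL-separated contents of /proc/<pid>/cmdline into one line."""
    parts = [p.decode(errors="replace") for p in raw.split(b"\0") if p]
    return " ".join(parts)


def iter_cmdlines(lo: int, hi: int, proc_root: str = "/proc") -> Iterator[Tuple[int, str]]:
    """Yield (pid, command line) for processes whose PID lies in [lo, hi]."""
    pids = sorted(
        int(entry.name)
        for entry in Path(proc_root).iterdir()
        if entry.name.isdigit() and lo <= int(entry.name) <= hi
    )
    for pid in pids:
        try:
            raw = (Path(proc_root) / str(pid) / "cmdline").read_bytes()
        except OSError:
            continue  # the process exited between listing and reading
        if raw:  # kernel threads have an empty cmdline
            yield pid, decode_cmdline(raw)


if __name__ == "__main__":
    for pid, cmd in iter_cmdlines(200, 399):
        print(pid, cmd)
```

Unlike the shell glob, which matches any PID beginning with 2 or 3 regardless of length, this filters on the numeric range.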

I can provide additional info if needed. The AWX install lives on a Rancher cluster running k8s v1.24.17 on RHEL 7 hosts. Ingress is nginx, networking is Canal, and PVCs are provided by Portworx.

@XakV

XakV commented Apr 26, 2024

Adding relevant bits of our ansible.cfg

[defaults]

home = .ansible
roles_path    = roles
playbook_dir = playbooks
transport = smart
collections_path = .ansible/collections:/usr/share/ansible/collections:.venv/lib/python3.11/site-packages/ansible_collections/
remote_user = $ANSIBLE_REMOTE_USER
remote_tmp     = /tmp/$USER/.ansible
gather_subset = all
interpreter_python = auto
host_key_checking = False
timeout = 120
verbosity = 1
module_name = shell
ansible_managed = Ansible managed: {file} modified on %Y-%m-%d %H:%M by root on {host}
system_warnings = True
deprecation_warnings = True
command_warnings = False
callbacks_enabled = ansible.posix.profile_tasks
stdout_callback = yaml
display_skipped_hosts = False
retry_files_enabled = False
var_compression_level = 9
jinja2_extensions = jinja2.ext.do

[callback_profile_tasks]
task_output_limit = 5

[inventory]
enable_plugins=ansible.builtin.constructed, host_list, script, auto, yaml, ini, toml

[privilege_escalation]
become_ask_pass=False
become_method=sudo
become_flags="-iS"

[ssh_connection]

ssh_args = -o ServerAliveInterval=30 -o ControlMaster=auto -o ControlPersist=60 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
control_path = /tmp/ansible-root-%%h
pipelining = True
transfer_method = smart

[persistent_connection]

connect_timeout = 30
connect_retries = 30
connect_interval = 1

Note that I'm substituting $ANSIBLE_REMOTE_USER for the actual user name.

@XakV

XakV commented Apr 26, 2024

Found the error that ended the task above.

{"log":"2024-04-26 15:10:21,352 INFO [c3b7da2d511940cd9f42ad53edf60a96] awx.main.scheduler Workflow job 29241 failed due to reason: No error handling path for workflow job node(s) [(4838,error)]. Workflow job node(s) missing unified job template and error handling path [].\n","stream":"stderr","time":"2024-04-26T15:10:21.353827271Z"}
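Lines like this one are container JSON logs, so the workflow job ID and failure reason can be pulled out programmatically when searching a larger log file. A minimal sketch, assuming the message format shown above stays stable (the regular expression and function name are illustrative, not part of AWX):

```python
import json
import re
from typing import Optional

# Matches the scheduler message seen in the log line above.
WORKFLOW_FAILURE = re.compile(
    r"Workflow job (?P<job_id>\d+) failed due to reason: (?P<reason>.+)"
)

# The log line quoted in this comment, split across raw strings for readability.
SAMPLE = (
    r'{"log":"2024-04-26 15:10:21,352 INFO [c3b7da2d511940cd9f42ad53edf60a96] '
    r'awx.main.scheduler Workflow job 29241 failed due to reason: '
    r'No error handling path for workflow job node(s) [(4838,error)]. '
    r'Workflow job node(s) missing unified job template and error handling path [].\n",'
    r'"stream":"stderr","time":"2024-04-26T15:10:21.353827271Z"}'
)


def parse_workflow_failure(line: str) -> Optional[dict]:
    """Return job ID, reason and timestamp for a workflow failure line, else None."""
    record = json.loads(line)
    match = WORKFLOW_FAILURE.search(record.get("log", ""))
    if match is None:
        return None
    return {
        "time": record.get("time"),
        "job_id": int(match["job_id"]),
        "reason": match["reason"].strip(),
    }


if __name__ == "__main__":
    print(parse_workflow_failure(SAMPLE))
```

The reported reason points at workflow node 4838 ending in an error state with no failure path defined for it, which is a workflow-level failure rather than a CLI connection problem.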
