Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWX v24.3.0 update breaks deployment (psycopg connection issue?) #15136

Closed
5 of 11 tasks
ilbarone87 opened this issue Apr 23, 2024 · 15 comments
Closed
5 of 11 tasks

AWX v24.3.0 update breaks deployment (psycopg connection issue?) #15136

ilbarone87 opened this issue Apr 23, 2024 · 15 comments
Assignees

Comments

@ilbarone87
Copy link

ilbarone87 commented Apr 23, 2024

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

Hello after updating to Awx-operator 2.16.0 with awx image 2.43.0 the Awx-Task pods is not able to start anymore and it throws these errors

[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1
Instance Group already registered controlplane
Instance Group already registered default
Successfully registered instance None
(changed: True)
2024-04-23 23:36:25,615 INFO RPC interface 'supervisor' initialized
2024-04-23 23:36:25,615 INFO RPC interface 'supervisor' initialized
2024-04-23 23:36:25,615 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-23 23:36:25,615 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-23 23:36:25,615 INFO supervisord started with pid 6
2024-04-23 23:36:25,615 INFO supervisord started with pid 6
2024-04-23 23:36:26,619 INFO spawned: 'superwatcher' with pid 36
2024-04-23 23:36:26,619 INFO spawned: 'superwatcher' with pid 36
2024-04-23 23:36:26,621 INFO spawned: 'dispatcher' with pid 37
2024-04-23 23:36:26,621 INFO spawned: 'dispatcher' with pid 37
2024-04-23 23:36:26,624 INFO spawned: 'callback-receiver' with pid 38
2024-04-23 23:36:26,624 INFO spawned: 'callback-receiver' with pid 38
2024-04-23 23:36:26,641 INFO spawned: 'wsrelay' with pid 39
2024-04-23 23:36:26,641 INFO spawned: 'wsrelay' with pid 39
READY
2024-04-23 23:36:27,647 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-23 23:36:27,647 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-23 23:36:27,954 INFO     [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'awx-task-99dc479f7-szzdp']
double free or corruption (out)
2024-04-23 23:36:28,011 INFO     [-] awx.main.commands.run_callback_receiver Callback receiver started with pid=38
2024-04-23 23:36:28,023 INFO     [-] awx.main.wsrelay Active instance with hostname awx-task-99dc479f7-szzdp is registered.
2024-04-23 23:36:28,089 WARN exited: dispatcher (terminated by SIGABRT (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: dispatcher (terminated by SIGABRT (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: wsrelay (terminated by SIGSEGV (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: wsrelay (terminated by SIGSEGV (core dumped); not expected)
2024-04-23 23:36:29,091 INFO spawned: 'dispatcher' with pid 67
2024-04-23 23:36:29,091 INFO spawned: 'dispatcher' with pid 67
2024-04-23 23:36:29,092 INFO spawned: 'wsrelay' with pid 68
2024-04-23 23:36:29,092 INFO spawned: 'wsrelay' with pid 68
2024-04-23 23:36:30,307 INFO     [-] awx.main.wsrelay Active instance with hostname awx-task-99dc479f7-szzdp is registered.
2024-04-23 23:36:30,313 INFO     [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'awx-task-99dc479f7-szzdp']
double free or corruption (out)
double free or corruption (out)

After downgrading to 2.15 it starts to work again.
I have awx deployed on rke2 1.27 stable version. Awx is using an external Postgres 15.1 version

AWX version

2.43.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

Ubuntu 22.04

Web browser

Safari

Steps to reproduce

Upgrade to Awx-oprator 2.16.0 and awx 2.43.0

Expected results

Awx to start correctly

Actual results

Awx-task pod in crash

Additional information

No response

@ryantking
Copy link

I think it has to do with this bug: https://www.postgresql.org/message-id/[email protected]

After some poking around I was able to find that the segfault occurs when creating the psycopg connection, and confirmed that its a known bug in this stackoverflow: https://stackoverflow.com/questions/77619281/double-free-or-corruption-out-from-psycopg2-connect-using-python-3-11-6

@Nyralei
Copy link

Nyralei commented Apr 24, 2024

Observed the same

@gundalow
Copy link
Contributor

@ilbarone87 @Nyralei thank you for reporting this and the troubleshooting you've done already.

We will look at this as a priority.

@gundalow gundalow pinned this issue Apr 24, 2024
@gundalow gundalow changed the title Awx 2.43.0 update breaks deployment AWX v24.3.0 update breaks deployment (psycopg connection issue?) Apr 24, 2024
@ilbarone87
Copy link
Author

ilbarone87 commented Apr 24, 2024

If this can help, have digged a little bit more into the issue and postgres was throwing this error:

{
  "@timestamp": [
    "2024-04-23T21:26:31.527Z"
  ],
  "container": [
    "postgres"
  ],
  "message": [
    "{\"level\":\"info\",\"ts\":\"2024-04-23T21:26:31Z\",\"logger\":\"postgres\",\"msg\":\"record\",\"logging_pod\":\"postgres-cluster-awx-1\",\"record\":{\"log_time\":\"2024-04-23 21:26:31.526 UTC\",\"process_id\":\"956926\",\"connection_from\":\"10.42.1.189:38644\",\"session_id\":\"66282787.e99fe\",\"session_line_num\":\"1\",\"session_start_time\":\"2024-04-23 21:26:31 UTC\",\"transaction_id\":\"0\",\"error_severity\":\"FATAL\",\"sql_state_code\":\"28000\",\"message\":\"no PostgreSQL user name specified in startup packet\",\"backend_type\":\"not initialized\",\"query_id\":\"0\"}}"
  ],
  "namespace": [
    "cnpg-system"
  ]

@dmzoneill
Copy link
Member

dmzoneill commented Apr 24, 2024

Hi folks,
So I have reviewed the source on postgresql website for the latest releases, and they are patched in relation to the linked bug. That's said we are still trying to confirm the source of the issue you have indicated.

As for downgrading awx-operator to 2.15 to resolve this issue, I can't see any difference between 2.15 and 2.16 would cause this issue. I suspect we are missing part of the story here.
https://gist.github.com/dmzoneill/497746f38c5786c96e8859f1131667af

What version of AWX did you upgrade from?

I'm using

bash-5.1$ rpm -qa | grep post
postgresql-private-libs-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-server-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-contrib-15.2-1.module_el9+264+92dde3f0.x86_64
bash-5.1$ pip3.11 list | grep cop
psycopg                     3.1.18
bash-5.1$ python3.11 --version
Python 3.11.7

If you could provide more specific versioning, that may help, thank you

@Nyralei
Copy link

Nyralei commented Apr 24, 2024

@dmzoneill
I think what @ilbarone87 actually means by downgrading to 2.15 version of awx-operator is that by downgrading operator awx image tag changes to 24.2.0 too. So the issue lies in awx 24.3.0 itself and not operator.
I observe the same error - "double free or corruption (out)" when changing image_version to 24.3.0 from 24.2.0.

@ilbarone87
Copy link
Author

@dmzoneill what @Nyralei is saying is correct, the problem is on the awx image 24.2.0, that's why I posted the issue here and not in the awx-operator repo. The mention of the operator version was just because they are bundled together.

@dmzoneill
Copy link
Member

Digging through the diff doesn't indicate a change this would cause this behaviour.
If you can continue to provide more info, that would be great.
thanks

git diff 24.2.0 24.3.0

@mensler
Copy link

mensler commented Apr 24, 2024

We have the same problem with 23.9.0 and the docker-compose setup, re-installed yesterday.

My guess is that is has to do with the base image, where OpenSSL got updated to 3.2.1 🤔

@kurokobo
Copy link
Contributor

This difference?

$ docker run --rm -it quay.io/ansible/awx:24.2.0 dnf list installed openssl*
Installed Packages
openssl.x86_64                             1:3.0.7-27.el9                        @baseos
openssl-libs.x86_64                        1:3.0.7-27.el9                        @System

$ docker run --rm -it quay.io/ansible/awx:24.3.0 dnf list installed openssl*
Installed Packages
openssl.x86_64                             1:3.2.1-1.el9                        @baseos
openssl-libs.x86_64                        1:3.2.1-1.el9                        @System

@fosterseth
Copy link
Member

quay.io/fosterseth/awx:openssl307

here is an image based on awx v24.3.0 but with a downgraded openssl

if someone wants to test that image out, that would be useful feedback

@Nyralei
Copy link

Nyralei commented Apr 24, 2024

@fosterseth It works, but now I encounter error with migrations mentioned in issue #15137

@fosterseth
Copy link
Member

@Nyralei thanks for quickly testing that.

I updated that image to pull in Alan's commit, if you want to re-pull and deploy again

fosterseth added a commit to fosterseth/awx that referenced this issue Apr 25, 2024
openssl 3.2.0 has incompatiblity issues with
the libpq version we are using, and causes
some C runtime errors:
"double free or corruption (out)"

see awx issue ansible#15136

also this issue

github.com/conan-io/conan-center-index/pull/22615

once the libpq libraries on centos stream9 are
updated with the patch, we can unpin openssl

Signed-off-by: Seth Foster <[email protected]>
thedoubl3j pushed a commit that referenced this issue Apr 25, 2024
openssl 3.2.0 has incompatiblity issues with
the libpq version we are using, and causes
some C runtime errors:
"double free or corruption (out)"

see awx issue #15136

also this issue

github.com/conan-io/conan-center-index/pull/22615

once the libpq libraries on centos stream9 are
updated with the patch, we can unpin openssl

Signed-off-by: Seth Foster <[email protected]>
@Nyralei
Copy link

Nyralei commented May 2, 2024

@fosterseth Updating to 24.3.1 fixed the issue

@ilbarone87
Copy link
Author

Solved for me too. Will close the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants