multi_extension fails on ubuntu 22.04 arm version #7933

Open
alperkocatas opened this issue Mar 19, 2025 · 3 comments · May be fixed by #7950

@alperkocatas
Collaborator

alperkocatas commented Mar 19, 2025

multi_extension fails on Ubuntu 22.04 (arm64) running in a UTM virtual machine on macOS (M1).

Here are the steps to reproduce the issue:

  • Clone the PostgreSQL source code, check out the REL_16_STABLE branch, then configure, make and make install.
  • Clone the Citus source code, check out the main branch, then configure, make and "make install-all".
  • cd src/test/regress
  • Run: pipenv run citus_tests/run_test.py multi_extension --use-base-schedule --use-whole-schedule-line

The test fails with the following output:

waiting for server to start.... done
server started
waiting for server to start.... done
server started
waiting for server to start.... done
server started
CREATE DATABASE
CREATE EXTENSION
CREATE FUNCTION
CREATE FOREIGN DATA WRAPPER
CREATE DATABASE
CREATE EXTENSION
CREATE FUNCTION
CREATE FOREIGN DATA WRAPPER
# using postmaster on localhost, port 57636
not ok 1     - multi_extension                          1052 ms
# (test process exited with exit code 2)
1..1
# 1 of 1 tests failed.
# The differences that caused some tests to fail can be viewed in the file "/home/alperkocatas/citus/src/test/regress/regression.diffs".
# A copy of the test summary that you see above is saved in the file "/home/alperkocatas/citus/src/test/regress/regression.out".
waiting for server to shut down.... done
server stopped
waiting for server to shut down.... done
server stopped
waiting for server to shut down.... done
server stopped
Failed in 2 seconds. 

Last lines from regression.diffs:

-RESET citus.enable_schema_based_sharding;
-DROP EXTENSION citus;
-CREATE EXTENSION citus;
-DROP TABLE version_mismatch_table;
-DROP TABLE  multi_extension.extension_basic_types;
-DROP SCHEMA multi_extension;
-ERROR:  cannot drop schema multi_extension because other objects depend on it
-DETAIL:  function multi_extension.print_extension_changes() depends on schema multi_extension
-HINT:  Use DROP ... CASCADE to drop the dependent objects too.
+SSL SYSCALL error: EOF detected
+connection to server was lost

Apparently, the server crashes during the test. Inspecting the stack trace from the resulting core dump gives the following:

libc.so.6!__pthread_kill_implementation(pthread_t threadid, int signo, int no_tid) (pthread_kill.c:44)
libc.so.6!__pthread_kill_internal(int signo, pthread_t threadid) (pthread_kill.c:78)
libc.so.6!__GI_raise(int sig) (raise.c:26)
libc.so.6!__GI_abort() (abort.c:79)
libc.so.6!__libc_message(enum __libc_message_action action, const char * fmt) (libc_fatal.c:156)
libc.so.6!__GI___fortify_fail(const char * msg) (fortify_fail.c:26)
libc.so.6!__stack_chk_fail() (stack_chk_fail.c:24)
citus.so!BuildCitusTableCacheEntry (Unknown Source:0)
citus.so!LookupCitusTableCacheEntry (Unknown Source:0)
citus.so!GetCitusTableCacheEntry (Unknown Source:0)
citus.so!InitializeTableCacheEntry (Unknown Source:0)
citus.so!LookupShardIdCacheEntry (Unknown Source:0)
citus.so!ShardPlacementList (Unknown Source:0)
citus.so!CreateSingleShardTableShardWithRoundRobinPolicy (Unknown Source:0)
citus.so!CreateSingleShardTableShard (Unknown Source:0)
citus.so!CreateCitusTable (Unknown Source:0)
citus.so!CreateSingleShardTable (Unknown Source:0)
citus.so!create_distributed_table (Unknown Source:0)
ExecInterpExpr (Unknown Source:0)
ExecInterpExprStillValid (Unknown Source:0)

Note: the same test can also fail sometimes on Ubuntu x86 (WSL running on Windows). However, the failure there is not as consistent as in the arm case. The stack trace is also slightly different in that case:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140561101195072) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140561101195072) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140561101195072, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007fd6f06f6476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fd6f06dc7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00005600a29daebe in ExceptionalCondition ()
#6  0x00007fd6ee728de5 in MaintenanceDaemonShmemExit () from /home/alperkocatas/oss-postgres-bin/16/lib/citus.so
#7  0x00005600a28503a4 in shmem_exit ()
#8  0x00005600a28504a9 in proc_exit_prepare ()
#9  0x00005600a285054b in proc_exit ()
#10 0x00007fd6ee72a290 in CitusMaintenanceDaemonMain () from /home/alperkocatas/oss-postgres-bin/16/lib/citus.so
#11 0x00005600a27d11a6 in StartBackgroundWorker ()
#12 0x00005600a27d7c4b in do_start_bgworker ()
#13 0x00005600a27d7dc5 in maybe_start_bgworkers ()
#14 0x00005600a27d86f0 in process_pm_pmsignal ()
#15 0x00005600a27d8c9b in ServerLoop ()
#16 0x00005600a27da23c in PostmasterMain ()
#17 0x00005600a26f10b5 in main ()
@onurctirtir
Member

Here is part of our chat with @thanodnl about the cause of this issue:

.. I have found a buffer overflow error in Citus caused by this line: https://github.com/citusdata/citus/pull/5314/files#diff-045f81e6097e6c11a1d01f091f8467aa0ef80204d62cbb357c4a6f67d4bf2f19R27 ..

.. overflow happens here: https://github.com/citusdata/citus/blob/main/src/backend/distributed/metadata/metadata_cache.c#L1736

And on arm64 it actually is a problem and gets reported as stack smashing ..
.. but dropping a column doesn't remove the column reference by its attribute number. Instead the column is marked as dropped in the catalog without removing any data (would require a table rewrite).

Then when you upgrade again the column gets added again, rinse and repeat.
It's not a big deal in production, as I think we have never downgraded anyone. However, we do this in the test suite quite often.

Now when the table gets read in the test suite after the column was dropped twice, the attribute count has gone up from 6 to 8, but we deform into an array of 6 items. That overruns the stack. ..
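
To make the arithmetic in the quote concrete, here is a toy model in Python. It is not Citus or PostgreSQL code: the `Catalog` class, the `deform` helper, the column names and the `COMPILE_TIME_NATTS` constant are all hypothetical stand-ins. It only illustrates the mechanism described above, assuming that a dropped column keeps its attribute number and a re-added column gets a new one. In C the undersized write would silently run past the array; Python raises `IndexError` instead, which the sketch uses to flag the overrun.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    dropped: bool = False


@dataclass
class Catalog:
    """Toy stand-in for a catalog table's tuple descriptor (hypothetical)."""
    attrs: list = field(default_factory=list)

    @property
    def natts(self):
        # Dropped columns still occupy an attribute number.
        return len(self.attrs)

    def add_column(self, name):
        # A re-added column always gets a fresh attribute number.
        self.attrs.append(Attribute(name))

    def drop_column(self, name):
        # Mark the live column as dropped; nothing is physically removed.
        for attr in self.attrs:
            if attr.name == name and not attr.dropped:
                attr.dropped = True
                return
        raise KeyError(name)


def deform(catalog, out):
    """Write one slot per attribute number into `out`, dropped or not."""
    for i, attr in enumerate(catalog.attrs):
        out[i] = None if attr.dropped else attr.name


COMPILE_TIME_NATTS = 6  # assumed size of the fixed array in the quote

catalog = Catalog()
for i in range(5):
    catalog.add_column(f"col{i}")
catalog.add_column("extra")  # the upgrade adds a 6th column

for _ in range(2):
    catalog.drop_column("extra")  # downgrade
    catalog.add_column("extra")   # upgrade again

# Fixed-size buffer: too small once natts has grown past 6.
fixed = [None] * COMPILE_TIME_NATTS
try:
    deform(catalog, fixed)
    overflowed = False
except IndexError:
    overflowed = True

# Buffer sized from the descriptor at run time: always large enough.
sized = [None] * catalog.natts
deform(catalog, sized)

print(catalog.natts, overflowed)
```

After two downgrade/upgrade cycles the model has 8 attribute slots but only 6 live columns, so any buffer sized from the compile-time column count is two slots short. Sizing the buffer from the descriptor's attribute count at run time avoids that in this model.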

@onurctirtir
Member

Note: the same test can also fail sometimes on Ubuntu x86 (WSL running on Windows). However, the failure there is not as consistent as in the arm case. The stack trace is also slightly different in that case:

This one is a separate issue that's documented in #5808.

@onurctirtir
Member

onurctirtir commented Mar 20, 2025

The actual issue documented here is the same as #7515, so I am closing the older one to keep this one at the top of the issue list.
