Skip to content

MDEV-36316/MDEV-36327/MDEV-36328 Msan debug innodb #4061

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 28, 2025

Conversation

grooverdan
Copy link
Member

  • The Jira issue number for this PR is: MDEV-36316/MDEV-36327/MDEV-36328

Description

MSAN + CMAKE_BUILD_TYPE=Debug has stack overrun issues

Just a POC at the moment. Some tests still failing:

main.subselect_sj2_jcl6                  w2 [ fail ]
        Test ended at 2025-05-23 10:34:57

CURRENT_TEST: main.subselect_sj2_jcl6
mysqltest: In included file "/source/mysql-test/main/subselect_sj2.test": 
included from /source/mysql-test/main/subselect_sj2_jcl6.test at line 18:
At line 893: query 'call p1()' failed: ER_STACK_OVERRUN_NEED_MORE (1436): Thread stack overrun:  444864 bytes used of a 458752 byte stack, and 16000 bytes needed. Consider increasing the thread_stack system variable.

Release Notes

non-user facing

How can this PR be tested?

CMAKE_BUILD_TYPE=Debug under MSAN like https://mariadb.com/kb/en/compile-and-using-mariadb-with-sanitizers-asan-ubsan-tsan-msan/ but without CMAKE_C,CXX_FLAGS=-O2

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ grooverdan
❌ dr-m
You have signed the CLA already but the status is still pending? Let us recheck it.

@svoj svoj added the MariaDB Foundation Pull requests created by MariaDB Foundation label May 23, 2025
grooverdan and others added 8 commits May 28, 2025 16:28
Clang ~16+ on MSAN became quite strict with uninitalized
data being passed and returned from functions. Non-debug builds
have a basic optimization that hides these from those builds

Two innodb cases violate the assumptions, however once inlined
with a basic optimization those that existed for uninitialized
values are removed.

(MDEV-36316) rec_set_bit_field_2 calling mach_read_from_2 hits a read of
bits it wasn't actually changing.

(MDEV-36327) The function dict_process_sys_columns_rec left
nth_v_col uninitialized unless it was a virtual column. This was
ok as the function i_s_sys_columns_fill_table also didn't read
this value unless it was a virtual column.
The stacksize was still too large. Back port from the 10.11 merge
commit 1c7209e
In original fix, commit 82d7419,
a 16k stack frame limit was imposed. Under the stack usage is doubled due
to MSAN. Debug builds without optimization can use more as well.

ASAN Debug builds also exceeded the 16k stack frame limit.

To keep some safety limit, a 64k limit is imposed to the compiler
under MSAN or ASAN with CMAKE_BUILD_TYPE=Debug.
Despite being included in the HAVE_valgrind define.

As such it's best differenciated from valgrind in the
server identifier as they have for the purposes a distinct
and different set of behaviours.

MSAN has its own set of test inclusions that that are different
from valgrind and such including "valgrind" in a server string that
gets tested for valgrind will incorrectly exclude some tests
that are suitable for MSAN but not valgrind.

There's a have_sanitizer system variable for exposing
the sanitizer being used so there's no need for
version verboseness.

Correct have_sanitizer system variable description to
include MSAN has been possible for a while.
Without this increase the mtr test case pre/post conditions will
fail as the stack usage has increased under MSAN with clang-20.1.

ASAN takes a 11M stack, however there was no obvious gain in MSAN
test success after 2M.

The resulting behaviour observed on smaller stack size was a
SEGV normally.

Hide the default stack size from the sysvar tests that expose
thread-stack as a variable with its default value.
…ructure

THD::reset_sub_statement_state and THD::restore_sub_staement_state
swap auto_inc_intervals_forced(Discrete_intervals_list) of a THD class
with a local variable temporary to execute other things before restoring
at the end of Table_triggers_list::process_triggers under a
rpl_master_erroneous_autoinc(true) condition as exposed by the
rpl.rpl_trigger test.

The uninitialized data isn't used and the only required action is to
copy the data in one direction. As the intent is for the auto_inc_intervals_forced
value to be overwritten or unused, MEM_UNDEFINED is used on it to
ensure the previous state is considered invalid.

The other uses of reset_sub_statement_state in Item_sp::execute_impl
also follow the same pattern of taking a copy to restore within the
same function.
Calling SetArrayOptions with Nodes[i - 1].Key as its nm argument
exposed a MemorySanitizer error when i=0 for the tests:
* connect.bson_udf
* connect.json_udf
* connect.json_udf_bin

Its assumed that a basic optimization would have eliminated
these invalid expressions.

As the nm argument was unused, it has been removed.
With MSAN the following test behavious where observed:
* funcs_1.myisam_views-big - normal big test for non-debug
* innodb_gis.rtree_purge - normal big test with MSAN
* main.alter_table_lock - very quick - unclear why disabled
* main.cte_recursive - slow on Debug only
* main.join_cache_notasan - special MSAN handing for returning OOM added
* main.sum_distinct-big - 90 seconds on non-debug - still big however
* maria.max_length - normal big test with MSAN
* perfschema.statement_digest_long_query - overflows stack on debug

Timingsi (on old memory constrained hardware):

non-debug:

funcs_1.myisam_views-big                 w2 [ pass ]  78564
innodb_gis.rtree_purge '16k'             w2 [ pass ]   5784
innodb_gis.rtree_purge '32k'             w2 [ pass ]   5242
innodb_gis.rtree_purge '4k'              w1 [ pass ]   8303
innodb_gis.rtree_purge '64k'             w1 [ pass ]   6348
innodb_gis.rtree_purge '8k'              w2 [ pass ]   5870
main.alter_table_lock                    w1 [ pass ]     41
main.cte_recursive                       w1 [ pass ]  15485
main.join_cache_notasan                  w1 [ pass ]     39
main.sum_distinct-big                    w2 [ pass ]  96256
maria.max_length                         w1 [ pass ]  92990
perfschema.statement_digest_long_query   w2 [ pass ]      8

debug:

funcs_1.myisam_views-big                 w1 [ skipped ]  Can't be run WITH_MSAN and CMAKE_BUILD_TYPE=Debug
innodb_gis.rtree_purge '16k'             w2 [ pass ]  109788
innodb_gis.rtree_purge '32k'             w2 [ pass ]  62361
innodb_gis.rtree_purge '4k'              w1 [ pass ]  89423
innodb_gis.rtree_purge '64k'             w1 [ pass ]  72082
innodb_gis.rtree_purge '8k'              w1 [ pass ]  98452
main.alter_table_lock                    w2 [ pass ]     38
main.cte_recursive                       w2 [ pass ]  180047
main.join_cache_notasan                  w1 [ pass ]    166
main.sum_distinct-big                    w1 [ skipped ]  Can't be run WITH_MSAN and CMAKE_BUILD_TYPE=Debug
maria.max_length                         w1 [ skipped ]  Can't be run WITH_MSAN and CMAKE_BUILD_TYPE=Debug
perfschema.statement_digest_long_query   w1 [ skipped ]  Can't be run WITH_MSAN and CMAKE_BUILD_TYPE=Debug
@grooverdan grooverdan force-pushed the msan_debug_innodb branch from 5f4b0fa to 676aea8 Compare May 28, 2025 06:35
@grooverdan grooverdan marked this pull request as ready for review May 28, 2025 06:35
The following mroonga functions had approaching 64k stack frames, so
exclude these:

* chunk_merge - ~60k
* buffer_merge - ~78k
* grn_ii_update_one - ~60k
@dr-m dr-m merged commit 88d35c5 into MariaDB:10.6 May 28, 2025
8 of 9 checks passed
@grooverdan grooverdan deleted the msan_debug_innodb branch May 28, 2025 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MariaDB Foundation Pull requests created by MariaDB Foundation
Development

Successfully merging this pull request may close these issues.

4 participants