Crashes when unwinding the stack from a signal handler interrupting deallocation #189

Open
mbautin opened this issue Jun 16, 2023 · 1 comment

mbautin commented Jun 16, 2023

After we upgraded the YugabyteDB codebase from the Gperftools tcmalloc to this version, we started seeing crashes like the following:

(lldb) target create "tests-util/debug-util-test" --core "core.92253"
Core file '/home/mbautin/code/yugabyte-db4/build/latest/core.92253' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'debug-util-test', stop reason = signal SIGSEGV
  * frame #0: 0x00007fd65a231acf libgcc_s.so.1`uw_frame_state_for + 1055
    frame #1: 0x00007fd65a233758 libgcc_s.so.1`_Unwind_Backtrace + 104
    frame #2: 0x00007fd65a577c56 libc.so.6`__backtrace + 102
    frame #3: 0x00007fd65c07c5a5 libyb_util.so`yb::StackTrace::Collect(this=0x00007fd653da4120, skip_frames=2) at debug-util.cc:433:17
    frame #4: 0x00007fd65c274385 libyb_util.so`yb::(anonymous namespace)::HandleStackTraceSignal(signum=12) at stack_trace.cc:183:15
    frame #5: 0x00007fd65a48ab20 libc.so.6`__restore_rt
    frame #6: 0x000055894739b5c8 debug-util-test`TcmallocSlab_Internal_PopBatch_trampoline
(lldb) bt
* thread #1, name = 'debug-util-test', stop reason = signal SIGSEGV
  * frame #0: 0x00007f8ce502aacf libgcc_s.so.1`uw_frame_state_for + 1055
    frame #1: 0x00007f8ce502c758 libgcc_s.so.1`_Unwind_Backtrace + 104
    frame #2: 0x00007f8ce5370c56 libc.so.6`__backtrace + 102
    frame #3: 0x00007f8ce6e755a5 libyb_util.so`yb::StackTrace::Collect(this=0x00007f8cdfb9f0e0, skip_frames=2) at debug-util.cc:433:17
    frame #4: 0x00007f8ce706d385 libyb_util.so`yb::(anonymous namespace)::HandleStackTraceSignal(signum=12) at stack_trace.cc:183:15
    frame #5: 0x00007f8ce5283b20 libc.so.6`__restore_rt
    frame #6: 0x000055f6fb79a40d debug-util-test`tcmalloc_internal_tls_fetch_pic + 77
    frame #7: 0x000055f6fb75639c debug-util-test`tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<tcmalloc::tcmalloc_internal::cpu_cache_internal::StaticForwarder>::Overflow(void*, unsigned long, int) + 252
    frame #8: 0x000055f6fb737c62 debug-util-test`operator delete(void*) + 1122
    frame #9: 0x000055f6fb6e712c debug-util-test`yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody(this=0x000024fd7eaa3650)::Entry::~Entry() at debug-util-test.cc:345:9
    frame #10: 0x000055f6fb6ebf75 debug-util-test`yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody(this=0x000024fd7fcab590)::$_0::operator()() const at debug-util-test.cc:385:11
    frame #11: 0x000055f6fb6ebd1f debug-util-test`void yb::TestThreadHolder::AddThreadFunctor<yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0>(this=0x000024fd7fcab588)::$_0 const&)::'lambda'()::operator()() const at test_thread_holder.h:62:7
    frame #12: 0x000055f6fb6ebcb5 debug-util-test`decltype(__f=0x000024fd7fcab588)::$_0>()()) std::__1::__invoke[abi:v160003]<void yb::TestThreadHolder::AddThreadFunctor<yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0>(yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0 const&)::'lambda'()>(yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0&&) at invoke.h:394:23
    frame #13: 0x000055f6fb6ebc8d debug-util-test`void std::__1::__thread_execute[abi:v160003]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void yb::TestThreadHolder::AddThreadFunctor<yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0>(yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0 const&)::'lambda'()>(__t=0x000024fd7fcab580, (null)=__tuple_indices<> @ 0x00007f8cdfba05a8)::$_0, void yb::TestThreadHolder::AddThreadFunctor<yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0>(yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0 const&)::'lambda'()>&, std::__1::__tuple_indices<>) at thread:282:5
    frame #14: 0x000055f6fb6ebab2 debug-util-test`void* std::__1::__thread_proxy[abi:v160003]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void yb::TestThreadHolder::AddThreadFunctor<yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0>(yb::DebugUtilTest_TestStackTraceSignalDuringAllocation_Test::TestBody()::$_0 const&)::'lambda'()>>(__vp=0x000024fd7fcab580) at thread:293:5
    frame #15: 0x00007f8ce56021cf libpthread.so.0`start_thread + 239
    frame #16: 0x00007f8ce526edd3 libc.so.6`__clone + 67

We have a stack trace dump facility that sends a signal to threads, causing each of them to capture its own stack. This is done using the Linux backtrace function, which uses libunwind internally. We did not have any problems with this approach with the Gperftools tcmalloc, but with this tcmalloc we get segmentation faults when tcmalloc code is interrupted inside functions such as tcmalloc_internal_tls_fetch_pic or TcmallocSlab_Internal_PopBatch_trampoline. We have a unit test that reliably reproduces this situation: it creates a few threads that allocate objects and pass them to other threads for deallocation, while the main thread repeatedly tries to dump the stacks of those worker threads.
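For readers unfamiliar with this pattern, here is a minimal sketch of a signal-driven stack capture of the kind described above. It is not the YugabyteDB implementation. Only the handler name HandleStackTraceSignal and the signal number (12, which is SIGUSR2 on x86_64 Linux) come from the backtraces. The namespace, the global buffer, InstallHandler, and CaptureStackOf are hypothetical names chosen for illustration, and the sketch assumes only one capture is in flight at a time.

```cpp
#include <execinfo.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>

#include <atomic>
#include <cstring>

namespace sketch_capture {

constexpr int kMaxFrames = 64;

// Hypothetical shared buffer: one capture at a time, for simplicity.
void* g_frames[kMaxFrames];
std::atomic<int> g_num_frames{-1};

// Runs on the interrupted thread, on top of whatever code was executing
// when the signal arrived (possibly allocator internals).
void HandleStackTraceSignal(int /*signum*/) {
  int n = backtrace(g_frames, kMaxFrames);
  g_num_frames.store(n, std::memory_order_release);
}

bool InstallHandler() {
  // Warm-up call outside the handler: glibc loads libgcc lazily on the first
  // backtrace() call, and that load may allocate memory.
  void* warmup[1];
  backtrace(warmup, 1);

  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = &HandleStackTraceSignal;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  return sigaction(SIGUSR2, &sa, nullptr) == 0;
}

// Asks `target` to capture its own stack and waits for the handler to finish.
// Returns the number of frames captured, or -1 if the signal could not be sent.
int CaptureStackOf(pthread_t target) {
  g_num_frames.store(-1, std::memory_order_release);
  if (pthread_kill(target, SIGUSR2) != 0) return -1;
  while (g_num_frames.load(std::memory_order_acquire) < 0) {
    sched_yield();
  }
  return g_num_frames.load(std::memory_order_acquire);
}

}  // namespace sketch_capture
```

The crashes above occur inside the backtrace call in the handler when the interrupted frame belongs to one of the tcmalloc functions listed.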

As far as I know, the libunwind backtrace facility is async-signal-safe and is suitable for use in a signal handler. We are currently using LLVM 15's version of libunwind.

Has anyone else encountered this issue, and is there a known workaround?

mbautin commented Jun 16, 2023

We are currently using this fork of tcmalloc: https://github.com/yugabyte/tcmalloc/tree/e116a66-yb (based on commit e116a66 with some build-related changes).

mbautin added a commit to yugabyte/yugabyte-db that referenced this issue Jun 23, 2023
Summary:
When capturing a stack trace from a signal handler, the process could crash if the thread receiving the signal was in the middle of a memory allocation or deallocation. Google TCMalloc issue: google/tcmalloc#189.

In this diff, we use the IsCurThreadInAllocDealloc malloc extension API that we added in yugabyte/tcmalloc@677ba2d to skip capturing the stack trace when the signal interrupted a thread that is currently allocating or deallocating memory. In such cases, we produce an empty stack trace, which is later omitted from the overall thread dump. #17889 is a follow-up issue for retrying stack trace collection in such cases.
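The guard described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the diff. The real IsCurThreadInAllocDealloc lives in the yugabyte/tcmalloc fork and its signature is not shown here, so the sketch substitutes a hypothetical thread-local counter (tl_in_alloc_dealloc) and a hypothetical RAII marker. The StackTrace struct and CollectIfSafe are likewise illustrative names.

```cpp
#include <execinfo.h>

#include <csignal>

namespace sketch_guard {

// Hypothetical stand-in for the allocator's own bookkeeping: nonzero while
// the current thread is inside an allocation or deallocation.
thread_local volatile std::sig_atomic_t tl_in_alloc_dealloc = 0;

// Hypothetical RAII marker an allocator could place around its entry points.
struct ScopedAllocDeallocMarker {
  ScopedAllocDeallocMarker() { tl_in_alloc_dealloc = tl_in_alloc_dealloc + 1; }
  ~ScopedAllocDeallocMarker() { tl_in_alloc_dealloc = tl_in_alloc_dealloc - 1; }
};

// Stand-in for the fork's IsCurThreadInAllocDealloc (assumed semantics only).
bool IsCurThreadInAllocDeallocStandIn() { return tl_in_alloc_dealloc != 0; }

constexpr int kMaxFrames = 64;

struct StackTrace {
  void* frames[kMaxFrames];
  int num_frames = 0;
};

// Intended to be called from the signal handler. Produces an empty trace when
// the interrupted thread was inside the allocator, so unwinding is skipped.
void CollectIfSafe(StackTrace* out) {
  if (IsCurThreadInAllocDeallocStandIn()) {
    out->num_frames = 0;  // empty trace; the caller omits it from the dump
    return;
  }
  out->num_frames = backtrace(out->frames, kMaxFrames);
}

}  // namespace sketch_guard
```

The trade-off is that a thread interrupted inside the allocator contributes no trace to that dump, which is why a retry is tracked as a follow-up.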

Another change in the TCMalloc version that we are upgrading to is yugabyte/tcmalloc@d1b0e69 (adding an option to not seed the lifetime profiler with live allocations). We now set seed_with_live_allocs to false when capturing an allocation profile.

Test Plan: Jenkins

Reviewers: asrivastava

Reviewed By: asrivastava

Subscribers: ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D26349