__errno_location crash after call to __cxa_guard_acquire #2268

MagnusS · 2024-08-28T09:58:06Z

When enabling SMP the initial call to RNG::get().init() in apic_revenant crashes. RNG::get().init() returns a static variable that is initialised by the first call to get() (here). After stepping through the code I see that the crash is caused by a surrounding call to __cxa_guard_acquire that protects the initialisation of the static variable, which then calls a syscall that returns ENOSYS. This value is written to __errno_location by musl, which then triggers a CPU exception.

There seems to be multiple issues here:

__cxa_guard_acquire is added by the compiler to protect the initialisation of static variables at runtime (such as in RNG). The implementation is here. Depending on whether we have threads enabled in LIBCXX it will call various locking mechanisms in libc/musl, which have to be initialised first. If this happens early in the boot process, the kernel/musl may be unable to handle them. I'm not sure what the best approach to fix this would be, but ideally we shouldn't use code that relied on this feature until we get far enough in the boot process.

__errno_location isn't set up properly, at least not on other cores than core 0, and syscalls will write to the memory location it points to. In combination with the __cxa_guard_acquire issue above, this may happen unexpectedly. After a call to syscall, musl's syscall_ret writes errno to the location stored in __errno_location function from the pthread struct. The struct is retrieved from %fs:0 here. It looks like TLS may not be properly set up, there's a function in IncludeOS to do it here -- but it doesn't look like it's called from apic_revenant.

I'm not sure why this has worked previously, but perhaps the new toolchain/build system happens to make errno_val point outside valid memory, so we now get a crash instead of potential instability? Could this have been causing issues such as #2252?

To reproduce, first enable SMP, then run nix-shell --argstr unikernel test/kernel/integration/smp --run ./test.py

(this is on v0.16.0-release branch)

The text was updated successfully, but these errors were encountered:

fwsGonzo · 2024-08-28T10:13:57Z

You can compile a module with -fno-threadsafe-statics to solve the constructor issue.

MagnusS · 2024-08-28T10:57:57Z

You can compile a module with -fno-threadsafe-statics to solve the constructor issue.

I guess that could at least avoid the calls during the boot process, we just have to be careful about potential races.

@fwsGonzo do you know why TLS may not be initialised properly? Was it removed at some point?

fwsGonzo · 2024-08-28T12:20:40Z

Sadly, I don't remember - just keep in mind that global construction order is one of the big fiascos of C++:

https://en.cppreference.com/w/cpp/language/siof

Maybe the RNG can be initialized lazily. BSS will at least be zero, and since we have to re-seed it occasionally, just making it have to reseed on first use may be a way forward?

alfreb · 2024-08-29T06:14:41Z

I like that idea of lazy RNG - we might even want to make it a plugin or driver. I think in our case the global construction order is not the main issue, but rather that we can't depend on any constructors being called until libc is fully initialized - and we now know that cxa_guard is one of the reasons. We do decide when to initialize, and can also initialize different translation units in the order we prefer, via the linker script trick of putting them in different sections. But yea, I think we need to extend the rules:

No syscalls can be allowed before libc is initialized
No dependency on static class members or global constructors until libc is initialized
No exceptions can be thrown before libc is initialized (I think they depend on global ctors, but definitely on allocation)

There may be more rules - and I don't think we can enforce them automatically - but it should be a small subset of the kernel features that needs to be ready before we can initialize libc, so we should make sure to clearly identify that. One way of doing it would be to only allow pure C, but I think we can use a pretty decent subset of C++, including PMR allocators.

MagnusS · 2024-09-04T11:47:06Z

After thinking a bit more about this I think it would make sense to compile the kernel with -fno-threadsafe-statics to avoid the lock guard for now. There's only a single user until we enable SMP (but then we have to lock manually). Perhaps the service could still be compiled with threadsafety on?

I added this in #2273 (see PR and commits for more details). We could change this if there's a better approach for delaying initialisation later.

fwsGonzo · 2024-09-04T11:55:18Z

No need to compile the entire OS with no-threadsafe-statics. You can use this in CMake:

set_source_files_properties(file.cpp PROPERTIES
	COMPILE_DEFINITIONS MYDEF=1)

And

set_source_files_properties(file.cpp
	PROPERTIES COMPILE_FLAGS -fno-builtin)

I think a lot of footguns have been removed when syscalls panic during boot, so being selective here is fine.

MagnusS · 2024-09-04T13:08:11Z

My reasoning for doing it for the whole kernel in the PR was that if we are disabling it manually for classes that do this anyway, then it's better to not have it on. But I don't have a very strong preference.

If we can turn it on for the service, do you think it would be beneficial to have it selectively in the kernel too? It wouldn't work for anything that is started before libc (and after that we don't have full futex).

MagnusS · 2024-09-07T16:53:09Z

This should be fixed now that #2273 has been merged as we no longer use thread safe statics and disallow syscalls until musl has been initialised.

MagnusS mentioned this issue Aug 28, 2024

Upgrade to latest musl 1.2.5 (via 1.1.24) #2251

Merged

MagnusS mentioned this issue Sep 4, 2024

Disallow syscalls before libc is initialised, remove all existing early boot syscalls #2273

Merged

MagnusS closed this as completed Sep 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

__errno_location crash after call to __cxa_guard_acquire #2268

__errno_location crash after call to __cxa_guard_acquire #2268

MagnusS commented Aug 28, 2024

fwsGonzo commented Aug 28, 2024

MagnusS commented Aug 28, 2024

fwsGonzo commented Aug 28, 2024

alfreb commented Aug 29, 2024

MagnusS commented Sep 4, 2024

fwsGonzo commented Sep 4, 2024 •

edited

Loading

MagnusS commented Sep 4, 2024

MagnusS commented Sep 7, 2024 •

edited

Loading

__errno_location crash after call to __cxa_guard_acquire #2268

__errno_location crash after call to __cxa_guard_acquire #2268

Comments

MagnusS commented Aug 28, 2024

fwsGonzo commented Aug 28, 2024

MagnusS commented Aug 28, 2024

fwsGonzo commented Aug 28, 2024

alfreb commented Aug 29, 2024

MagnusS commented Sep 4, 2024

fwsGonzo commented Sep 4, 2024 • edited Loading

MagnusS commented Sep 4, 2024

MagnusS commented Sep 7, 2024 • edited Loading

fwsGonzo commented Sep 4, 2024 •

edited

Loading

MagnusS commented Sep 7, 2024 •

edited

Loading