Skip to content

Panic hook can result in unexpected access to OCaml runtime from another thread #138

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Lupus opened this issue Jun 17, 2023 · 9 comments

Comments

@Lupus
Copy link
Contributor

Lupus commented Jun 17, 2023

I'm spawning some tokio task from my #[ocaml::func] Rust function, that is called from OCaml. The task itself does not call any OCaml code inside, but when a panic happens inside that task, I see surprising errors in valgrind before the whole thing segfaults:

==318905== Thread 4 tokio-runtime-w:
==318905== Invalid write of size 4
==318905==    at 0xB32DAA: caml_memprof_set_suspended (memprof.c:561)
==318905==    by 0xB2B21F: caml_fatal_uncaught_exception (printexc.c:149)
==318905==    by 0xB14EF7: caml_startup (startup_nat.c:174)
==318905==    by 0xB14EF7: caml_main (startup_nat.c:179)
==318905==    by 0x585E4B: main (main.c:37)
==318905==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

Tokio runtime worker thread trying to call caml_fatal_uncaught_exception, what could possibly go wrong 😂

I'm not yet sure on why this is happening, I see that ocaml-rs is setting up that panic hook in initial setup [1], probably rust takes the hook from the thread, that created tokio task?

[1] https://github.com/zshipko/ocaml-rs/blob/master/src/macros.rs#L62

@zshipko
Copy link
Owner

zshipko commented Jun 17, 2023

Interesting, there is certainly a lot that can go wrong with the combination of async, panics and OCaml exceptions!

I don't know anything about how tokio manages panics and am not even sure where to start looking into this issue. Does disabling the ocaml-rs panic hook get you a more helpful backtrace?

@Lupus
Copy link
Contributor Author

Lupus commented Jun 17, 2023

I'll try to come up with some repro later, chasing some other segfault at the moment 😵

@Lupus
Copy link
Contributor Author

Lupus commented Dec 17, 2024

A-a-and I stumbled upon this again...

==554404== Thread 2 olwti-tokio-0:
==554404== Invalid write of size 4
==554404==    at 0x1F0E04A: caml_memprof_set_suspended (memprof.c:561)
==554404==    by 0x1F064BF: caml_fatal_uncaught_exception (printexc.c:149)
==554404==    by 0x1EF0257: caml_raise (fail_nat.c:76)
==554404==    by 0x1EF0321: caml_raise_with_arg (fail_nat.c:103)
==554404==    by 0x1EF0482: caml_failwith_value (fail_nat.c:137)
==554404==    by 0x1781357: ocaml::error::Error::raise_failure (error.rs:137)
==554404==    by 0x1780541: ocaml::macros::initial_setup::{{closure}} (macros.rs:30)
==554404==    by 0x1CEF8E7: call<(&std::panic::PanicHookInfo), (dyn core::ops::function::Fn<(&std::panic::PanicHookInfo), Output=()> + core::marker::Send + core::marker::Sync), alloc::alloc::Global> (boxed.rs:2468)
==554404==    by 0x1CEF8E7: std::panicking::rust_panic_with_hook (panicking.rs:809)
==554404==    by 0x1CEF6A9: std::panicking::begin_panic_handler::{{closure}} (panicking.rs:674)
==554404==    by 0x1CED268: std::sys::backtrace::__rust_end_short_backtrace (backtrace.rs:170)
==554404==    by 0x1CEF33B: rust_begin_unwind (panicking.rs:665)
==554404==    by 0x11E736F: core::panicking::panic_fmt (panicking.rs:74)
==554404==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

Panic handlers are installed globally and affect all Rust threads, even those not currently (or even at all) running OCaml code, which leads to unsynchronized access to OCaml runtime, resulting in issues like in above valgrind report.

Ideally I think there should be a thread-local stack of guards, which must exist only when we're inside OCaml code, or in a context where raising OCaml exception would be appropriate behavior. Panic handler should apply its current behavior iff there is an active guard in the stack on current thread, and do nothing otherwise.

@Lupus
Copy link
Contributor Author

Lupus commented Dec 17, 2024

"no-panic-hook" feature seems not very helpful here, as I have to control all dependencies which use ocaml crate to ensure this feature is set there, otherwise some default will just enable panic hook and I would get random crashes from other Rust threads doing panics.

@Lupus
Copy link
Contributor Author

Lupus commented Dec 17, 2024

Quick idea: maybe runtime handle should include that panic hook guard, as where runtime handle is available, it is assumed that we're executing within OCaml context (called from OCaml via stubs or running on active OCaml domain thread).

Lupus added a commit to Lupus/ocaml-lwt-interop that referenced this issue Dec 17, 2024
@Lupus
Copy link
Contributor Author

Lupus commented Dec 17, 2024

Something along these lines maybe

use std::cell::Cell;
use std::panic;
use std::sync::Once;

thread_local! {
    static GUARD_COUNT: Cell<usize> = Cell::new(0);
}

static INIT: Once = Once::new();

pub struct OcamlPanicGuard;

impl OcamlPanicGuard {
    pub fn new() -> Self {
        INIT.call_once(|| {
            let original_hook = panic::take_hook();
            panic::set_hook(Box::new(move |panic_info| {
                if GUARD_COUNT.with(|count| count.get()) > 0 {
                    let err = panic_info.payload();
                    let msg = if let Some(s) = err.downcast_ref::<&str>() {
                        s.to_string()
                    } else if let Some(s) = err.downcast_ref::<String>() {
                        s.clone()
                    } else {
                        format!("{:?}", err)
                    };
                    ocaml::Error::raise_failure(&msg);
                } else {
                    original_hook(panic_info);
                }
            }));
        });

        GUARD_COUNT.with(|count| count.set(count.get() + 1));
        OcamlPanicGuard
    }
}

impl Drop for OcamlPanicGuard {
    fn drop(&mut self) {
        GUARD_COUNT.with(|count| count.set(count.get() - 1));
    }
}

@zshipko
Copy link
Owner

zshipko commented Dec 18, 2024

thanks for investigating this! i don't have a lot of time to work on a fix right now, but would be happy to review a PR if you come up with something.

@zshipko zshipko mentioned this issue Feb 5, 2025
@zshipko
Copy link
Owner

zshipko commented Feb 7, 2025

Something along these lines maybe

...

This was included in v1.2.0 (see #160) - let me know if that update solves the problem for you and I will close this issue!

@Lupus
Copy link
Contributor Author

Lupus commented Feb 23, 2025

Thanks for looking into this! Will definitely try this out, but can't promise to do it soon. We've disabled panic hook so far and mitigated the problem, but having panic hook is actually nice and useful inside the binding functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants