Panic hook can result in unexpected access to OCaml runtime from another thread #138

Lupus · 2023-06-17T12:07:12Z

I'm spawning some tokio task from my #[ocaml::func] Rust function, that is called from OCaml. The task itself does not call any OCaml code inside, but when a panic happens inside that task, I see surprising errors in valgrind before the whole thing segfaults:

==318905== Thread 4 tokio-runtime-w:
==318905== Invalid write of size 4
==318905==    at 0xB32DAA: caml_memprof_set_suspended (memprof.c:561)
==318905==    by 0xB2B21F: caml_fatal_uncaught_exception (printexc.c:149)
==318905==    by 0xB14EF7: caml_startup (startup_nat.c:174)
==318905==    by 0xB14EF7: caml_main (startup_nat.c:179)
==318905==    by 0x585E4B: main (main.c:37)
==318905==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

Tokio runtime worker thread trying to call caml_fatal_uncaught_exception, what could possibly go wrong 😂

I'm not yet sure on why this is happening, I see that ocaml-rs is setting up that panic hook in initial setup [1], probably rust takes the hook from the thread, that created tokio task?

[1] https://github.com/zshipko/ocaml-rs/blob/master/src/macros.rs#L62

The text was updated successfully, but these errors were encountered:

zshipko · 2023-06-17T14:55:54Z

Interesting, there is certainly a lot that can go wrong with the combination of async, panics and OCaml exceptions!

I don't know anything about how tokio manages panics and am not even sure where to start looking into this issue. Does disabling the ocaml-rs panic hook get you a more helpful backtrace?

Lupus · 2023-06-17T16:28:43Z

I'll try to come up with some repro later, chasing some other segfault at the moment 😵

Lupus · 2024-12-17T08:17:55Z

A-a-and I stumbled upon this again...

==554404== Thread 2 olwti-tokio-0:
==554404== Invalid write of size 4
==554404==    at 0x1F0E04A: caml_memprof_set_suspended (memprof.c:561)
==554404==    by 0x1F064BF: caml_fatal_uncaught_exception (printexc.c:149)
==554404==    by 0x1EF0257: caml_raise (fail_nat.c:76)
==554404==    by 0x1EF0321: caml_raise_with_arg (fail_nat.c:103)
==554404==    by 0x1EF0482: caml_failwith_value (fail_nat.c:137)
==554404==    by 0x1781357: ocaml::error::Error::raise_failure (error.rs:137)
==554404==    by 0x1780541: ocaml::macros::initial_setup::{{closure}} (macros.rs:30)
==554404==    by 0x1CEF8E7: call<(&std::panic::PanicHookInfo), (dyn core::ops::function::Fn<(&std::panic::PanicHookInfo), Output=()> + core::marker::Send + core::marker::Sync), alloc::alloc::Global> (boxed.rs:2468)
==554404==    by 0x1CEF8E7: std::panicking::rust_panic_with_hook (panicking.rs:809)
==554404==    by 0x1CEF6A9: std::panicking::begin_panic_handler::{{closure}} (panicking.rs:674)
==554404==    by 0x1CED268: std::sys::backtrace::__rust_end_short_backtrace (backtrace.rs:170)
==554404==    by 0x1CEF33B: rust_begin_unwind (panicking.rs:665)
==554404==    by 0x11E736F: core::panicking::panic_fmt (panicking.rs:74)
==554404==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

Panic handlers are installed globally and affect all Rust threads, even those not currently (or even at all) running OCaml code, which leads to unsynchronized access to OCaml runtime, resulting in issues like in above valgrind report.

Ideally I think there should be a thread-local stack of guards, which must exist only when we're inside OCaml code, or in a context where raising OCaml exception would be appropriate behavior. Panic handler should apply its current behavior iff there is an active guard in the stack on current thread, and do nothing otherwise.

Lupus · 2024-12-17T08:22:25Z

"no-panic-hook" feature seems not very helpful here, as I have to control all dependencies which use ocaml crate to ensure this feature is set there, otherwise some default will just enable panic hook and I would get random crashes from other Rust threads doing panics.

Lupus · 2024-12-17T08:35:27Z

Quick idea: maybe runtime handle should include that panic hook guard, as where runtime handle is available, it is assumed that we're executing within OCaml context (called from OCaml via stubs or running on active OCaml domain thread).

See zshipko/ocaml-rs#138

Lupus · 2024-12-17T09:12:17Z

Something along these lines maybe

use std::cell::Cell;
use std::panic;
use std::sync::Once;

thread_local! {
    static GUARD_COUNT: Cell<usize> = Cell::new(0);
}

static INIT: Once = Once::new();

pub struct OcamlPanicGuard;

impl OcamlPanicGuard {
    pub fn new() -> Self {
        INIT.call_once(|| {
            let original_hook = panic::take_hook();
            panic::set_hook(Box::new(move |panic_info| {
                if GUARD_COUNT.with(|count| count.get()) > 0 {
                    let err = panic_info.payload();
                    let msg = if let Some(s) = err.downcast_ref::<&str>() {
                        s.to_string()
                    } else if let Some(s) = err.downcast_ref::<String>() {
                        s.clone()
                    } else {
                        format!("{:?}", err)
                    };
                    ocaml::Error::raise_failure(&msg);
                } else {
                    original_hook(panic_info);
                }
            }));
        });

        GUARD_COUNT.with(|count| count.set(count.get() + 1));
        OcamlPanicGuard
    }
}

impl Drop for OcamlPanicGuard {
    fn drop(&mut self) {
        GUARD_COUNT.with(|count| count.set(count.get() - 1));
    }
}

zshipko · 2024-12-18T02:02:43Z

thanks for investigating this! i don't have a lot of time to work on a fix right now, but would be happy to review a PR if you come up with something.

zshipko · 2025-02-07T22:49:35Z

Something along these lines maybe

...

This was included in v1.2.0 (see #160) - let me know if that update solves the problem for you and I will close this issue!

Lupus · 2025-02-23T07:09:40Z

Thanks for looking into this! Will definitely try this out, but can't promise to do it soon. We've disabled panic hook so far and mitigated the problem, but having panic hook is actually nice and useful inside the binding functions.

Lupus added a commit to Lupus/ocaml-lwt-interop that referenced this issue Dec 17, 2024

Panic hook in ocaml-rs is considered harmful

504f1a3

See zshipko/ocaml-rs#138

zshipko mentioned this issue Feb 5, 2025

Update boxroot #160

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Panic hook can result in unexpected access to OCaml runtime from another thread #138

Panic hook can result in unexpected access to OCaml runtime from another thread #138

Lupus commented Jun 17, 2023 •

edited

Loading

zshipko commented Jun 17, 2023

Lupus commented Jun 17, 2023

Lupus commented Dec 17, 2024

Lupus commented Dec 17, 2024

Lupus commented Dec 17, 2024

Lupus commented Dec 17, 2024

zshipko commented Dec 18, 2024

zshipko commented Feb 7, 2025

Lupus commented Feb 23, 2025

Panic hook can result in unexpected access to OCaml runtime from another thread #138

Panic hook can result in unexpected access to OCaml runtime from another thread #138

Comments

Lupus commented Jun 17, 2023 • edited Loading

zshipko commented Jun 17, 2023

Lupus commented Jun 17, 2023

Lupus commented Dec 17, 2024

Lupus commented Dec 17, 2024

Lupus commented Dec 17, 2024

Lupus commented Dec 17, 2024

zshipko commented Dec 18, 2024

zshipko commented Feb 7, 2025

Lupus commented Feb 23, 2025

Lupus commented Jun 17, 2023 •

edited

Loading