Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ignore: parallel walker occasionally hangs if visitor panics #3009

Open
1 task done
cosmicexplorer opened this issue Mar 6, 2025 · 0 comments · May be fixed by #3010
Open
1 task done

ignore: parallel walker occasionally hangs if visitor panics #3009

cosmicexplorer opened this issue Mar 6, 2025 · 0 comments · May be fixed by #3010

Comments

@cosmicexplorer
Copy link

cosmicexplorer commented Mar 6, 2025

Please tick this box to confirm you have reviewed the above.

  • I have a different issue.

What version of ripgrep are you using?

master (de4baa1)

How did you install ripgrep?

I am able to reproduce this on master.

What operating system are you using ripgrep on?

Arch Linux (64-bit, x86):

> uname -a
Linux terrestrial-gamma-ray-flash 6.13.5-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Thu, 27 Feb 2025 18:09:27 +0000 x86_64 GNU/Linux

Describe your bug.

WalkParallel::visit() will nondeterministically hang if the ParallelVisitor::visit() implementation panics. This also occurs when providing a closure to WalkParallel::run().

What are the steps to reproduce the behavior?

I am working on a minimal fix to this right now (see https://github.com/BurntSushi/ripgrep/compare/master...cosmicexplorer:ripgrep:parallel-panic-quit?expand=1), but I think it will require a change to the parallelization methods we use from crossbeam, so I wanted to create this issue beforehand. I am currently able to reproduce this with the following:

diff --git a/crates/ignore/src/walk.rs b/crates/ignore/src/walk.rs
index d6ea9c2..4387fca 100644
--- a/crates/ignore/src/walk.rs
+++ b/crates/ignore/src/walk.rs
@@ -2221,6 +2221,18 @@ mod tests {
         assert!(!dents[0].path_is_symlink());
     }
 
+    #[test]
+    #[should_panic(expected = "oops!")]
+    fn panic_in_parallel() {
+        let td = tmpdir();
+        wfile(td.path().join("foo.txt"), "");
+
+        WalkBuilder::new(td.path())
+            .threads(40)
+            .build_parallel()
+            .run(|| Box::new(|_| panic!("oops!")));
+    }
+
     #[cfg(unix)] // because symlinks on windows are weird
     #[test]
     fn symlink_loop() {

Apply the above diff, then run the following command:

> cd crates/ignore
> timeout -k 2 1 cargo test --lib -- parallel

I typically have to run this many many times on my very fast laptop to get it to trigger. More reliably, you can run:

> cd crates/ignore
> cargo test

And running all the tests at once makes this trigger much more reliably.

What is the actual behavior?

The process hangs, typically somewhere in the Worker::get_work() method. If I add a print statement within this sub-loop:

loop {
if let Some(v) = self.recv() {
self.activate_worker();
value = Some(v);
break;
}
// Our stack isn't blocking. Instead of burning the
// CPU waiting, we let the thread sleep for a bit. In
// general, this tends to only occur once the search is
// approaching termination.
let dur = std::time::Duration::from_millis(1);
std::thread::sleep(dur);
}
it will print this out endlessly.

This especially seems to occur more often if the panic handler is non-trivial: I first noticed this when using color-eyre, which prints the panic to the console after composing a backtrace.

What is the expected behavior?

The WalkParallel::visit() method should not hang, and should instead propagate the panic.

My first attempt to solve this (see cosmicexplorer@246c162) set the quit_now flag if any worker panics, and checked self.is_quit_now() within the sub-loop. This seems to have made it repro less often, but I am still able to get it to repro regardless, even without modifying the panic handler.

There are some other changes I want to propose to the ParallelVisitor{,Builder} traits as well which may make this easier (such as propagating unwind safety), but I'm currently hoping to solve this without making any changes to the external interface. This may require changing the parallelism primitives we use, but maybe not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant