ignore: parallel walker occasionally hangs if visitor panics #3009

cosmicexplorer · 2025-03-06T00:53:58Z

Please tick this box to confirm you have reviewed the above.

I have a different issue.

What version of ripgrep are you using?

master (de4baa1)

How did you install ripgrep?

I am able to reproduce this on master.

What operating system are you using ripgrep on?

Arch Linux (64-bit, x86):

> uname -a
Linux terrestrial-gamma-ray-flash 6.13.5-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Thu, 27 Feb 2025 18:09:27 +0000 x86_64 GNU/Linux

Describe your bug.

WalkParallel::visit() will nondeterministically hang if the ParallelVisitor::visit() implementation panics. This also occurs when providing a closure to WalkParallel::run().

What are the steps to reproduce the behavior?

I am working on a minimal fix to this right now (see https://github.com/BurntSushi/ripgrep/compare/master...cosmicexplorer:ripgrep:parallel-panic-quit?expand=1), but I think it will require a change to the parallelization methods we use from crossbeam, so I wanted to create this issue beforehand. I am currently able to reproduce this with the following:

diff --git a/crates/ignore/src/walk.rs b/crates/ignore/src/walk.rs
index d6ea9c2..4387fca 100644
--- a/crates/ignore/src/walk.rs
+++ b/crates/ignore/src/walk.rs
@@ -2221,6 +2221,18 @@ mod tests {
         assert!(!dents[0].path_is_symlink());
     }
 
+    #[test]
+    #[should_panic(expected = "oops!")]
+    fn panic_in_parallel() {
+        let td = tmpdir();
+        wfile(td.path().join("foo.txt"), "");
+
+        WalkBuilder::new(td.path())
+            .threads(40)
+            .build_parallel()
+            .run(|| Box::new(|_| panic!("oops!")));
+    }
+
     #[cfg(unix)] // because symlinks on windows are weird
     #[test]
     fn symlink_loop() {

Apply the above diff, then run the following command:

> cd crates/ignore
> timeout -k 2 1 cargo test --lib -- parallel

I typically have to run this many many times on my very fast laptop to get it to trigger. More reliably, you can run:

> cd crates/ignore
> cargo test

And running all the tests at once makes this trigger much more reliably.

What is the actual behavior?

The process hangs, typically somewhere in the Worker::get_work() method. If I add a print statement within this sub-loop:

ripgrep/crates/ignore/src/walk.rs

Lines 1695 to 1707 in de4baa1

    
           loop { 
        
               if let Some(v) = self.recv() { 
        
                   self.activate_worker(); 
        
                   value = Some(v); 
        
                   break; 
        
               } 
        
               // Our stack isn't blocking. Instead of burning the 
        
               // CPU waiting, we let the thread sleep for a bit. In 
        
               // general, this tends to only occur once the search is 
        
               // approaching termination. 
        
               let dur = std::time::Duration::from_millis(1); 
        
               std::thread::sleep(dur); 
        
           }

it will print this out endlessly.

This especially seems to occur more often if the panic handler is non-trivial: I first noticed this when using color-eyre, which prints the panic to the console after composing a backtrace.

What is the expected behavior?

The WalkParallel::visit() method should not hang, and should instead propagate the panic.

My first attempt to solve this (see cosmicexplorer@246c162) set the quit_now flag if any worker panics, and checked self.is_quit_now() within the sub-loop. This seems to have made it repro less often, but I am still able to get it to repro regardless, even without modifying the panic handler.

There are some other changes I want to propose to the ParallelVisitor{,Builder} traits as well which may make this easier (such as propagating unwind safety), but I'm currently hoping to solve this without making any changes to the external interface. This may require changing the parallelism primitives we use, but maybe not.

The text was updated successfully, but these errors were encountered:

cosmicexplorer linked a pull request Mar 6, 2025 that will close this issue

ignore: gracefully quit worker threads upon panic in ParallelVisitor #3010

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ignore: parallel walker occasionally hangs if visitor panics #3009

ignore: parallel walker occasionally hangs if visitor panics #3009

cosmicexplorer commented Mar 6, 2025 •

edited

Loading

ignore: parallel walker occasionally hangs if visitor panics #3009

ignore: parallel walker occasionally hangs if visitor panics #3009

Comments

cosmicexplorer commented Mar 6, 2025 • edited Loading

Please tick this box to confirm you have reviewed the above.

What version of ripgrep are you using?

How did you install ripgrep?

What operating system are you using ripgrep on?

Describe your bug.

What are the steps to reproduce the behavior?

What is the actual behavior?

What is the expected behavior?

cosmicexplorer commented Mar 6, 2025 •

edited

Loading