rr hangs waiting on a process to exit for a process that has already exited when running Firefox #3727

asutherland · 2024-04-17T19:23:05Z

I've recently been experiencing a consistent problem where rr appears to hang when tracing Firefox through normal mach mochitest and mach run workflows on both a "AMD Ryzen Threadripper PRO 5975WX 32-Cores" with the zen workaround script having run where all rr tests passed (after a one-off failure of "#2040: x86/ptrace-32" when run under ctest -12 where it seemed like the Ubuntu crash reporter may have inserted itself into things, but which passed on re-run) as well as my previous "Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz" machine. Attempting to ask rr to terminate nicely via a SIGTERM resulted in a broken trace (rr replay -a was sad), as did the more extreme SIGKILL.

The process tree inevitably ends up looking like:

4251   │            |               |-rr(979526)-+-{TraceeAttention}(979527)                                                                                                                                          
4252   │            |               |            |-{compress data}(979529)                                                                                                                                            
4253   │            |               |            |-{compress data}(979530)                                                                                                                                            
4254   │            |               |            |-{compress data}(979531)                                                                                                                                            
4255   │            |               |            |-{compress data}(979532)                                                                                                                                            
4256   │            |               |            |-{compress data}(979533)                                                                                                                                            
4257   │            |               |            |-{compress data}(979534)                                                                                                                                            
4258   │            |               |            |-{compress data}(979535)                                                                                                                                            
4259   │            |               |            |-{compress data}(979536)                                                                                                                                            
4260   │            |               |            |-{compress events}(979528)                                                                                                                                          
4261   │            |               |            |-{compress mmaps}(979537)                                                                                                                                           
4262   │            |               |            `-{compress tasks}(979538)

The problem disappeared when, under @khuey's direction I ran with RR_LOG=all:debug. I got 3 clean runs (rr terminated normally and rr replay -a could be used) consecutively in this case, so I switched to performing a debug log of Scheduler:all where I got a clean run but then started getting bad runs.

Under this mode, a steady state log spam of the following happens:

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1774968 is waiting to exit; checking status ...       
[Scheduler]   still blocked       
[Scheduler]   all tasks blocked, waiting for runnable (106019 total)
[Scheduler]   no child to wait for
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule

I then captured a log where the sad process seems to be "1776030". The transition to the broken state looked like:

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776022 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   selecting task 1776022
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775990 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SIGNAL_DELIVERY: SIGSYS(det)
[Scheduler]   1776019 is blocked on SIGNAL_DELIVERY: SIGSYS(det); checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   selecting task 1776019
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler] Task event is SIGNAL_DELIVERY: SIGSYS(det)
[Scheduler]   1776019  was already stopped with status 0x9 (FATAL-SIGKILL)
[Scheduler]   Carrying on with task 1776019
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775990 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   all tasks blocked, waiting for runnable (138502 total)
[Scheduler]   1776022 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1776019 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1776018 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1776017 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1776016 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1776014 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1775990 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775990 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   need to reschedule
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775990 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   all tasks blocked, waiting for runnable (138502 total)
[Scheduler]   no child to wait for
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   all tasks blocked, waiting for runnable (138501 total)
[Scheduler]   no child to wait for
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   all tasks blocked, waiting for runnable (138501 total)
[Scheduler]   no child to wait for

And then looking back to the log messages related to 1776030 that seemed to lead to the state are (chronologically older to newer):

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is blocked on SYSCALL: futex; checking status ...
[Scheduler] wait on 1776030 returns 0x857f (SYSCALL)
[Scheduler] Stopping 1776030
[Scheduler]   ready with status 0x857f (SYSCALL)
[Scheduler]   selecting task 1776030
[Scheduler] Switching from 1775990(Socket Process) to 1776030(ProfilerChild) (priority 0 to 0) at 210077
[Scheduler] Scheduling next task (PREVENT_SWITCH)
[Scheduler]   (1776030 is un-switchable at SYSCALL: futex)
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775974 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775978 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is (none) 
[Scheduler]   1776030 isn't blocked
[Scheduler]   Carrying on with task 1776030
[Scheduler] Starting 1776030
[Scheduler] Scheduling next task (PREVENT_SWITCH)
[Scheduler]   (1776030 is un-switchable at (none))
[Scheduler]   and running; waiting for state change 
[Scheduler] Stopping 1776030
[Scheduler]   new status is 0x7057f (PTRACE_EVENT_SECCOMP)
[Scheduler] Starting 1776030
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775974 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775978 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler]   need to reschedule

[Scheduler] Switching from 1776030(ProfilerChild) to 1775990(Socket Process) (priority 0 to 0) at 210081

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked

(that gets repeated a bunch), then

[Scheduler] Task event is SIGNAL_DELIVERY: SIGSYS(det)
[Scheduler]   1776019 is blocked on SIGNAL_DELIVERY: SIGSYS(det); checking status ...
[Scheduler]   still blocked
[Scheduler]   all tasks blocked, waiting for runnable (138508 total)
[Scheduler]   1776030 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   need to reschedule

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   selecting task 1776016
[Scheduler] Switching from 1776030(???) to 1776016(BHMgr Monitor) (priority 0 to 0) at 210744

then it seems like it's forever

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked

@khuey suggested I file an issue and that @rocallahan might have the most expertise for a problem like this.

The text was updated successfully, but these errors were encountered:

khuey · 2024-04-17T19:24:28Z

What does /proc/1776030/status (or the equivalent for your next run) show? What's /proc/1776030/stack?

asutherland · 2024-04-17T19:27:52Z

In my current reproduction, 1810308 is reported as sad.

 $  cat /proc/1810308/status
cat: /proc/1810308/status: No such file or directory

pstree -lpt for the ./mach run invocation:

5331   │            |               |                    |-bash(712338)-+-python3(1810141)-+-python(1810199)
5332   │            |               |                    |              |                  |-rr(1810219)-+-{TraceeAttention}(1810220)
5333   │            |               |                    |              |                  |             |-{compress data}(1810222)
5334   │            |               |                    |              |                  |             |-{compress data}(1810223)
5335   │            |               |                    |              |                  |             |-{compress data}(1810224)
5336   │            |               |                    |              |                  |             |-{compress data}(1810225)
5337   │            |               |                    |              |                  |             |-{compress data}(1810226)
5338   │            |               |                    |              |                  |             |-{compress data}(1810227)
5339   │            |               |                    |              |                  |             |-{compress data}(1810228)
5340   │            |               |                    |              |                  |             |-{compress data}(1810229)
5341   │            |               |                    |              |                  |             |-{compress events}(1810221)
5342   │            |               |                    |              |                  |             |-{compress mmaps}(1810230)
5343   │            |               |                    |              |                  |             `-{compress tasks}(1810231)
5344   │            |               |                    |              |                  |-{glean.dispatche}(1810197)
5345   │            |               |                    |              |                  `-{python3}(1810195)

khuey · 2024-04-17T19:30:27Z

Sorry, I should have been more clear. In your Scheduler:debug log it shows rr is spinning its wheels on a particular process. What are /proc/pid/status and /proc/pid/stack for that process?

asutherland · 2024-04-17T19:31:58Z

I should have been more clear too ;) The process is gone. There is no such process in the system anymore. This is the current log that it's spinning on despite that:

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1810308 is waiting to exit; checking status ...       
[Scheduler]   still blocked       
[Scheduler]   all tasks blocked, waiting for runnable (108962 total)
[Scheduler]   no child to wait for
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule

khuey · 2024-04-17T19:34:14Z

Oh, the process has actually exited? That's exciting.

asutherland · 2024-05-17T05:15:35Z

It turns out using RR_LOG=all:debug is not a reliable way to mitigate things and/or it makes things very, very slow.

The following YOLO patch seems to work sufficiently well for me and did not immediately set everything on fire[1]:

diff --git a/src/Scheduler.cc b/src/Scheduler.cc
index 42c9f42c..15498337 100644
--- a/src/Scheduler.cc
+++ b/src/Scheduler.cc
@@ -331,7 +331,7 @@ bool Scheduler::is_task_runnable(RecordTask* t, WaitAggregator& wait_aggregator,
     }
   }
 
-  if (t->waiting_for_ptrace_exit) {
+  if (t->waiting_for_ptrace_exit && !t->was_reaped()) {
     LOGM(debug) << "  " << t->tid << " is waiting to exit; checking status ...";
   } else if (t->is_stopped() || t->was_reaped()) {
     LOGM(debug) << "  " << t->tid << "  was already stopped with status " << t->status();

1: rr replay -a worked for me and ctest -j60 had all tests pass except for 654 - seccomp_desched, but then it passed on re-run via ctest --rerun-failed --output-on-failure. Some test did cause the ubuntu crash reporter to helpfully inform me that "cat" crashed, and that happened twice, with the second time coming after the re-run, but it does not keep happening on re-run, so presumably that was just other tests and the ubuntu crash reporter is not fast.

asutherland · 2024-05-18T20:09:10Z

Thank you!

rocallahan closed this as completed in 5bac165 May 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rr hangs waiting on a process to exit for a process that has already exited when running Firefox #3727

rr hangs waiting on a process to exit for a process that has already exited when running Firefox #3727

asutherland commented Apr 17, 2024

khuey commented Apr 17, 2024

asutherland commented Apr 17, 2024

khuey commented Apr 17, 2024

asutherland commented Apr 17, 2024 •

edited

khuey commented Apr 17, 2024

asutherland commented May 17, 2024

asutherland commented May 18, 2024

rr hangs waiting on a process to exit for a process that has already exited when running Firefox #3727

rr hangs waiting on a process to exit for a process that has already exited when running Firefox #3727

Comments

asutherland commented Apr 17, 2024

khuey commented Apr 17, 2024

asutherland commented Apr 17, 2024

khuey commented Apr 17, 2024

asutherland commented Apr 17, 2024 • edited

khuey commented Apr 17, 2024

asutherland commented May 17, 2024

asutherland commented May 18, 2024

asutherland commented Apr 17, 2024 •

edited