Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rr hangs waiting on a process to exit for a process that has already exited when running Firefox #3727

Closed
asutherland opened this issue Apr 17, 2024 · 7 comments

Comments

@asutherland
Copy link

I've recently been experiencing a consistent problem where rr appears to hang when tracing Firefox through normal mach mochitest and mach run workflows on both a "AMD Ryzen Threadripper PRO 5975WX 32-Cores" with the zen workaround script having run where all rr tests passed (after a one-off failure of "#2040: x86/ptrace-32" when run under ctest -12 where it seemed like the Ubuntu crash reporter may have inserted itself into things, but which passed on re-run) as well as my previous "Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz" machine. Attempting to ask rr to terminate nicely via a SIGTERM resulted in a broken trace (rr replay -a was sad), as did the more extreme SIGKILL.

The process tree inevitably ends up looking like:

4251   │            |               |-rr(979526)-+-{TraceeAttention}(979527)                                                                                                                                          
4252   │            |               |            |-{compress data}(979529)                                                                                                                                            
4253   │            |               |            |-{compress data}(979530)                                                                                                                                            
4254   │            |               |            |-{compress data}(979531)                                                                                                                                            
4255   │            |               |            |-{compress data}(979532)                                                                                                                                            
4256   │            |               |            |-{compress data}(979533)                                                                                                                                            
4257   │            |               |            |-{compress data}(979534)                                                                                                                                            
4258   │            |               |            |-{compress data}(979535)                                                                                                                                            
4259   │            |               |            |-{compress data}(979536)                                                                                                                                            
4260   │            |               |            |-{compress events}(979528)                                                                                                                                          
4261   │            |               |            |-{compress mmaps}(979537)                                                                                                                                           
4262   │            |               |            `-{compress tasks}(979538)     

The problem disappeared when, under @khuey's direction I ran with RR_LOG=all:debug. I got 3 clean runs (rr terminated normally and rr replay -a could be used) consecutively in this case, so I switched to performing a debug log of Scheduler:all where I got a clean run but then started getting bad runs.

Under this mode, a steady state log spam of the following happens:

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1774968 is waiting to exit; checking status ...       
[Scheduler]   still blocked       
[Scheduler]   all tasks blocked, waiting for runnable (106019 total)
[Scheduler]   no child to wait for
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule    

I then captured a log where the sad process seems to be "1776030". The transition to the broken state looked like:

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776022 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   selecting task 1776022
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775990 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SIGNAL_DELIVERY: SIGSYS(det)
[Scheduler]   1776019 is blocked on SIGNAL_DELIVERY: SIGSYS(det); checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   selecting task 1776019
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler] Task event is SIGNAL_DELIVERY: SIGSYS(det)
[Scheduler]   1776019  was already stopped with status 0x9 (FATAL-SIGKILL)
[Scheduler]   Carrying on with task 1776019
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775990 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   all tasks blocked, waiting for runnable (138502 total)
[Scheduler]   1776022 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1776019 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1776018 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1776017 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1776016 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1776014 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler]     ... but it's dead
[Scheduler]   1775990 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775990 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   need to reschedule
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775990 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   all tasks blocked, waiting for runnable (138502 total)
[Scheduler]   no child to wait for
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   all tasks blocked, waiting for runnable (138501 total)
[Scheduler]   no child to wait for
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   all tasks blocked, waiting for runnable (138501 total)
[Scheduler]   no child to wait for

And then looking back to the log messages related to 1776030 that seemed to lead to the state are (chronologically older to newer):

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is blocked on SYSCALL: futex; checking status ...
[Scheduler] wait on 1776030 returns 0x857f (SYSCALL)
[Scheduler] Stopping 1776030
[Scheduler]   ready with status 0x857f (SYSCALL)
[Scheduler]   selecting task 1776030
[Scheduler] Switching from 1775990(Socket Process) to 1776030(ProfilerChild) (priority 0 to 0) at 210077
[Scheduler] Scheduling next task (PREVENT_SWITCH)
[Scheduler]   (1776030 is un-switchable at SYSCALL: futex)
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775974 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775978 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is (none) 
[Scheduler]   1776030 isn't blocked
[Scheduler]   Carrying on with task 1776030
[Scheduler] Starting 1776030
[Scheduler] Scheduling next task (PREVENT_SWITCH)
[Scheduler]   (1776030 is un-switchable at (none))
[Scheduler]   and running; waiting for state change 
[Scheduler] Stopping 1776030
[Scheduler]   new status is 0x7057f (PTRACE_EVENT_SECCOMP)
[Scheduler] Starting 1776030
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775974 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1775978 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler]   need to reschedule
[Scheduler] Switching from 1776030(ProfilerChild) to 1775990(Socket Process) (priority 0 to 0) at 210081
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is blocked on SYSCALL: futex; checking status ...
[Scheduler]   still blocked
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked

(that gets repeated a bunch), then

[Scheduler] Task event is SIGNAL_DELIVERY: SIGSYS(det)
[Scheduler]   1776019 is blocked on SIGNAL_DELIVERY: SIGSYS(det); checking status ...
[Scheduler]   still blocked
[Scheduler]   all tasks blocked, waiting for runnable (138508 total)
[Scheduler]   1776030 changed status to 0x9 (FATAL-SIGKILL)
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   need to reschedule
[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked
[Scheduler]   selecting task 1776016
[Scheduler] Switching from 1776030(???) to 1776016(BHMgr Monitor) (priority 0 to 0) at 210744

then it seems like it's forever

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1776030 is waiting to exit; checking status ...
[Scheduler]   still blocked

@khuey suggested I file an issue and that @rocallahan might have the most expertise for a problem like this.

@khuey
Copy link
Collaborator

khuey commented Apr 17, 2024

What does /proc/1776030/status (or the equivalent for your next run) show? What's /proc/1776030/stack?

@asutherland
Copy link
Author

In my current reproduction, 1810308 is reported as sad.

 $  cat /proc/1810308/status
cat: /proc/1810308/status: No such file or directory

pstree -lpt for the ./mach run invocation:

5331   │            |               |                    |-bash(712338)-+-python3(1810141)-+-python(1810199)
5332   │            |               |                    |              |                  |-rr(1810219)-+-{TraceeAttention}(1810220)
5333   │            |               |                    |              |                  |             |-{compress data}(1810222)
5334   │            |               |                    |              |                  |             |-{compress data}(1810223)
5335   │            |               |                    |              |                  |             |-{compress data}(1810224)
5336   │            |               |                    |              |                  |             |-{compress data}(1810225)
5337   │            |               |                    |              |                  |             |-{compress data}(1810226)
5338   │            |               |                    |              |                  |             |-{compress data}(1810227)
5339   │            |               |                    |              |                  |             |-{compress data}(1810228)
5340   │            |               |                    |              |                  |             |-{compress data}(1810229)
5341   │            |               |                    |              |                  |             |-{compress events}(1810221)
5342   │            |               |                    |              |                  |             |-{compress mmaps}(1810230)
5343   │            |               |                    |              |                  |             `-{compress tasks}(1810231)
5344   │            |               |                    |              |                  |-{glean.dispatche}(1810197)
5345   │            |               |                    |              |                  `-{python3}(1810195)

@khuey
Copy link
Collaborator

khuey commented Apr 17, 2024

Sorry, I should have been more clear. In your Scheduler:debug log it shows rr is spinning its wheels on a particular process. What are /proc/pid/status and /proc/pid/stack for that process?

@asutherland
Copy link
Author

asutherland commented Apr 17, 2024

I should have been more clear too ;) The process is gone. There is no such process in the system anymore. This is the current log that it's spinning on despite that:

[Scheduler] Task event is SYSCALL: futex
[Scheduler]   1810308 is waiting to exit; checking status ...       
[Scheduler]   still blocked       
[Scheduler]   all tasks blocked, waiting for runnable (108962 total)
[Scheduler]   no child to wait for
[Scheduler] Scheduling next task (ALLOW_SWITCH)
[Scheduler]   need to reschedule       

@khuey
Copy link
Collaborator

khuey commented Apr 17, 2024

Oh, the process has actually exited? That's exciting.

@asutherland
Copy link
Author

It turns out using RR_LOG=all:debug is not a reliable way to mitigate things and/or it makes things very, very slow.

The following YOLO patch seems to work sufficiently well for me and did not immediately set everything on fire[1]:

diff --git a/src/Scheduler.cc b/src/Scheduler.cc
index 42c9f42c..15498337 100644
--- a/src/Scheduler.cc
+++ b/src/Scheduler.cc
@@ -331,7 +331,7 @@ bool Scheduler::is_task_runnable(RecordTask* t, WaitAggregator& wait_aggregator,
     }
   }
 
-  if (t->waiting_for_ptrace_exit) {
+  if (t->waiting_for_ptrace_exit && !t->was_reaped()) {
     LOGM(debug) << "  " << t->tid << " is waiting to exit; checking status ...";
   } else if (t->is_stopped() || t->was_reaped()) {
     LOGM(debug) << "  " << t->tid << "  was already stopped with status " << t->status();

1: rr replay -a worked for me and ctest -j60 had all tests pass except for 654 - seccomp_desched, but then it passed on re-run via ctest --rerun-failed --output-on-failure. Some test did cause the ubuntu crash reporter to helpfully inform me that "cat" crashed, and that happened twice, with the second time coming after the re-run, but it does not keep happening on re-run, so presumably that was just other tests and the ubuntu crash reporter is not fast.

@asutherland
Copy link
Author

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants