Properly handle exit codes and grandchild processes with envkey-source #3
Comments
As another HN commenter shared, also see these (as examples, or as partner tools to recommend): |
I've been looking into this a bit and it seems like |
They will still be an issue. :) Consider this example: your process (PID1) execs a command (let's call that process PID2, though the actual ID could be different). And PID2 spawns a child process, PID3. Now, PID2 dies unexpectedly. Your It may actually be more challenging if call a Aren't POSIX semantics fun? :P |
Hmm ok, that mostly makes sense. Thanks for explaining. One thing I'm still not understanding is how I'd know when the watched process (or one of its children) spawns its own child process in order to increment the counter. |
Hm, I guess you could poll the process table, but you would certainly miss some short-lived processes (and those are pretty common). But you don't really need this information. Here's one simple solution. I'm not suggesting it's the best approach, but I don't think it's a wrong one. My hope here is to give you a little model you could sketch on a whiteboard, and wrap your head around this crazy legacy problem you inherited. :) Define a counter, protected by some isolation mechanism (a channel, a mutex, an atomic, etc.). Let's say by a mutex. So every counter operation requires a Lock/Unlock.
(That separate goroutine is your orphan catcher. Ideally it should be as reliable as possible, e.g. maybe a panic should be allowed to bring the whole process down.) The totally non-intuitive part is the "do nothing when waiting on own child." The point here is: you already accounted for that wait when you decremented the counter at spawn time. Effectively you said, "the main goroutine just spawned a child, and it's taking responsibility right now for waitpid()'ing exactly one child to balance the books." It follows that all the remaining waits, the ones executed by the "separate / orphan catcher" goroutine, will be for orphaned processes. Play it out on paper and see. :) If there are never any orphaned processes, the counter will vary between zero and various negative numbers, but will never become positive, because if you can assume there are no orphans, every SIGCHLD (increment) is associated with the spawn (decrement) that made it possible. It's been a long day, and I may have made a logical error here -- please reason this through with me -- but it looks right on a second read-through. Play the game a little, see what you think. :) |
I'm sorry. I should have known better than to try & be smart after a long day. :( There's a race condition (edit: really just a bug, not a race) in my algorithm. Please don't burn energy on it. I can give an example later if you're interested -- but for now, my strong advice is either to recommend a partner tool like tini, and punt on the issue; or look at how tini et al. implement it (or again the Rust article I shared). |
Oh really? I was following your logic and it seems quite good! I'm curious where the bug is. When calling Some examples here seem to be accomplishing the same thing in a few different ways: https://golang.hotexamples.com/examples/golang.org.x.sys.unix/-/Wait4/golang-wait4-function-examples.html |
That's the source of the problem actually. :) You can waitpid in many different ways. Two that matter here are: asking for an explicit PID to wait on, and asking for any child process that has completed. Reading the system-call manual page is helpful, e.g., https://linux.die.net/man/2/waitpid, see the comments about non-positive PID values, esp. waitpid(-1).
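The two waiting styles can be sketched in Go roughly as follows. This is only an illustration, not envkey-source code: `spawn`, `wait4`, `waitFor`, and `waitAny` are hypothetical names, and it assumes a Unix system with `sh` on the PATH. It calls `syscall.Wait4` directly rather than `cmd.Wait()` so that the PID argument (an explicit PID versus -1) is visible.

```go
package main

import (
	"os/exec"
	"syscall"
)

// spawn starts `sh -c script` and returns its PID without waiting on it.
func spawn(script string) (int, error) {
	cmd := exec.Command("sh", "-c", script)
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	return cmd.Process.Pid, nil
}

// wait4 blocks until a matching child exits, retrying if the call is
// interrupted (EINTR). It returns the reaped PID and its exit status.
func wait4(pid int) (int, int, error) {
	var ws syscall.WaitStatus
	for {
		got, err := syscall.Wait4(pid, &ws, 0, nil)
		if err == syscall.EINTR {
			continue
		}
		if err != nil {
			return 0, 0, err
		}
		return got, ws.ExitStatus(), nil
	}
}

// waitFor asks for one explicit PID.
func waitFor(pid int) (int, int, error) { return wait4(pid) }

// waitAny asks for whichever child has completed, i.e. waitpid(-1).
func waitAny() (int, int, error) { return wait4(-1) }
```

These helpers are meant to be called from your own `main`; with a single outstanding child, `waitAny` returns that child's PID, which is exactly what makes it able to "steal" a child from an explicit-PID waiter.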
The bug / race condition in my algorithm is because we are mixing the two methods. The counter method is workable when all waits are of the "don't care which" variety. You're just counting signals, and emitting the right number of waits. But when you mix this with a concurrent process (goroutine) that is doing explicit-PID waits, then under the right circumstances, this sequence of events can happen:
But PID 99 has already been reaped. So the waitpid(99) in the final step will raise an ECHILD error (child doesn't exist). And now our bookkeeping has been upset. For your own program's purposes, you may need to maintain some control over when your child processes come to an end. If you can tolerate having another goroutine (the orphan catcher) doing all the waits, even on your own child processes -- i.e., you never call
As you can see it's literally just counting SIGCHLDs and executing that number of waitpid(-1) calls. But. This will all be WAY simpler if you let there be another process above yours, doing all this work. :) The dumb-init / tini pattern is a smart one. I worry that you're going to lose focus on your product features by getting lost in obscure process-management issues. |
Thanks @gmfawcett, I'll take your advice and suggest one of these tools for the time being. |
As described in this comment: https://news.ycombinator.com/item?id=30858713
envkey-source needs to properly handle SIGCHLD signals on Unix to avoid potential zombie processes.