Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supervisor/gen spawn options #9256

Closed

Conversation

NelsonVides
Copy link
Contributor

This comes from my thoughts on https://erlangforums.com/t/passing-gen-servers-start-opts-to-supervisors/4325.

Supervisors are too gen behaviours, and we can also plan for them to for example go into hibernation when appropriate, but currently they don't expose passing start options. One example is for SSL connections, where I can set to hibernate the two SSL connection workers if I know they'll be long-lived but not specially active, but the dynamic supervisor doesn't hibernate and therefore will likely never GC until it dies.

Copy link
Contributor

github-actions bot commented Jan 2, 2025

CT Test Results

    2 files     96 suites   1h 8m 53s ⏱️
2 174 tests 2 126 ✅ 48 💤 0 ❌
2 537 runs  2 487 ✅ 50 💤 0 ❌

Results for commit 2096c7d.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@IngelaAndin
Copy link
Contributor

I think this is unnecessary as the tls_dyn_connection_sup is a one_for_all-supervisor which means that it will hibernate.

From supervisor.erl:

init_children(State, StartSpec) ->
    SupName = State#state.name,
    case check_startspec(StartSpec, State#state.auto_shutdown) of
        {ok, Children} ->
            case start_children(Children, SupName) of
                {ok, NChildren} ->
                    %% Static supervisor are not expected to
                    %% have much work to do so hibernate them
                    %% to improve memory handling.
                    {ok, State#state{children = NChildren}, hibernate};
                {error, NChildren, Reason} ->
                    _ = terminate_children(NChildren, SupName),
                    {stop, {shutdown, Reason}}
            end;
        Error ->
            {stop, {start_spec, Error}}
    end.

init_dynamic(State, [StartSpec]) ->
    case check_startspec([StartSpec], State#state.auto_shutdown) of
        {ok, Children} ->
            %% Simple one for one supervisors are expected to
            %% have many children coming and going so do not
            %% hibernate.
	    {ok, dyn_init(State#state{children = Children})};
        Error ->
            {stop, {start_spec, Error}}
    end;

@IngelaAndin IngelaAndin closed this Jan 3, 2025
@NelsonVides
Copy link
Contributor Author

I think this is unnecessary as the tls_dyn_connection_sup is a one_for_all-supervisor which means that it will hibernate.

From supervisor.erl:

init_children(State, StartSpec) ->
    SupName = State#state.name,
    case check_startspec(StartSpec, State#state.auto_shutdown) of
        {ok, Children} ->
            case start_children(Children, SupName) of
                {ok, NChildren} ->
                    %% Static supervisor are not expected to
                    %% have much work to do so hibernate them
                    %% to improve memory handling.
                    {ok, State#state{children = NChildren}, hibernate};
                {error, NChildren, Reason} ->
                    _ = terminate_children(NChildren, SupName),
                    {stop, {shutdown, Reason}}
            end;
        Error ->
            {stop, {start_spec, Error}}
    end.

init_dynamic(State, [StartSpec]) ->
    case check_startspec([StartSpec], State#state.auto_shutdown) of
        {ok, Children} ->
            %% Simple one for one supervisors are expected to
            %% have many children coming and going so do not
            %% hibernate.
	    {ok, dyn_init(State#state{children = Children})};
        Error ->
            {stop, {start_spec, Error}}
    end;

@IngelaAndin good hint, but that doesn't apply here, it will only happen after the initial init. In this case, tls_dyn_connection_sup is started with no children (

init(_) ->
SupFlags = #{strategy => one_for_all,
auto_shutdown => any_significant,
intensity => 0,
period => 3600
},
ChildSpecs = [],
{ok, {SupFlags, ChildSpecs}}.
), and then the children are added dynamically (
try start_dyn_connection_sup(IsErlDist) of
{ok, DynSup} ->
case tls_dyn_connection_sup:start_child(DynSup, sender, SenderOpts) of
{ok, Sender} ->
Args = [Role, Sender | ReceiverOpts],
case tls_dyn_connection_sup:start_child(DynSup, receiver, Args) of
), hence taking the supervisor out of hibernation and never sending it back.

This I also saw by tracing, as described in the post on the forums, the supervisor was indeed not hibernating.

It might actually be a good change to repeat this hibernation logic on all places where a static supervisor finishes (re-)starting children, but that might be a lot more complex change considering the many places and different strategies to restart children, but might also be a better option, wdyt?

Overall one of the two hibernation strategies would be a really good move for supervisors, either the one available in this PR or the one proposed above. I can work on it if you have any preference.

@IngelaAndin
Copy link
Contributor

@NelsonVides This will no longer be true when #9231 is merged.

@NelsonVides
Copy link
Contributor Author

@NelsonVides This will no longer be true when #9231 is merged.

Oooooooooooh, right, I didn't see that PR! Amazing, really amazing! 🤩

@NelsonVides NelsonVides deleted the supervisor/gen_spawn_options branch January 3, 2025 11:50
@NelsonVides
Copy link
Contributor Author

@IngelaAndin though, wait... when you do which_child it'll wake the process from hibernation and it won't send it back again 🤔

@IngelaAndin
Copy link
Contributor

Huum .. I do not know how much garbage that will generate but I guess the it will be a lot less than start up. Will have to think about it.

@NelsonVides
Copy link
Contributor Author

NelsonVides commented Jan 3, 2025

Huum .. I do not know how much garbage that will generate but I guess the it will be a lot less than start up. Will have to think about it.

So in general the issue is with supervisors that are pretty much entirely inactive, however never go to hibernation again. I see three solutions:

  1. Allow the user to provide a hibernation timeout flag as I tried to do in this PR:
    1. Gives the user a lot of control;
    2. Though it'd also introduces some API changes.
  2. Repeat the heuristic strategy from that init call where for all wake-ups on a static supervisor it goes to hibernate immediately again:
    1. Small diff, no API changes: just adding a hibernate flag on all handle_* callback returns where we know it is a static supervisor.
    2. On the edge-case when we do want to interact with a supervisor quickly for a bunch of operations, like here for getting a child immediately after startup, it has the waste of hibernating to immediately wake up the process again. The first option would have avoided that wasted hibernation.
  3. Add a call to manually tell the supervisor to go to sleep.
    1. Boringly simple and very fine-grained.
    2. Maybe ugly?

My thoughts. I have some availability to work on any of them :)

@IngelaAndin
Copy link
Contributor

My thinking right now is that the ssl use case for which_child/2 is going to bee quite a common use case. So maybe it would be nice if there was a new optional flag to supervisor that would make it possible to configure an idle timeout for the static supervisors, so that the application can configure that timeout to an appropriate value so that the hibernation will occur after initial calls to which_child/2. And maybe even occur later too if woken up later but then probably becomes idle again. Keeping the default to hibernate right away as is what happens now. @Maria-12648430, @u3s, @dgud

@Maria-12648430
Copy link
Contributor

Maria-12648430 commented Jan 6, 2025

Hi all, happy new year :)

@IngelaAndin (Re-)hibernating after some idle timeout seems to me a good way to go, yes. This could even be allowed for simple_one_for_one supervisors, with the default being infinity instead of 0; the option to go into hibernation is probably pretty useless given the nature of such supervisors, but there is nothing to say that it should not be allowed.

I may be mistaken, but at a quick skim, gen_servers hibernate_after option seems to do all of what we want, and supervisors being gen_servers, we could just pass it through? That is, unless we want to provide for different values for hibernate-after-startup and hibernate-on-idle; in that case, we would have to go for a custom implementation in supervisor itself.

One somewhat aesthetic point is that we would have to put a second meaning on the existing start(SupName, Module, Args)/start_link(SupName, Module, Args) functions to accomodate for options: start(Module, Args, Options)/start_link(Module, Args, Options). Things like that are pretty cumbersome to document with the new documentation system.

@NelsonVides
Copy link
Contributor Author

Hey @Maria-12648430! 👋🏽

I think that being able to pass gen_server flags to the underlying gen_server would be the ideal solution, the user often plans for what it is expected a gen_server to do and pass spawn_opt flags or hibernation flags for example, and that's what I tried to do in this PR, though it indeed makes the start_link/3 a bit weirder unfortunately, but I think the change is worth it 🤔

@IngelaAndin
Copy link
Contributor

Yes I think it should be accomplished using the gen_server flags, although I do not think that the user of a supervisor should need to know that supervisors are gen_servers and I think think supervisors could have a configuration of an idle timeout. I think it feels like there should be an additional supervisor flag that could be called idle_timeout. It could default to infinity that is no timeout. Then static supervisors could do hibernation after init only if the timeout is infinity for backwards compatibility.

@NelsonVides
Copy link
Contributor Author

Fair enough. In that case, of the available gen_server flags, at least for now we only want to provide the idle timeout, right? I'd keep it named hibernate_after for consistency with other OTP behaviours, even if we're not explicitly exposing the gen_server implementation. Are we ok also with overloading start_link/3? If that's all right, I could quickly rework this existing PR to do just that.

@Maria-12648430
Copy link
Contributor

I think what @IngelaAndin meant was an additional option (?) That is, have one option to specify when to go into hibernation after supervisor start, and another to specify when the operating supervisor goes into hibernation when not having received a message for a while. This can't be done by just passing the hibernate_after option to the gen_server layer though...

@IngelaAndin
Copy link
Contributor

@Maria-12648430 No I mean I only want one option to the supervisor, additional supervisor flag, we can call it hibernate_after,
and if it set to an integer we will just behave as a normal idle timeout and not hibernate until the idle timeout occurs. But if hibernate_after has the default value infinity the static supervisors will behave as now and and immediately go into hibernation (which makes sense for a lot of static supervisor) and the simple_one_for_one's will not.

@IngelaAndin
Copy link
Contributor

IngelaAndin commented Jan 7, 2025

@NelsonVides No, I do not think we should overlad start_link/3, I want it to be configured in the supervisor.

Like:

ChildSpecs = [#{id      => undefined,
                restart => temporary,
                type    => supervisor,
                hibernate_after => 1000, 
                start   => {tls_dyn_connection_sup, start_link, []}}],

@NelsonVides
Copy link
Contributor Author

ChildSpecs = [#{id      => undefined,
                restart => temporary,
                type    => supervisor,
                hibernate_after => 1000, 
                start   => {tls_dyn_connection_sup, start_link, []}}],

Oh, so changing childspecs actually, that's more changes I'd have imagined. What about the case when the children is not a supervisor but a worker? Workers are of many types and have many APIs, so this flag would then make sense only for when the child is a supervisor, which is a new thing in these specs, all other flags are valid for both supervisors and workers, is that ok? 🤔

@IngelaAndin
Copy link
Contributor

@NelsonVides Yes good point, maybe it would make sense to extend the type argument {supervisor, HibernateAfter} ?! We need to think about it I will bring the question up in my team too. Will get back to you.

@Maria-12648430
Copy link
Contributor

@IngelaAndin Hm, in the child specs, really? 🤔

By intuition, I don't think that is a good way to go. It would mean that the supervisor has to know more about the child it wants to start than it actually should.
Also, how would the hibernation info be brought to the point where it is needed? I mean, if we just passed the hibernate_after option to the gen_server layer, it needs to get into the start function (and from there through the supervisor:start_link function it will probably call to the gen_server:start_link function that it in turn calls); if instead we handled hibernation timeouts in a custom fashion inside the supervisor code, it still needs to be brought into that process somehow.
And I'm not sure how this would play with custom supervisor implementations (IIRC ranch has such a thing, a process that behaves like a supervisor but is really a streamlined custom implementation with some extra features).
Finally, I don't know about changing the type option of a child spec. Isn't that used in release upgrades?

I think this belongs in the supervisor flags instead. Those concern the supervisor itself (vs its children), so there is no outside-of-the-module knowledge needed. Arguably, a supervisor knows best what it is for and how it should behave regarding hibernation. If it wants or needs to be told how to behave hibernation-wise by a parent supervisor or something else, that is possible to do via init arguments.


Roughly sketching this up in my mind (and assuming we put the option in the supervisor flags), it could look simple as this: We put the timeout into all returns from handle_call/_cast/_info, and we handle the timeout by going into hibernation. If the hibernation timeout option is absent, it defaults to 0 for static and infinity for dynamic supervisors.
(The current implementation is for static supervisors to go into hibernation after init, but not again when awakened afterwards. If we want to keep this behavior in case of the flag being absent, it would require only a little more work I think.)

@IngelaAndin
Copy link
Contributor

@Maria-12648430 yes maybe I was being too reactive here, and maybe hibernate_after could actually be applied to workers too if you really want, as they need to be well behaved OTP-processes that would be expected to handle hibernate_after at least for the common case and if you make a very customized one you could also be expected to know that hibernate_after will not work.

@NelsonVides
Copy link
Contributor Author

I think I agree with @Maria-12648430 here, adding this to the child_spec means the supervisor needs to keep track of more things, and, it needs to pass this information to the children in a way that the children need to understand: that means, all the custom supervisors and worker children out there in the world would need to extend their APIs to understand this new information, and the supervisor would need to check if the given children is compatible with one or another API... and then we one day want to add spawn_opt or a timeout flag or other new functionality and again the child_spec needs more extensions...

And the child_spec currently already has a way to pass information to its children: as given at their start key in the spec. This is something that I have already done in child_specs for workers, just setting something like start => {gen_server, start_link, [MyModuleImplementation, Args, [{hibernate_after, 0}]]}, and then it is already in the supervisor what to do with it, but the supervisor just doesn't know it yet.

If you wanted to keep that hibernate_after flag in the supervisor for SSL, we could do

Options = [{hibernate_after, 1000}],
ChildSpecs = [#{id      => undefined,
                restart => temporary,
                type    => supervisor,
                start   => {tls_dyn_connection_sup, start_link, [Options]}}],

and then in the implementation of tls_dyn_connection_sup:start_link/3, we could do

-module(tls_dyn_connection_sup).
start_link(Options) ->
    supervisor:start_link(?MODULE, [], Options).

So if we nevertheless need to expand the supervisor API to get the hibernate flag, it seems to me like we don't need to introduce a second mechanism in the child_spec to pass flags (which are just input data for the children really), we can just let child-specs keep passing the init arguments, and in that case we as a user can decide whether to pass hibernate arguments to children we know will understand the parameter and to not pass them to children that won't, leaving that decision to the supervisor's client code, and not to the internal supervisor implementation.

@Maria-12648430
Copy link
Contributor

@IngelaAndin not sure we're talking about the same thing 😅 I made a quick proof of concept here, to show what I have in mind: Maria-12648430#1

@IngelaAndin
Copy link
Contributor

IngelaAndin commented Jan 8, 2025

@Maria-12648430 and @NelsonVides Have not had time yet to check, but I am starting to feel that maybe I wrote to much gen_statem code and to little gen_server code lately and I think that this is part of the confusion. I will check your suggestion tomorrow. I am starting to feel that "over-load" start_link for supervisor could be ok. I will see if I can also merge my PR using which_child/2 tomorrow.

@Maria-12648430
Copy link
Contributor

@IngelaAndin, @NelsonVides

I had a deeper look into the related parts of the code (including @NelsonVides' suggestions in this PR), and this is what I think now.

Comparing @NelsonVides' changes with what I drafted, the difference is that his approach passes the hibernate_after (and other) options to the gen_server layer and lets it handle it "transparently", whereas mine (based on @IngelaAndin's suggestion) takes it from the supervisor flags and requires a custom implementation in the supervisor module leveraging timeouts. Both have merits IMO.

One downside of @NelsonVides' approach however is that the option can only be set from the "outside", ie from the starting process, by passing the hibernate_after option to gen_server:start_link via supervisor:start_link. This then raises the question about how we are going to select a good default: by default, static supervisors should hibernate instantly, dynamic supervisors should never hibernate; however, the supervisor strategy is specified in the init result, ie when the process has already been started, ie past the one point where hibernate_after could be set. Since there is likely a start function in the specific supervisor implementation, ie in the same module, this may be easy to mitigate. However, this is circumstantial, based on best practices, ie not guaranteed.

Finally, this also means that the hibernation value can not be changed (AFAICS) during a release upgrade.

A downside of my drafted approach is, apart from requiring a custom hibernate-after handling implementation, that it relies on yet another message (timeout) flying around to make it work.


It may be possible to combine both approaches, like this: Do not pass the hibernate_after option to the gen_server layer, instead filter it out and pass it to the supervisor layer, where it is then handled "my way". A question that should be considered here is what takes precedence if hibernate_after is both given in the start options and present in the supervisor flags.


On a kinda side note, going into instant hibernation after/in init (by returning {..., hibernate}) doesn't seem to be the best of solutions, even for static supervisors. One of the started children may send a message to the parent supervisor (eg, which_children) in a gen_server:handle_continue (or gen_statem equivalent). The supervisor would go into hibernation only to be instantly awakened again.

@IngelaAndin
Copy link
Contributor

@Maria-12648430, @NelsonVides

Ok, so in my world static supervisors are mostly very inactive processes that are started at application startup and that in turn start up their worker processes that may be more supervisors and some actual workers. Then the supervisors hopefully have nothing to do until the application is stopped when they are involved stopping everything as gracefully as possible. So this of course is not entirely true as there might occur intermediate problems and process that will be restarted and soft-code upgrades and the
occasional calls to supervisor API functions (mostly which_childen) . But the current behavior of hibernating the static supervisor is under the assumption that this will be true for most of these supervisors and that it will help save memory, as garbage generated when starting children is collected and heaps are reduced. Actually it is ironic that this did not apply to the tls_connection_sup as it was one of the processes that we want this to happen for. However it was reported to to us that the hibernation of static supervisor change did cause a significant reduction of over all memory usage.

So now anyway with supervisor:which_children/2 I guess it will be more common that a hibernated supervisor is immediately woke up, so this is a problem that I think we would like to avoid. Which is also motivated for some of the other cases you mentioned.

So I guess we are free to change how the static supervisor chooses to hibernate but we do still need this to happen for most of them without extra configuration. So not really reaching any conclusion yet, just trying to contribute to the considerations we have to make.

Having a an extra idle-timout for a process that will be mostly hibernated is probably ok, but not so much for a busy process that a simple_one_for_one-supervisor normaly is.

@Maria-12648430
Copy link
Contributor

Maria-12648430 commented Jan 9, 2025

@IngelaAndin Well... I think we have reached a point where we need a decision from you (and/or the OTP team) to progress further.

Let me sum up the options we discussed until now:

  • let hibernate_after be a start option, pass it (together with possible other options) up to the gen_server layer and let this handle it; this includes the introduction of supervisor:start_link/4 and overloading start_link/3 (@NelsonVides' proposal, probably simplest of all)
  • let hibernate_after be a supervisor flag and handle it specifically with timeouts in the supervisor layer (proposal from my proof of concept); if a timeout needs to be passed in, it can be done via an argument to supervisor:init
  • a crossover of both proposals, ie let hibernate_after be both a start option (but don't pass it to the gen_server but to the supervisor layer and handle it there with timeouts) and a supervisor flag; there needs to be a decision which one takes precedence if both are used
  • something entirely different
  • leave everything as is

Open questions:

  • should there be a single timeout that applies to both hibernation after start and being idle, or two (ie, one for each)? (If two, we have to go with "my" proposal or the crossover, because gen_server only accepts one)
  • should setting hibernate_after be allowed for simple_one_for_one supervisors? There is nothing speaking against allowing it technically, but it may turn out to be one of those "perfectly allowed bad ideas". (If we want to disallow it, we have to go with "my" proposal or the crossover because the supervisor strategy is known only from the return of supervisor:init, ie too late to pass in as a start option)
  • if we allow it to be set for dynamic supervisors, a similar question (with similar restrictions for selecting an approach) arises for selecting a default depending on the strategy (for static supervisors it should be 0, for dynamic supervisors it should be infinity)
  • in case no hibernate_timeout is set, should we keep complete backwards compatibility with regard to the current behavior that a static supervisor goes into hibernation right after init but when awakened never goes back into hibernation? Or would it instead be acceptable (or even better) to always go back into hibernation?

Re-reading the above, it looks like I'm biased towards "my" or a crossover implementation, even though I did my best to keep it down 🙀 Anyway, feel free to add whatever you think I forgot 😉

@Maria-12648430
Copy link
Contributor

Then the supervisors hopefully have nothing to do until the application is stopped...
... and the occasional calls to supervisor API functions (mostly which_childen)

Hm, even if they do happen occasionally, a supervisor will not go back into hibernation afterwards 🤔 With one_for_all and rest_for_one supervisors where the children are supposed to be (inter-)dependent and likely form a cooperating work unit, at least some of those children will want/need to know their siblings in order to talk to them, and the only way (outside of named processes) is to ask the supervisor, via which_child/2 or which_children/1 🤷‍♀️ This is my experience at least.

@IngelaAndin
Copy link
Contributor

Then the supervisors hopefully have nothing to do until the application is stopped...
... and the occasional calls to supervisor API functions (mostly which_childen)

Hm, even if they do happen occasionally, a supervisor will not go back into hibernation afterwards 🤔 With one_for_all and rest_for_one supervisors where the children are supposed to be (inter-)dependent and likely form a cooperating work unit, at least some of those children will want/need to know their siblings in order to talk to them, and the only way (outside of named processes) is to ask the supervisor, via which_child/2 or which_children/1 🤷‍♀️ This is my experience at least.

Yes I think this is the sane way to do it, especially since we got which_child/2. Mostly I have seen registered names used, which in some cases lead to really awful naming conventions and unexpected race-condition problems. The only other way I used was to start the children in a more manual fashion in order to be able to use supervisor function return as init-argument to second child. And this will only work for more dynamic parts of a supervisor tree, and I did not like it, I think it became ugly and need temporary processes to work properly.

@IngelaAndin
Copy link
Contributor

I do not think we need to be backwards compatible in the way that we only ever hibernate after init, that is an implementation detail. The thing I think is problematic is to remove it unless explicit configured as we probably would get complaints that memory usage has gone up. I will involve more team members also in discussion but is nice to have more people to discuss with :)

@Maria-12648430
Copy link
Contributor

I updated my PoC by removing the strict backwards compatibility. By default (ie, if no hibernate_after is configured), static supervisors will now go into instant hibernation after init as well as after each operation (like which_child/2), dynamic supervisors will never go into hibernation.

@NelsonVides
Copy link
Contributor Author

Hey there!

I was wondering, did anything come out of this discussion in the OTP team? @IngelaAndin

I find @Maria-12648430 draft pretty nice, code is clean, I just find unfortunate to have to duplicate the logic that gen_server already implements elegantly, but well 🤷🏽‍♂️ When it comes to the flag added to the supervisor config, I find it a lot better to add it to the sup_flags than to the child_spec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants