
avoid scheduling jobs on compute nodes that are not cleaned up #6616

Open · garlick wants to merge 14 commits into master
Conversation

@garlick (Member) commented Feb 7, 2025

Problem: after a broker restart, "stuck" housekeeping, epilog, prolog, or job shell tasks may still be running, but Flux is unaware of them. New work may then be scheduled on those nodes even though something may be wrong, or the leftover tasks may still be holding resources.

When those tasks run under systemd, we already have the machinery at hand to find and track them.

This PR does the following:

  • modifies sdbus so it can be loaded twice, once for the systemd user instance and once for the system instance
  • adds an sdmon monitoring module that tracks running flux units on both buses
  • has sdmon join an sdmon.idle broker group at startup, once it has verified that no units are running
  • modifies the resource module so it monitors sdmon.idle instead of broker.online when configured to use systemd

The net effect is that nodes requiring cleanup remain offline until that cleanup completes.
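
(Here "configured to use systemd" refers to the [systemd] enable key introduced later in this series; a minimal config sketch, with an illustrative file path:)

  # e.g. a drop-in under the system instance's conf.d (path illustrative)
  [systemd]
  enable = true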

sdmon also logs the systemd units it has found. Here's an example where I kill -9 a broker while housekeeping is running, then start it back up again:

[  +0.192069] sdmon.err[7]: [email protected] needs cleanup - resources are offline
...
[ +26.900866] sdmon.err[7]: cleanup complete - resources are online

Before cleanup is complete, flux resource status reports:

      STATE UP NNODES NODELIST
      avail  ✔      7 picl[0-6]
     avail*  ✗      1 picl7

This seems to do the bare minimum to resolve #6590.

It does seem a bit thin in the tooling department. The possibilities are pretty broad, so for now I wanted to get this posted and get feedback on the way the resource module is tied into sdmon using broker groups.

Problem: a comment has an extra "to" that makes the sentence incorrect.

Drop the extra word.

Problem: the timer used by sdbus_connect() is hard to modify because
of the embedded error handling.

Extract a function for building the user bus path for the error log.
Now the timer is a bit simpler.

Problem: the sdbus module is hardwired to connect to a systemd
user instance, but Flux now has "work" running in the systemd
system instance as well (prolog, epilog, housekeeping).

Add a "system" module option which directs sdbus to connect
to the systemd system instance instead.  Future commits will allow
a second instance of the sdbus module to be loaded with this option
so access to both systemd instances can be handled concurrently.
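
As a sketch, loading both instances might then look like this, combining the "system" option with the --name loading described in a later commit (the exact rc wiring is an assumption):

  flux module load sdbus                          # systemd user instance (default)
  flux module load --name sdbus-sys sdbus system  # systemd system instance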

Problem: the sdbus system option has no coverage.

Amend the 2407-sdbus.t test with a simple test of "system mode".

Problem: the sdbus module can only be loaded once because it
uses an explicit service name.

Drop the outdated MOD_NAME() symbol declaration.
Register methods in a way that lets the default service name change.
Update the self-contacting "subscribe" composite RPC to determine
the topic string to contact programmatically.

Now the module can be loaded as many times as we like using e.g.
  flux module load --name NEWNAME sdbus
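
As a rough sketch of what registering methods relative to the loaded name can look like: flux-core provides flux_msg_handler_addvec_ex() for registering a handler table under a caller-supplied service name, and a module's loaded name is available via the "flux::name" aux key. The callbacks and table contents here are hypothetical:

  #include <flux/core.h>

  /* Hypothetical method callbacks. */
  static void call_cb (flux_t *h, flux_msg_handler_t *mh,
                       const flux_msg_t *msg, void *arg)
  {
      flux_respond_error (h, msg, ENOSYS, NULL);
  }

  /* Topics are given relative to the service name, so a copy loaded as
   * "sdbus-sys" serves sdbus-sys.call, etc.  (Assumes addvec_ex()
   * prefixes each table topic with the service name.)
   */
  static const struct flux_msg_handler_spec htab[] = {
      { FLUX_MSGTYPE_REQUEST, "call", call_cb, 0 },
      FLUX_MSGHANDLER_TABLE_END,
  };

  int mod_main (flux_t *h, int argc, char **argv)
  {
      const char *name = flux_aux_get (h, "flux::name");
      flux_msg_handler_t **handlers = NULL;
      int rc;

      if (!name || flux_msg_handler_addvec_ex (h, name, htab, NULL,
                                               &handlers) < 0)
          return -1;
      rc = flux_reactor_run (flux_get_reactor (h), 0);
      flux_msg_handler_delvec (handlers);
      return rc;
  }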

Problem: there are no tests for loading sdbus under a different name.

Modify the system test to load sdbus under the name "sdbus-sys" in system
mode instead of reloading the module.  Show that it works for listing
units in the system instance.

Problem: when the system is configured to use systemd, sdbus is only
loaded for the systemd user instance.

Load sdbus-sys as well.

Problem: some libsdexec RPCs can now be directed to different
services to reach the systemd system or user instance.

Add a service parameter to the following functions:
  sdexec_list_units()
  sdexec_property_get()
  sdexec_property_get_all()
  sdexec_property_changed()

Update sdexec.
Update tests.
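
For illustration, a call site after this change might look like the following. Only the added service argument is given by the commit message; the parameter order, header path, and unit glob are assumptions:

  #include <flux/core.h>
  #include "src/common/libsdexec/list.h"   /* path per this repo */

  /* Sketch: list flux units in the systemd system instance by directing
   * the RPC at the "sdbus-sys" service.  Pattern is illustrative.
   */
  static flux_future_t *list_system_units (flux_t *h)
  {
      return sdexec_list_units (h, "sdbus-sys", FLUX_NODEID_ANY, "flux-*");
  }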

@grondo (Contributor) commented Feb 7, 2025

Wow, nice! I don't have any qualms with using an sdmon.idle broker group for the current implementation.

It seems like eventually we'd want various subsystems to be able to recapture their state from what sdmon has found (for example, the job execution system could reconnect to running jobs after restart or terminate jobs that are not supposed to be running, and the job manager could do something similar for prolog/epilog and housekeeping). Any thoughts on how that might work? I realize bringing that up is a bit premature, but it could inform the solution here as a stepping stone. (I guess one thought is that as state is able to be recaptured, this would reduce the list of things that prevent a broker from joining sdmon.idle.)

Also, since the sdmon.idle group is never left once joined (except in the case of a broker restart, I'm assuming?), I'm wondering if there's a better term. I can't think of anything though, and it doesn't seem important enough to worry about now (especially since it isn't really user/admin visible).

@garlick (Member, Author) commented Feb 7, 2025

Maybe sdmon.online is a better name for the group. (It started out as an idle group that was kept up to date and caused resource to post idle/busy events, but then I realized it could be a lot simpler and didn't revisit the group name.)

Problem: there is no mechanism to track systemd units across
a broker restart.

Add a broker module that creates and maintains a list of running
flux systemd units.

This monitors two instances of systemd:
- the user one, running as user flux (where jobs are run)
- the system one (where housekeeping, prolog, epilog run)

A list of units matching flux unit globs is requested at initialization,
and a subscription to property updates on those globs is obtained.
After the initial list, monitoring is driven solely by property updates.

Join the sdmon.online broker group once the node is demonstrably idle.
This lets the resource module on rank 0 notice compute nodes that need
cleanup at restart and withhold them from the scheduler.  Once the group
is joined, sdmon does not explicitly leave it.  It implicitly leaves the
group if sdmon is unloaded or the node goes offline/lost.

If there are running units at startup, log this information at LOG_ERR
level, and again when the units are cleaned up, e.g.

 [email protected] needs cleanup - resources are offline
 cleanup complete - resources are online

In the future, this module's role could be expanded to support tooling
for listing running work and obtaining runtime information such as
pids and cgroup resource parameters.  It could also play a role in
informing other flux components about work that should be re-attached
after a full or partial restart, when support for that is added.
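
A compressed sketch of the join step described above, assuming the broker groups service ("groups.join"); the context struct and function names are invented:

  #include <stdbool.h>
  #include <czmq.h>
  #include <flux/core.h>

  struct sdmon_ctx {           /* illustrative context */
      flux_t *h;
      zhashx_t *units;         /* running flux units, keyed by unit name */
      bool joined;
  };

  /* Sketch: join sdmon.online once no flux units remain running.
   * Called after the initial unit list and on each property update.
   * Response handling is elided.
   */
  static void sdmon_check_idle (struct sdmon_ctx *ctx)
  {
      flux_future_t *f;

      if (ctx->joined || zhashx_size (ctx->units) > 0)
          return;
      if (!(f = flux_rpc_pack (ctx->h, "groups.join", FLUX_NODEID_ANY, 0,
                               "{s:s}", "name", "sdmon.online")))
          flux_log_error (ctx->h, "groups.join");
      flux_future_destroy (f);
      ctx->joined = true;
  }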

Problem: the sdmon module is not loaded by default.

Load it if systemd.enable = true in the configuration.

Problem: the monitor subsystem of the resource module needs to
know whether the "sdmon.online" broker group will be populated.

Parse the enable key from [systemd].
Pass the whole resource_config struct to the monitor subsystem
instead of just monitor_force_up.
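
Parsing that key might look roughly like this, using the stock flux_conf_unpack() API; the resource_config struct and its field are hypothetical:

  #include <syslog.h>
  #include <flux/core.h>

  /* Sketch: read [systemd] enable, defaulting to false when absent. */
  static int parse_systemd_enable (flux_t *h, struct resource_config *config)
  {
      flux_error_t error;
      int enable = 0;

      if (flux_conf_unpack (flux_get_conf (h),
                            &error,
                            "{s?{s?b}}",
                            "systemd",
                              "enable", &enable) < 0) {
          flux_log (h, LOG_ERR, "error parsing [systemd]: %s", error.text);
          return -1;
      }
      config->systemd_enable = enable;
      return 0;
  }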

Problem: nodes are not checked for untracked running work when a
Flux instance starts up.

This might happen, for example, if
- job-exec deems job shell(s) unkillable
- housekeeping/prolog/epilog gets stuck on a hung file system
- the broker exits without proper shutdown

When systemd is enabled, the new sdmon module joins the 'sdmon.online'
broker group on startup.  However, if there are any running flux units,
this is delayed until those units are no longer running.

Change the resource module so that it monitors sdmon.online instead of
broker.online when systemd is enabled.  This will withhold "busy" nodes
from the scheduler until they become idle.

Fixes flux-framework#6590
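
The monitor-side switch can stay small; a sketch, assuming the config flag parsed in the previous commit:

  /* Sketch: gate scheduling on sdmon.online (online AND cleaned up)
   * when systemd is enabled, else on plain broker liveness.
   */
  const char *group = config->systemd_enable ? "sdmon.online"
                                             : "broker.online";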

Problem: there is no test coverage for the sdmon module.

Add a new sharness script.

@garlick (Member, Author) commented Feb 7, 2025

Renamed the group and fixed a spelling error in a test (caught in CI).

This still needs a test for the resource module portion of the proposed change, so I'll leave it WIP for the moment.

Problem: there is no test coverage for the resource module's
behavior when systemd is configured and sdmon is providing
sdmon.online.

Add a sharness script for that.

@garlick changed the title from "WIP avoid scheduling jobs on compute nodes that are not cleaned up" to "avoid scheduling jobs on compute nodes that are not cleaned up" on Feb 7, 2025

@garlick (Member, Author) commented Feb 8, 2025

I added the missing test, so I'll drop the WIP.

One thing I should do before we merge this, though, is make sure the systemd shipped with RHEL 8 allows sdbus to authenticate to it with flux credentials. I'll try to test that on fluke.

codecov bot commented Feb 8, 2025

Codecov Report

Attention: Patch coverage is 72.76265% with 70 lines in your changes missing coverage. Please review.

Project coverage is 79.52%. Comparing base (d9cde83) to head (f7bdb9d).
Report is 11 commits behind head on master.

Files with missing lines          Patch %   Lines missing
src/modules/sdmon/sdmon.c         71.87%    54 ⚠️
src/common/libsdexec/property.c   33.33%     6 ⚠️
src/modules/sdbus/sdbus.c         57.14%     6 ⚠️
src/modules/sdbus/connect.c       82.60%     4 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6616      +/-   ##
==========================================
+ Coverage   79.50%   79.52%   +0.02%     
==========================================
  Files         531      532       +1     
  Lines       88363    88597     +234     
==========================================
+ Hits        70251    70456     +205     
- Misses      18112    18141      +29     
Files with missing lines          Coverage Δ
src/common/libsdexec/list.c       85.71% <100.00%> (+85.71%) ⬆️
src/modules/resource/monitor.c    70.49% <100.00%> (+0.49%) ⬆️
src/modules/resource/resource.c   86.80% <100.00%> (+0.13%) ⬆️
src/modules/sdbus/main.c          69.23% <100.00%> (ø)
src/modules/sdbus/subscribe.c     70.00% <100.00%> (+7.50%) ⬆️
src/modules/sdexec/sdexec.c       70.87% <ø> (ø)
src/modules/sdbus/connect.c       77.27% <82.60%> (+9.27%) ⬆️
src/common/libsdexec/property.c   41.86% <33.33%> (+4.36%) ⬆️
src/modules/sdbus/sdbus.c         69.42% <57.14%> (-0.84%) ⬇️
src/modules/sdmon/sdmon.c         71.87% <71.87%> (ø)

... and 12 files with indirect coverage changes

@garlick (Member, Author) commented Feb 8, 2025

Yep, that worked:

2025-02-07T16:41:43.412549-08:00 sdbus.info[0]: sd_bus_open_system: connected

@garlick (Member, Author) commented Feb 8, 2025

> (I guess one thought is that as state is able to be recaptured, this would reduce the list of things that prevent a broker from joining sdmon.idle.)

Right, I like that way of thinking about it.

Hmm, we should also be trying to capture the state of any units that have completed but weren't reaped, and put that in a lost+found or something. I need to refresh my memory on what happens to that state for the cases we're discussing here (the templated system units and transient user units). That could be a follow-on PR.

But anyway, yeah, if a running unit can be reclaimed, we could let the node join sdmon.online before it terminates.
