
avoid scheduling jobs on compute nodes that are not cleaned up #6616

Open · garlick wants to merge 14 commits into master
Conversation

@garlick (Member) commented Feb 7, 2025

Problem: after a broker restart, "stuck" housekeeping, epilog, prolog, or job shell tasks may still be running, but Flux is unaware of them. New work may then be scheduled on those nodes even though something may be wrong, or the leftover tasks may still be holding resources.

When those tasks run under systemd, we already have the machinery at hand to find and track them.

This PR does the following:

  • modifies sdbus so it can be loaded twice, once for the systemd user instance and once for the system instance
  • adds an sdmon monitoring module that tracks running flux units on both buses
  • has sdmon join an sdmon.idle broker group at startup, once it has verified that no units are running
  • modifies the resource module so it monitors sdmon.idle instead of broker.online when configured to use systemd

The net effect is that nodes requiring cleanup remain offline until that cleanup completes.
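
(Here "configured to use systemd" refers to the [systemd] enable key introduced later in this series; a minimal config sketch, with an illustrative file path:)

  # e.g. a drop-in under the system instance's conf.d (path illustrative)
  [systemd]
  enable = true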

sdmon also logs the systemd units it has found. Here's an example where I kill -9 a broker while housekeeping is running, then start it back up again:

[  +0.192069] sdmon.err[7]: [email protected] needs cleanup - resources are offline
...
[ +26.900866] sdmon.err[7]: cleanup complete - resources are online

Before cleanup is complete, flux resource status reports:

      STATE UP NNODES NODELIST
      avail  ✔      7 picl[0-6]
     avail*  ✗      1 picl7

This seems to do the bare minimum to resolve #6590.

It does seem a bit thin in the tooling department. The possibilities are pretty broad, so for now I wanted to get this posted and get feedback on the way the resource module is tied into sdmon using broker groups.

Problem: a comment has an extra "to" that makes the sentence incorrect.

Drop the extra word.

Problem: the timer used by sdbus_connect() is hard to modify because
of the embedded error handling.

Extract a function for building the user bus path for the error log.
Now the timer is a bit simpler.

Problem: the sdbus module is hardwired to connect to a systemd
user instance, but Flux now has "work" running in the systemd
system instance as well (prolog, epilog, housekeeping).

Add a "system" module option which directs sdbus to connect
to the systemd system instance instead.  Future commits will allow
a second instance of the sdbus module to be loaded with this option
so access to both systemd instances can be handled concurrently.
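
As a sketch, loading both instances might then look like this, combining the "system" option with the --name loading described in a later commit (the exact rc wiring is an assumption):

  flux module load sdbus                          # systemd user instance (default)
  flux module load --name sdbus-sys sdbus system  # systemd system instance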

Problem: the sdbus system option has no coverage.

Amend the 2407-sdbus.t test with a simple test of "system mode".

Problem: the sdbus module can only be loaded once because it
uses an explicit service name.

Drop the outdated MOD_NAME() symbol declaration.
Register methods in a way that lets the default service name change.
Update the self-contacting "subscribe" composite RPC to determine
the topic string to contact programmatically.

Now the module can be loaded as many times as we like using e.g.
  flux module load --name NEWNAME sdbus
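
As a rough sketch of what registering methods relative to the loaded name can look like: flux-core provides flux_msg_handler_addvec_ex() for registering a handler table under a caller-supplied service name, and a module's loaded name is available via the "flux::name" aux key. The callbacks and table contents here are hypothetical:

  #include <flux/core.h>

  /* Hypothetical method callbacks. */
  static void call_cb (flux_t *h, flux_msg_handler_t *mh,
                       const flux_msg_t *msg, void *arg)
  {
      flux_respond_error (h, msg, ENOSYS, NULL);
  }

  /* Topics are given relative to the service name, so a copy loaded as
   * "sdbus-sys" serves sdbus-sys.call, etc.  (Assumes addvec_ex()
   * prefixes each table topic with the service name.)
   */
  static const struct flux_msg_handler_spec htab[] = {
      { FLUX_MSGTYPE_REQUEST, "call", call_cb, 0 },
      FLUX_MSGHANDLER_TABLE_END,
  };

  int mod_main (flux_t *h, int argc, char **argv)
  {
      const char *name = flux_aux_get (h, "flux::name");
      flux_msg_handler_t **handlers = NULL;
      int rc;

      if (!name || flux_msg_handler_addvec_ex (h, name, htab, NULL,
                                               &handlers) < 0)
          return -1;
      rc = flux_reactor_run (flux_get_reactor (h), 0);
      flux_msg_handler_delvec (handlers);
      return rc;
  }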

Problem: there are no tests for loading sdbus under a different name.

Modify the system test to load sdbus under the name "sdbus-sys" in system
mode instead of reloading the module.  Show that it works for listing
units in the system instance.

Problem: when the system is configured to use systemd, sdbus is only
loaded for the systemd user instance.

Load sdbus-sys as well.

Problem: some libsdexec RPCs can now be directed to different
services to reach the systemd system or user instance.

Add a service parameter to the following functions:
  sdexec_list_units()
  sdexec_property_get()
  sdexec_property_get_all()
  sdexec_property_changed()

Update sdexec.
Update tests.
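
For illustration, a call site after this change might look like the following. Only the added service argument is given by the commit message; the parameter order, header path, and unit glob are assumptions:

  #include <flux/core.h>
  #include "src/common/libsdexec/list.h"   /* path per this repo */

  /* Sketch: list flux units in the systemd system instance by directing
   * the RPC at the "sdbus-sys" service.  Pattern is illustrative.
   */
  static flux_future_t *list_system_units (flux_t *h)
  {
      return sdexec_list_units (h, "sdbus-sys", FLUX_NODEID_ANY, "flux-*");
  }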

@grondo (Contributor) commented Feb 7, 2025

Wow, nice! I don't have any qualms with using an sdmon.idle broker group for the current implementation.

It seems like eventually we'd want various subsystems to be able to recapture their state from what sdmon has found (for example, the job execution system could reconnect to running jobs after restart or terminate jobs that are not supposed to be running, and the job manager could do something similar for prolog/epilog and housekeeping). Any thoughts on how that might work? I realize bringing that up is a bit premature, but it could inform the solution here as a stepping stone. (I guess one thought is that as state is able to be recaptured, this would reduce the list of things that prevent a broker from joining sdmon.idle.)

Also, since the sdmon.idle group is never left once joined (except in the case of a broker restart, I'm assuming?), I'm wondering if there's a better term. I can't think of anything though, and it doesn't seem important enough to worry about now (especially since it isn't really user/admin visible).

@garlick (Member, Author) commented Feb 7, 2025

Maybe sdmon.online is a better name for the group. (It started out as an idle group that was kept up to date and caused resource to post idle/busy events, but then I realized it could be a lot simpler and didn't revisit the group name.)

Problem: there is no mechanism to track systemd units across
a broker restart.

Add a broker module that creates and maintains a list of running
flux systemd units.

This monitors two instances of systemd:
- the user one, running as user flux (where jobs are run)
- the system one (where housekeeping, prolog, epilog run)

A list of units matching flux unit globs is requested at initialization,
and a subscription to property updates on those globs is obtained.
After the initial list, monitoring is driven solely by property updates.

Join the sdmon.online broker group once the node is demonstrably idle.
This lets the resource module on rank 0 notice compute nodes that need
cleanup at restart and withhold them from the scheduler.  Once the group
is joined, sdmon does not explicitly leave it.  It implicitly leaves the
group if sdmon is unloaded or the node goes offline/lost.

If there are running units at startup, log this information at LOG_ERR
level, and again when the units are cleaned up, e.g.

 [email protected] needs cleanup - resources are offline
 cleanup complete - resources are online

In the future, this module's role could be expanded to support tooling
for listing running work and obtaining runtime information such as
pids and cgroup resource parameters.  It could also play a role in
informing other flux components about work that should be re-attached
after a full or partial restart, when support for that is added.
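
A compressed sketch of the join step described above, assuming the broker groups service ("groups.join"); the context struct and function names are invented:

  #include <stdbool.h>
  #include <czmq.h>
  #include <flux/core.h>

  struct sdmon_ctx {           /* illustrative context */
      flux_t *h;
      zhashx_t *units;         /* running flux units, keyed by unit name */
      bool joined;
  };

  /* Sketch: join sdmon.online once no flux units remain running.
   * Called after the initial unit list and on each property update.
   * Response handling is elided.
   */
  static void sdmon_check_idle (struct sdmon_ctx *ctx)
  {
      flux_future_t *f;

      if (ctx->joined || zhashx_size (ctx->units) > 0)
          return;
      if (!(f = flux_rpc_pack (ctx->h, "groups.join", FLUX_NODEID_ANY, 0,
                               "{s:s}", "name", "sdmon.online")))
          flux_log_error (ctx->h, "groups.join");
      flux_future_destroy (f);
      ctx->joined = true;
  }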

Problem: the sdmon module is not loaded by default.

Load it if systemd.enable = true in the configuration.

Problem: the monitor subsystem of the resource module needs to
know whether the "sdmon.online" broker group will be populated.

Parse the enable key from [systemd].
Pass the whole resource_config struct to the monitor subsystem
instead of just monitor_force_up.
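
Parsing that key might look roughly like this, using the stock flux_conf_unpack() API; the resource_config struct and its field are hypothetical:

  #include <syslog.h>
  #include <flux/core.h>

  /* Sketch: read [systemd] enable, defaulting to false when absent. */
  static int parse_systemd_enable (flux_t *h, struct resource_config *config)
  {
      flux_error_t error;
      int enable = 0;

      if (flux_conf_unpack (flux_get_conf (h),
                            &error,
                            "{s?{s?b}}",
                            "systemd",
                              "enable", &enable) < 0) {
          flux_log (h, LOG_ERR, "error parsing [systemd]: %s", error.text);
          return -1;
      }
      config->systemd_enable = enable;
      return 0;
  }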

Problem: nodes are not checked for untracked running work when a
Flux instance starts up.

This might happen, for example, if
- job-exec deems job shell(s) unkillable
- housekeeping/prolog/epilog gets stuck on a hung file system
- the broker exits without proper shutdown

When systemd is enabled, the new sdmon module joins the 'sdmon.online'
broker group on startup.  However, if there are any running flux units,
this is delayed until those units are no longer running.

Change the resource module so that it monitors sdmon.online instead of
broker.online when systemd is enabled.  This will withhold "busy" nodes
from the scheduler until they become idle.

Fixes flux-framework#6590
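
The monitor-side switch can stay small; a sketch, assuming the config flag parsed in the previous commit:

  /* Sketch: gate scheduling on sdmon.online (online AND cleaned up)
   * when systemd is enabled, else on plain broker liveness.
   */
  const char *group = config->systemd_enable ? "sdmon.online"
                                             : "broker.online";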

Problem: there is no test coverage for the sdmon module.

Add a new sharness script.

@garlick (Member, Author) commented Feb 7, 2025

Renamed the group and fixed a spelling error in a test (caught in CI).

This still needs a test for the resource module portion of the proposed change, so I'll leave it WIP for the moment.

Problem: there is no test coverage for the resource module's
behavior when systemd is configured and sdmon is providing
sdmon.online.

Add a sharness script for that.

@garlick changed the title from "WIP avoid scheduling jobs on compute nodes that are not cleaned up" to "avoid scheduling jobs on compute nodes that are not cleaned up" on Feb 7, 2025

@garlick (Member, Author) commented Feb 8, 2025

I added the missing test, so I'll drop the WIP.

One thing I should do before we merge this, though, is make sure the systemd shipped with RHEL 8 allows sdbus to authenticate to it with flux credentials. I'll try to test that on fluke.

codecov bot commented Feb 8, 2025

Codecov Report

Attention: Patch coverage is 72.76265% with 70 lines in your changes missing coverage. Please review.

Project coverage is 79.52%. Comparing base (d9cde83) to head (f7bdb9d).
Report is 11 commits behind head on master.

Files with missing lines          Patch %   Lines missing
src/modules/sdmon/sdmon.c         71.87%    54 ⚠️
src/common/libsdexec/property.c   33.33%     6 ⚠️
src/modules/sdbus/sdbus.c         57.14%     6 ⚠️
src/modules/sdbus/connect.c       82.60%     4 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6616      +/-   ##
==========================================
+ Coverage   79.50%   79.52%   +0.02%     
==========================================
  Files         531      532       +1     
  Lines       88363    88597     +234     
==========================================
+ Hits        70251    70456     +205     
- Misses      18112    18141      +29     
Files with missing lines          Coverage Δ
src/common/libsdexec/list.c       85.71% <100.00%> (+85.71%) ⬆️
src/modules/resource/monitor.c    70.49% <100.00%> (+0.49%) ⬆️
src/modules/resource/resource.c   86.80% <100.00%> (+0.13%) ⬆️
src/modules/sdbus/main.c          69.23% <100.00%> (ø)
src/modules/sdbus/subscribe.c     70.00% <100.00%> (+7.50%) ⬆️
src/modules/sdexec/sdexec.c       70.87% <ø> (ø)
src/modules/sdbus/connect.c       77.27% <82.60%> (+9.27%) ⬆️
src/common/libsdexec/property.c   41.86% <33.33%> (+4.36%) ⬆️
src/modules/sdbus/sdbus.c         69.42% <57.14%> (-0.84%) ⬇️
src/modules/sdmon/sdmon.c         71.87% <71.87%> (ø)

... and 12 files with indirect coverage changes

@garlick (Member, Author) commented Feb 8, 2025

Yep, that worked:

2025-02-07T16:41:43.412549-08:00 sdbus.info[0]: sd_bus_open_system: connected

@garlick (Member, Author) commented Feb 8, 2025

> (I guess one thought is that as state is able to be recaptured, this would reduce the list of things that prevent a broker from joining sdmon.idle.)

Right, I like that way of thinking about it.

Hmm, we should also be trying to capture the state of any units that have completed but weren't reaped, and put that in a lost+found or something. I need to refresh my memory on what happens to that state for the cases we're discussing here (the templated system units and transient user units). That could be a follow-on PR.

But anyway, yeah, if a running unit can be reclaimed, we could let the node join sdmon.online before it terminates.
