Suggestion: in flux jobs, show jobs with nodes still in housekeeping #6248

Open
kkier opened this issue Aug 30, 2024 · 4 comments

@kkier
Contributor

kkier commented Aug 30, 2024

There's a corner case we keep running into (which I guess makes it a very wide corner):

  • flux resource status shows N nodes available
  • flux jobs -A shows no running jobs
  • User submits job that will use N nodes
  • Because one or more nodes are still in housekeeping, the job cannot be allocated and does not start, but it does not error out either

I realize this is covered by also checking flux resource list for nodes marked as allocated, so the change would be more of a QoL improvement for users and admins than anything else. But since it's my life we're talking about, naturally I'm all for it.
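A minimal sketch of the manual check described above, written as plain set arithmetic. It assumes that nodes still in housekeeping are reported as allocated (as flux resource list is said to do in this thread); the function names and node names are hypothetical and this does not call any Flux API.

```python
def nodes_free_now(up: set[str], allocated: set[str]) -> set[str]:
    """Nodes that are up and not allocated (allocated includes housekeeping)."""
    return up - allocated


def can_start_now(nnodes: int, up: set[str], allocated: set[str]) -> bool:
    """True if a job asking for nnodes could be allocated immediately."""
    return nnodes <= len(nodes_free_now(up, allocated))


# Hypothetical 4-node system: no job is running, but node2 is still
# in housekeeping after a previous job, so it counts as allocated.
up = {"node0", "node1", "node2", "node3"}
allocated = {"node2"}

print(can_start_now(4, up, allocated))  # a 4-node job has to wait
print(can_start_now(3, up, allocated))  # a 3-node job fits
```

The point of the sketch is that "up" alone overstates what a job can get; the allocated set has to be subtracted first.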

Possible implementation: a new job state between CLEANUP and INACTIVE.

@garlick
Member

garlick commented Aug 30, 2024

One thing we talked about when housekeeping was being designed was representing housekeeping as a separate system job running as the flux user. I started to implement that but found it was more challenging than expected, and we needed something to staunch the bleeding on el cap. Maybe reviving that could address this issue without introducing unnecessary coupling between the original job and housekeeping.

@grondo
Contributor

grondo commented Aug 30, 2024

I haven't looked into how difficult this would be, but maybe flux resource status could be augmented to show nodes that are currently in housekeeping? That might be more straightforward than trying to represent housekeeping as a job, and we do already support the ephemeral torpid state in flux resource status as a precedent.

I can't decide which is the "right" approach here though. Representing the housekeeping workload as a job is an attractive option, but since it isn't really a job, so many "job" things wouldn't work that it seems like it could cause more trouble down the road...

@garlick
Member

garlick commented Aug 30, 2024

Hmm yeah, that is partly what made the job idea hard. Going all the way and making it a real job with all the trimmings seems like overkill.

@grondo
Contributor

grondo commented Sep 3, 2024

It is trivial to add a new housekeeping state to flux resource status, e.g.:

$ src/cmd/flux resource status
       STATE UP NNODES NODELIST
       avail  ✔    101 corona[171,173-186,188-194,196-207,213-214,219,221-230,232-250,252-253,255-259,261-269,272-275,277-278,280,282-285,287-290,292-294,296]
     exclude  ✔      4 corona[81-82,211-212]
    exclude*  ✗      1 corona260
housekeeping  ✔      2 corona[189-190]
    drained*  ✗      7 corona[172,187,231,254,279,291,295]
     drained  ✔     12 corona[195,215-218,220,251,270-271,276,281,286]

Note that the avail state in flux resource status doesn't mean "available to jobs now". It covers the nodes that are not excluded, not drained, and (in v0.66.0 and later) not currently marked as torpid. So the avail node set does include the housekeeping node set, unless we think that should change?

Also, for some reason, flux resource status doesn't currently support listing allocated nodes, even though it has the information. If we enable that, then flux resource status -s all could be used to display all states, including any allocated nodes (which include those in housekeeping).

At this point, I'm not sure if the above is helpful or not. It is easy to include, and we could leave both housekeeping and allocated out of the default output. To get full details you would then need to run flux resource status -s all and be cognizant of the fact that some of the sets may overlap.
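To make the overlap concrete, here is a rough sketch that expands the avail and housekeeping node lists from the sample output above and compares them. The expand helper is a hand-rolled, hypothetical parser that only handles a single prefix[a,b-c] list of non-padded integers; it is not Flux's own hostlist code.

```python
import re


def expand(hostlist: str) -> set[str]:
    """Expand 'prefix[a,b-c]' into hostnames; a bare hostname is returned as-is."""
    m = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", hostlist)
    if m is None:
        return {hostlist}
    prefix, body = m.groups()
    hosts = set()
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo  # a single index such as "171" is a one-element range
        for i in range(int(lo), int(hi) + 1):
            hosts.add(f"{prefix}{i}")
    return hosts


# Node lists copied from the sample `flux resource status` output above.
avail = expand(
    "corona[171,173-186,188-194,196-207,213-214,219,221-230,232-250,"
    "252-253,255-259,261-269,272-275,277-278,280,282-285,287-290,"
    "292-294,296]"
)
housekeeping = expand("corona[189-190]")

overlap = avail & housekeeping  # nodes counted in both states
usable = avail - housekeeping   # avail nodes that are not in housekeeping

print(len(avail), sorted(overlap), len(usable))
```

With the sample data, both housekeeping nodes also appear in avail, so anyone summing the per-state counts from the -s all output would count them twice.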

I'm open to any feedback here.
