WIP: enable RFC 14 resource ranges #6632

sam-maloney · 2025-02-13T09:49:42Z

This is my initial attempt to investigate what is required to enable resource ranges in jobspec à la RFC 14. Very rough at this point (in particular there is lots of work to do on the testsuite) but I am interested in discussion about some potential design considerations that have come up. The code to this point allows jobspec with ranges that is otherwise V1 compliant to be allocated and run successfully, but at least the job-list service still issues a non-fatal error, and I've just ignored sched-simple at this point.

Firstly, because flux-sched already largely supports ranges, once a couple of explicit version checks are tweaked in libjob and shell, we can already get back a resource set allocation from fluxion. From a design perspective, my personal opinion about these version checks is that it should be exclusively the remit of the validator to actually fail jobspec purely based on the version number. If a jobspec passes validation, then it seems to me that other components should at most warn about unsupported version numbers, but otherwise try their best to handle the rest of the jobspec (in future, I suppose a version check could also determine different code pathways for different versions). This would allow someone in future to define an extension of a supported version and not have other components arbitrarily fail even when all the information they require is present and conformant.

Secondly, the lack of information about slots in RFC 20 resource sets is problematic. In short, in my opinion the shell really shouldn't have to parse the jobspec resource section at all to determine what resources it will run on, it should just be able to consult the R allocation returned by the scheduler. flux-sched can already handle ranges for any/all of the resources in V1 jobspec, but if a range is used for the slot resource then the slot count decided by the scheduler is not recorded in R. If a fixed count was used for the cores then the slot count can be determined fairly easily in V1, but if a range was also used for the cores then the shell has to re-determine (guess, really) what the chosen slot and core counts should be (e.g. if 6 cores are allocated per node in R, then you could potentially have slot:core counts of 1:6, 2:3, 3:2, or 6:1).

In my mind the ideal long-term solution is to have slot information included in the allocated resource set; I know it's a "virtual" resource, but it's still part of the scheduler's decision making, and especially looking ahead to more complicated possibilities of canonical jobspec it seems like it will definitely be important to preserve the slot information in the resource graph allocated by the scheduler.

For now, I've added infrastructure to the shell initialization that ~~guesses~~ resolves a slot:core count combination if needed, but it perhaps raises the question of whether it even makes sense to allow ranges for both slot and core resources simultaneously? Philosophically, I generally prefer having fewer restrictions, lest some future user come up with a strange use case one never predicted, but requiring a fixed count for at least one of the slot or core resources does greatly simplify things at this point, and it doesn't seem too restrictive to forbid leaving it entirely to the scheduler to decide what combination of number of slots and cores are allocated!

src/shell/jobspec.c

+        }
+    }
+    // no specific error was detected, but no matching combination was found
+    if (slot_count < 1) {


Problem: version check in jj.c triggers fatal error if version != 1 which will be needed for resource ranges. Change the error to instead be a warning that version numbers other than 1 are unsupported, but execution will continue.

Problem: now that version numbers other than 1 continue to execute in libjob, several tests fail that expect such jobs to terminate. Modify the tests to expect successful completion, and detect new warning message in t0022.

Problem: version check in jobspec.c triggers fatal error if version!=1 which will be needed for resource ranges. Change the error to instead be a warning that version numbers other than 1 are unsupported, but execution will continue.

Problem: other commits in this PR allow version numbers other than 1, so we need a different parameter in 2608 to cause early shell error. Modifies the early fatal error test in t2608 to use an integer instead of required dictionary for '.attributes.system.environment'.

Problem: jobspec_parse() function needs information on actual resource allocation in order to resolve ranges in jobspec, but is currently just getting jobspec string. As an initial step to simplify the dataflow to jobspec_parse(), move the initial conversion of jobspec from string to json and field extraction from jobspec_parse() into lookup_jobspec_get().

Problem: counts for resources are currently being parsed only from jobspec, but if ranges are used then values cannot be determined. Modify shell init data flow to use the actual allocation from the scheduler in R to determine node and core counts.

Problem: slots are not included in the RFC 20 resource set R that is returned by the scheduler, so if both slots and cores are ranges, one must resolve them consistently (i.e. guess what the scheduler did) Add infrastructure for parsing and looping over RFC 14 ranges and use it to determine a consistent node/slot/core count combination

github-advanced-security bot found potential problems Feb 13, 2025

View reviewed changes

src/shell/jobspec.c

}

}

// no specific error was detected, but no matching combination was found

if (slot_count < 1) {

Check warning

Code scanning / CodeQL

Comparison result is always the same Warning

Comparison is always true because slot_count <= -1.

sam-maloney added 7 commits February 14, 2025 07:49

libjob: change version check error to warning

58af246

Problem: version check in jj.c triggers fatal error if version != 1 which will be needed for resource ranges. Change the error to instead be a warning that version numbers other than 1 are unsupported, but execution will continue.

testsuite: modify version number checks in libjob

894faec

Problem: now that version numbers other than 1 continue to execute in libjob, several tests fail that expect such jobs to terminate. Modify the tests to expect successful completion, and detect new warning message in t0022.

shell: change version check error to warning

d5c4746

Problem: version check in jobspec.c triggers fatal error if version!=1 which will be needed for resource ranges. Change the error to instead be a warning that version numbers other than 1 are unsupported, but execution will continue.

sam-maloney force-pushed the nonV1-jobspec branch from b3d55c2 to 216a80e Compare February 14, 2025 06:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: enable RFC 14 resource ranges #6632

WIP: enable RFC 14 resource ranges #6632

sam-maloney commented Feb 13, 2025

WIP: enable RFC 14 resource ranges #6632

Are you sure you want to change the base?

WIP: enable RFC 14 resource ranges #6632

Conversation

sam-maloney commented Feb 13, 2025