Canonical representation of resources #237
As a start, let's propose a very simple JGF version of R and do a straw man integration with something like simple-sched. Since R would no longer be human readable, we'd need to develop some tools to display and operate on R as well... |
My initial proposal would be just to use the format under I will post a simple example that contains cluster, node, and core as currently used by flux-core. This limitation can serve as the V1 of this canonical format.
|
We discussed this a bit at today's meeting: is the canonical representation of R good enough to be targeted by various writers/generators? A writer/generator will produce an R and expect that the rest of the system will more or less just work:
Then, other services will be serviced off of the R. Many will use manipulation libraries, though. A comment was made that JGF is likely to work, but the harder part will be how to map execution targets to R. |
@dongahn, are there instructions for having flux-sched generate R for jobs which contain |
It would have been a bit nicer since it has some fixes, but please look at: https://github.com/flux-framework/flux-sched/blob/2c3b9ec75139f408f75ac3963b77c087598c27d6/t/t1006-recovery-full.t#L28 Load options ( I was planning to spend some time on this as well next week. So this is great timing. |
You should also be able to change the match emit format through resource's rc1 script: FLUXION_RESOURCE_OPTIONS="match-format=rv1 load-whitelist=node,core,gpu" |
If you want to look at this for more advanced graph representations, please consider using
Example GRUG files including things like multi-tiered storage configurations: https://github.com/flux-framework/flux-sched/blob/master/t/t3020-resource-mtl2.t#L9 |
Thanks! I was able to do:
For my own benefit, here's an example rv1 for a 2-core allocation in a docker container ƒ(s=1,d=0) fluxuser@428d6d454f60:~$ flux job info 646258360320 R | jq
{
"version": 1,
"execution": {
"R_lite": [
{
"rank": "0",
"node": "428d6d454f60",
"children": {
"core": "2-3"
}
}
],
"starttime": 1589816042,
"expiration": 1590420842
},
"scheduling": {
"graph": {
"nodes": [
{
"id": "7",
"metadata": {
"type": "core",
"basename": "core",
"name": "core2",
"id": 2,
"uniq_id": 7,
"rank": 0,
"exclusive": true,
"unit": "",
"size": 1,
"paths": {
"containment": "/cluster0/428d6d454f60/socket0/core2"
}
}
},
{
"id": "9",
"metadata": {
"type": "core",
"basename": "core",
"name": "core3",
"id": 3,
"uniq_id": 9,
"rank": 0,
"exclusive": true,
"unit": "",
"size": 1,
"paths": {
"containment": "/cluster0/428d6d454f60/socket0/core3"
}
}
},
{
"id": "2",
"metadata": {
"type": "socket",
"basename": "socket",
"name": "socket0",
"id": 0,
"uniq_id": 2,
"rank": 0,
"exclusive": false,
"unit": "",
"size": 1,
"paths": {
"containment": "/cluster0/428d6d454f60/socket0"
}
}
},
{
"id": "1",
"metadata": {
"type": "node",
"basename": "428d6d454f60",
"name": "428d6d454f60",
"id": -1,
"uniq_id": 1,
"rank": 0,
"exclusive": false,
"unit": "",
"size": 1,
"paths": {
"containment": "/cluster0/428d6d454f60"
}
}
},
{
"id": "0",
"metadata": {
"type": "cluster",
"basename": "cluster",
"name": "cluster0",
"id": 0,
"uniq_id": 0,
"rank": -1,
"exclusive": false,
"unit": "",
"size": 1,
"paths": {
"containment": "/cluster0"
}
}
}
],
"edges": [
{
"source": "2",
"target": "7",
"metadata": {
"name": {
"containment": "contains"
}
}
},
{
"source": "2",
"target": "9",
"metadata": {
"name": {
"containment": "contains"
}
}
},
{
"source": "1",
"target": "2",
"metadata": {
"name": {
"containment": "contains"
}
}
},
{
"source": "0",
"target": "1",
"metadata": {
"name": {
"containment": "contains"
}
}
}
]
}
}
}
|
Great! Note that the
How does sched-simple use hwloc data? Would it be straightforward to create an interface such that it can turn this form into what sched-simple requires? |
sched-simple does not use hwloc data directly, but instead reads the aggregated information from
Of course JGF has more than enough information in it to be used by the simple scheduler.
I thought we were proposing an Rv2 where the format was JGF? |
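For concreteness, here is a minimal sketch (not flux-core's actual reader; `expand_idset` is a simplified stand-in for a real RFC 22 idset parser, handling only plain ranges and comma lists) of how a consumer could pull per-rank cores out of the `R_lite` section of the Rv1 object shown above:

```python
import json

def expand_idset(s):
    # Simplified idset expansion: "2-3" -> [2, 3], "0,4-6" -> [0, 4, 5, 6]
    ids = []
    for part in s.split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids

# Trimmed-down version of the Rv1 document from the comment above
R = json.loads("""
{
  "version": 1,
  "execution": {
    "R_lite": [
      {"rank": "0", "node": "428d6d454f60", "children": {"core": "2-3"}}
    ],
    "starttime": 1589816042,
    "expiration": 1590420842
  }
}
""")

for entry in R["execution"]["R_lite"]:
    cores = expand_idset(entry["children"]["core"])
    print(entry["rank"], entry["node"], cores)  # prints: 0 428d6d454f60 [2, 3]
```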
FYI -- JGF reader code in flux-sched is https://github.com/flux-framework/flux-sched/blob/master/resource/readers/resource_reader_jgf.hpp, which reads this and updates the graph data store. It not only updates the spatial schema of vertices and edges but also scheduler metadata. @milroy has algorithms and code that can also grow the graph data store using a new JGF, which is the current topic for our cluster submission. The emitted JGF can be fed into Taking the JGF portion from your example and storing it into ./resource.json: ahn1@49674596c035:/usr/src/resource/utilities$ flux mini run --dry-run -n 1 hostname > jobspec.json
ahn1@49674596c035:/usr/src/resource/utilities$ ./resource-query -L resource.json -f jgf -F pretty_simple
INFO: Loading a matcher: CA
resource-query> match allocate jobspec.json
---cluster0[1:shared]
------428d6d454f60[1:shared]
---------socket0[1:shared]
------------core3[1:exclusive]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=ALLOCATED
INFO: SCHEDULED AT=Now
INFO: ============================= |
I see. Is What does this look like when each rank's resource set is different? We may not have this case yet though. |
Where will Rv2 be used? Will the |
No, but
There is an idset entry for each rank or set of ranks that has different summary information. |
I thought Rv2 was going to be our next step towards a "canonical" resource set representation. I think you summarized it well in the comment above. As the canonical representation, R would be the common resource set serialization used by all Flux components that transmit and share resource sets.
Sorry if the above is obvious... |
One simple idea would be to allow something between R_lite and full JGF by allowing JGF "nodes" to represent multiple identical resources. E.g. something along the lines of: {
"nodes": [
{
"id": "0",
"metadata": {
"basename": "fluke",
"exclusive": false,
"ids": "60-63",
"ranks": "[60-63]",
"type": "node",
},
},
{
"id": "1",
"metadata": {
"basename": "core",
"exclusive": true,
"ids": "[0-3]",
"size": 4,
"type": "core",
}
}],
"edges": [
{
"metadata": {
"name": {
"containment": "contains"
}
},
"source": "0",
"target": "1"
}]
} Would something like this be feasible I wonder? Edit: I left out "cluster" and "socket" resources just to make the example readable. Also, removed the "paths" in the serialization because it seems like these can be computed when unpacking serialized graph, so is it really necessary to duplicate in the serialization format? |
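To make the proposal above concrete, here is a rough sketch of expanding such condensed JGF "nodes" back into one vertex per resource. The helper names are hypothetical, idset parsing is simplified (plain ranges with optional surrounding brackets), and edges are not expanded here (a condensed edge would also fan out per parent instance):

```python
def expand_idset(s):
    # Simplified idset expansion: "60-63" or "[0-3]" -> list of ints
    out = []
    for part in s.strip("[]").split(","):
        lo, _, hi = part.partition("-")
        out.extend(range(int(lo), int(hi or lo) + 1))
    return out

def expand_nodes(condensed):
    # Turn each condensed JGF node into one vertex per id in its idset
    vertices = []
    for n in condensed:
        md = n["metadata"]
        for i in expand_idset(md["ids"]):
            v = dict(md, name="%s%d" % (md["basename"], i), id=i)
            del v["ids"]  # per-vertex ids replace the condensed idset
            vertices.append(v)
    return vertices

# The condensed example from the comment above
nodes = [
    {"id": "0", "metadata": {"basename": "fluke", "exclusive": False,
                             "ids": "60-63", "type": "node"}},
    {"id": "1", "metadata": {"basename": "core", "exclusive": True,
                             "ids": "[0-3]", "size": 4, "type": "core"}},
]
for v in expand_nodes(nodes):
    print(v["type"], v["name"])
```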
Yes, that's one example of compression schemes: flux-framework/flux-sched#526 One thing that I don't know is whether applying an ad hoc compression to the representation itself would condense the RV2 better, or whether applying a general compression to the 'canonical' representation would give a better result. |
I would say both would probably be best. Even if you use general compression (by this I assume you mean something like gzip), there is some benefit to ad-hoc compression by decreasing the size of the JSON object ingested into the JSON parser... |
This kind of "modified JGF" sounds pretty appealing to me. We do already lz4 compress KVS data on the back end, so at least the KVS growth would be mitigated somewhat. |
I like this direction. Just as food for thought, in terms of serialization and deserialization needs:
Depending on the implementation, the uncondensed JGF can be omitted, which could very well be what
I think the trade-off space is
Probably serialization/deserialization costs would be small, so IMHO the trade-off would be kvs storage and communication payloads vs. software complexity. Maybe we should play with 1) and 2) a bit to make more progress. I have to think for simple cases this transformation would be straightforward (as I already did something similar for R_lite), but I don't know whether it will be straightforward for more complex cases. In terms of loss of information, compressed RV2 won't give uncondensed JGF unique vertex/edge IDs. I don't know if that's detrimental or not. Need to think some more about whether there could be some critical information that cannot be captured with the condensed form. |
I had thought about this, but since JGF node "id" is also a string, could this be replaced with an idset as well?
I think I'm still a bit lost. If the implementation of RV2 is JGF, I'm not sure what you mean by omitting it. Are you considering one option where JGF remains an optional part of R? |
Yes, we can do this. But because the id sequence will not be the same as the resource id sequence (e.g., core[0-35]), the idset will not compress well. |
Oh yeah, and this would only work for resources at the highest level in the tree, for nodes |
Could the containment "path" be used as a stand-in for a unique identifier for all resources? This could be computed after a compressed JGF is expanded. One of the benefits of the unique identifier is so that an R used in a sub-instance several levels deep within the Flux instance hierarchy can relate its resources directly to any of its parents, including the original system instance. At first we had assigned uuids to each resource to enable this, but it seems like the containment path like |
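A quick sketch of that idea: the containment paths can be recomputed from vertex names plus containment edges after unpacking, so they need not be serialized. Vertex and edge shapes follow the Rv1 "scheduling" example earlier in the thread; `compute_paths` is a hypothetical helper:

```python
def compute_paths(nodes, edges):
    # Map each vertex to its name and its containment parent
    name = {n["id"]: n["metadata"]["name"] for n in nodes}
    parent = {e["target"]: e["source"] for e in edges}
    paths = {}
    for vid in name:
        # Walk from this vertex to the root, collecting names
        parts, cur = [], vid
        while cur is not None:
            parts.append(name[cur])
            cur = parent.get(cur)
        paths[vid] = "/" + "/".join(reversed(parts))
    return paths

# Subset of the vertices/edges from the Rv1 example above
nodes = [
    {"id": "0", "metadata": {"name": "cluster0"}},
    {"id": "1", "metadata": {"name": "428d6d454f60"}},
    {"id": "2", "metadata": {"name": "socket0"}},
    {"id": "7", "metadata": {"name": "core2"}},
]
edges = [
    {"source": "0", "target": "1"},
    {"source": "1", "target": "2"},
    {"source": "2", "target": "7"},
]
print(compute_paths(nodes, edges)["7"])
# prints: /cluster0/428d6d454f60/socket0/core2
```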
I think, in general, you can choose only a single compression criterion (like a local resource's local id, e.g. core[0-35]) at each level of the resource hierarchy, and if a resource has a per-resource field that cannot be compressed with that same criterion (e.g., uniq_id, uuid, properties, whatever), you can't include it in the condensed JGF (or you have to make the condensed node more fine-grained). So we have to think about the loss of information and see if that's okay or not... |
Oh yeah, this should be possible!
Agreed. |
One example where this makes sense is Corona, which will have two different types of nodes (one with 4 GPUs, the other with 8). |
I am talking about a phase where the proposed condensed JGF will be translated into the original JGF and vice versa. For Fluxion, that may be the first step I want to take. Another example could be creating RV1 from an external source like Cray endpoints. You may first want to collect the individual resource info from the external source, dump it into uncondensed JGF, and then process it into the proposed "condensed" RV2. |
Similarly,
I think ids and ranks in general cannot be condensed cleanly this way? |
If there are the same number of values for each key, then you could condense, I would assume, though perhaps not cleanly. You would have to "condense" on a primary key, say "ids", then have some standard way of generating the other condensed keys based on either the index or the value of the primary key. For your example above, the idset for That reminds me that idsets can't actually be used here since we'd need a list. |
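A sketch of the primary-key idea (hypothetical helper, not a proposed API): condense on "ids", and for any per-resource field whose values differ, emit a plain parallel list ordered by the primary key (a list, not an idset, since order matters):

```python
def condense(resources, primary="id"):
    # Sort by the primary key so parallel lists line up with "ids"
    resources = sorted(resources, key=lambda r: r[primary])
    out = {"ids": [r[primary] for r in resources]}
    keys = set().union(*resources) - {primary}
    for k in sorted(keys):
        vals = [r.get(k) for r in resources]
        # Identical values collapse to a scalar; otherwise keep the list
        out[k] = vals[0] if len(set(vals)) == 1 else vals
    return out

# Hypothetical example: four cores spread across two ranks
cores = [{"id": 0, "rank": 3}, {"id": 1, "rank": 3},
         {"id": 2, "rank": 5}, {"id": 3, "rank": 5}]
print(condense(cores))
```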
Oh, that would be ideal! |
One thing I'm clear on after this discussion today though, Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation... |
Just a reminder that flux-core already depends on liblz4. I'm not sure it's clear it will be a win to trade computation/extra complexity for message size, but if we do go that way, I prefer we not take on another compression library dependency. lz4 does pretty well anyway:
(3.71x and 2.02x respectively; with |
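A measurement along these lines can be reproduced as follows; this sketch uses Python's zlib as a stand-in where lz4 bindings are unavailable, so the exact ratios will differ from lz4c's, and the synthetic documents are illustrative rather than real Fluxion output:

```python
import json
import zlib

# Fully specified form: one JGF-style vertex per core
full = json.dumps({"nodes": [
    {"id": str(i), "metadata": {"type": "core", "basename": "core",
                                "name": "core%d" % i, "id": i}}
    for i in range(64)]})

# Condensed form: one vertex carrying an idset for all 64 cores
condensed = json.dumps({"nodes": [
    {"id": "0", "metadata": {"type": "core", "basename": "core",
                             "ids": "[0-63]", "size": 64}}]})

# Compare raw vs. compressed sizes of the two forms
for label, doc in (("full", full), ("condensed", condensed)):
    raw = doc.encode()
    print(label, "raw:", len(raw), "compressed:", len(zlib.compress(raw)))
```

The interesting comparison is how much of the condensed form's raw-size advantage survives once both forms are compressed.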
@dongahn could you recompile this sentence with different optimization please? :-) |
Here's my initial thought, though I don't claim to have the right answer. Fully specified JGF should be the default canonical representation. However, a simplified, condensed version (also valid JGF) should be allowed where there is no information loss. (Something simple and obvious like above). |
Sorry. I guess the point I was trying to make: what should be our canonical representation -- the condensed form or un-condensed form. A compiler analogy: they have the canonical intermediate representation (IR), which then gets compiled down to machine code (actual storage format). |
That's fine. The reason for the testing was just to see the relative advantages of the two forms when "compressed". Using your example: the condensed form is 2.29x smaller in raw size (1587/692). But when compressed with lz4c, the condensed form is only 1.25x smaller. Since we will likely keep the object compressed, I wasn't sure if this 25% was worth the extra complexity. But like I said, the relative advantages would change at larger scale, hence my comment: we may want to continue to test our proposed scheme to check the gains. Hope this makes better sense. |
I don't have the right answer here either. BTW, don't get me wrong though. I'm asking all these questions to think this through. Hopefully we can settle on something really cool in the end :-). |
@grondo: Just to confirm, I like the hybrid approach like I said in the last meeting. In the compiler world, there is a difference between canonical vs. non-canonical representations, but we don't have to be too pedantic here. In particular, at the system instance it should be straightforward to emit the "condensed" form either from a resource configuration spec or from other external sources. At this point, I am unclear how easy or difficult it will be for Fluxion to emit the condensed form instead of the fully concretized JGF. But compression and such was a task we needed to do anyway, so having a target should be helpful. By making fully specified JGF the default representation, we will be able to take a phased approach and learn how to do this properly. Two things:
The proposed form is very similar to GRUG (https://github.com/flux-framework/flux-sched/blob/master/resource/utilities/README.md#recipe-graph-definition). I used that format to specify a recipe to generate a fully concretized JGF. In fact, the first way to support RV2 from the system instance would be to use the new format as another generation recipe.
|
My only idea here is to have the exec system emit an R that is annotated with the assigned task slots. A child instance can reasonably assume that task slot ids directly map to broker ranks. (Actually writing that, maybe it is the job shell that would need to annotate R?) |
Great idea. If we were to go this route, I think we should consider explicitly formalizing the relationship between the task slot id space of the parent instance and the execution target ID space of a nested instance (augmenting some RFC). The other idea I was thinking about was for the nested instance to go through a "remap" step by comparing its overall RV2 with per-execution-target hwloc info. This would be similar to what you might do at the system instance. But if the relationship between the task slot ids of the parent and the execution target ID space of a nested instance can be made explicit and formalized, that would lead to a much more efficient implementation, I think. |
How easy or difficult would it be to do this annotation directly in the condensed format? |
It feels like we have had some good verbal exchanges so far, and maybe we can start a (simplified) strawman RFC for RV2 and do some prototyping to test its viability. My takeaway so far:
|
This might be best, but I wasn't sure if it was a tractable problem! If we have a way to do it, then like you said the core resource module could annotate execution targets at instance startup in either the case of a system instance or child instance. TBH, I'm not sure exactly the best way to add the execution target annotation to R yet. Would it be best to add a property to an existing resource (vertex), or would it make more sense to treat execution target as a "grouping" vertex (i.e. non-resource vertex). |
Great. Now that the most recent sched-simple PR is in, I'm going to try to make some progress on Rv2. In flux-core, we have a lot of users of the R_lite format that will need to transition to Rv2. My idea is to prototype a C API reader of some form of Rv2, and then add a function that can convert to R_lite as a transition tool. Then we can begin to add functionality required by flux-core components ( Once this is working we can then update |
@dongahn, do you have any suggestions on how to do a task slot/execution target annotation to a resource set? It seems like you have to have some way to group resources, so either a tag on every resource in the slot, or would it be better to allow some kind of virtual resource group vertex (similar to how slot is specified in jobspec)? |
Functionality-wise this is tractable. Scalability-wise, we may need more cleverness, I think. We have a functionality proof of concept in our old scheduler (version 0.7). I called it This isn't that scalable because only one process does this operation. |
I was imagining rewriting the "rank" fields for every resource vertex managed by the rank. I'm a bit hesitant to introduce non-physical vertices into the graph, as I was burned once by dealing with matching a non-physical "slot" against a graph. I know you are talking about a different thing, though. |
Job shell has to map tasks onto task slots anyway, since it is going to start the tasks, so maybe it makes sense to do this mapping there. Perhaps the annotation could be done in its own graph (other than the containment or default graph in the JGF). As such, it could actually be kept as a separate document, or at least easily added/extracted from an existing R. These annotation graphs might just consist of new edges. I guess the drawback would be a potentially large number of duplicated edges. |
FWIW, Fluxion has its R_lite emitter, going from a fully concretized graph object to Are you thinking about unpacking the condensed JGF format into a graph object and emit There's got to be a good C-based lightweight graph library as well. |
Yes, it is probably most straightforward to just have a numeric identifier propagated to all vertices that are managed by that execution target. We may want to update terminology to refer to exec_target instead of rank. I wonder if it makes sense for a resource vertex to list multiple exec_target IDs. E.g., the root vertex of a hierarchy would list all execution target IDs, the next level (say "switch") would list all execution target IDs below the switch, and so on. This might make a search for a certain execution target easier, as well as fetching something like the R_local for a given exec ID. |
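A sketch of that aggregation idea (plain dicts standing in for real JGF vertices; `propagate_targets` is a hypothetical helper): propagate exec_target IDs from the leaves up so that every vertex lists the execution targets beneath it:

```python
def propagate_targets(children, leaf_target, root):
    """children: vertex -> child vertices; leaf_target: vertex -> exec_target."""
    listing = {}

    def visit(v):
        # Aggregate this vertex's own target (if any) with all descendants'
        agg = set()
        if v in leaf_target:
            agg.add(leaf_target[v])
        for c in children.get(v, ()):
            agg |= visit(c)
        listing[v] = sorted(agg)
        return agg

    visit(root)
    return listing

# Hypothetical topology: cluster -> switch -> two nodes (exec targets 0, 1)
children = {"cluster0": ["switch0"], "switch0": ["node0", "node1"]}
leaf_target = {"node0": 0, "node1": 1}
listing = propagate_targets(children, leaf_target, "cluster0")
print(listing["cluster0"], listing["switch0"], listing["node0"])
```

With this listing in hand, finding every vertex touched by a given execution target (or extracting its R_local slice) becomes a simple membership test.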
I think in flux-core we would like to ignore graph and focus only on what you term the "containment" hierarchy. So we would unpack the JGF containment hierarchy into a simple tree and then emit R_lite from that. This is meant to be a bridge approach to get us one step forward. |
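A rough sketch of that bridge (not the actual prototype, which would be a C API): walk the JGF vertices from the Rv1 example earlier, group cores under their node by rank, and emit R_lite entries. The idset encoding here is naive and only handles a contiguous range:

```python
import json
from collections import defaultdict

def jgf_to_rlite(graph):
    # Group node names and core ids by broker rank
    by_rank = defaultdict(lambda: {"node": None, "cores": []})
    for n in graph["nodes"]:
        md = n["metadata"]
        if md["type"] == "node":
            by_rank[md["rank"]]["node"] = md["name"]
        elif md["type"] == "core":
            by_rank[md["rank"]]["cores"].append(md["id"])
    rlite = []
    for rank, info in sorted(by_rank.items()):
        cores = sorted(info["cores"])
        # Naive idset encoding: assumes the core ids form one contiguous range
        ids = "%d-%d" % (cores[0], cores[-1]) if len(cores) > 1 else str(cores[0])
        rlite.append({"rank": str(rank), "node": info["node"],
                      "children": {"core": ids}})
    return rlite

# Subset of the JGF vertices from the Rv1 example above
graph = {"nodes": [
    {"id": "1", "metadata": {"type": "node", "name": "428d6d454f60", "rank": 0, "id": -1}},
    {"id": "7", "metadata": {"type": "core", "name": "core2", "rank": 0, "id": 2}},
    {"id": "9", "metadata": {"type": "core", "name": "core3", "rank": 0, "id": 3}},
]}
print(json.dumps(jgf_to_rlite(graph)))
```

For this input the emitted entry matches the `R_lite` shown in the Rv1 example: rank "0", node "428d6d454f60", cores "2-3".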
Yes, I thought this was the main merit (scalability) and I like this direction!
Very interesting idea. Maybe one way to view this is: each rank creates a mini hwloc-like document for each task slot so that a nested rank can find its resource portion from its instance-wide |
This is also somewhat going back to our original question of why a scheduler needs to know about execution targets. Ideally we wanted the scheduler to find the overall resource shape, with executors taking the intersection between this shape and the resources they (locally) manage to decide whether they should execute tasks or not. Short of that, we created the But if we make schedulers execution-target agnostic, we may need some other module or library to do this intersection. |
This is reasonable and should be able to support short- to mid-term use cases. The emitter logic within Fluxion does a depth-first visit on a graph, which is equivalent to a tree depth-first visit when the graph topology is a tree. Just think about how flux-core will detect and handle the case where the topology of the JGF is more than a tree, e.g. a DAG, which can be generated by Fluxion or manually crafted. |
(Or maybe this is Maybe one way to make progress is to flesh out some details about this mini document so that it can be "the key" to locate the corresponding subgraph within the overall instance RV2 graph. |
I don't think schedulers can be execution target agnostic at this point. This is because the flux-core
Because the flux-core Perhaps when (corollary: if a misconfigured execution target is brought online, the matching scheme won't work since by definition the |
Lots of good conversation in here. I attempted to skim and the highlights that I caught are
There still seems to be an open question on how to handle execution targets. One suggestion was to add a property to each vertex called "exec_target" (similar to "rank") that would be rewritten by a job-shell plugin before being passed to child instances: #237 (comment) and #237 (comment). There was also a suggestion that the execution target info could be its own graph: #237 (comment). One conclusion did seem to be that schedulers cannot be execution target agnostic: #237 (comment) At this point, my suggestion would be that we close this issue and spin out the remaining TODOs into separate issues:
Any other TODOs or thoughts? |
This was discussed at flux-framework/flux-core#2908 (comment) where a need for unifying static config, R and other sources (hwloc and vendor specific resource discovery services) into a canonical resource representation like JSON Graph Format (JGF) was expressed.
From @grondo:
General worries about attempting to design "do everything" formats there.
But the goal is lofty and he is willing.
From @dongahn: