
Canonical representation of resources #237

Open
dongahn opened this issue Apr 24, 2020 · 61 comments

Comments

@dongahn
Member

dongahn commented Apr 24, 2020

This was discussed at flux-framework/flux-core#2908 (comment) where a need for unifying static config, R and other sources (hwloc and vendor specific resource discovery services) into a canonical resource representation like JSON Graph Format (JGF) was expressed.

From @grondo:
General worries about attempting to design "do everything" formats there.

It typically ends up making things more difficult for the simple cases, and inevitably impossible for some complex case you didn't at first consider. The idea of parsing global XML to translate it to JGF just to read "You have 4 cores" seems like a lot of churn for a high throughput case as an example.

But the goal is lofty and he is willing.

From @dongahn:

As I see where we are headed at the high end, more complex cases will come our way much quicker than you would think (e.g., multi-tiered storage support, etc.).
Ways to statically configure a system will also have to change (towards higher complexity), and there will likely be multiple ways.
Also very likely, we will have to deal with different ways to populate R (now hwloc; but later, vendor-specific external services to discover global storage resources...).
Yet we have to advance not only flux-core but also other components to keep abreast of these changes.
It seemed this was too high a level of complexity to deal with in an ad-hoc fashion.
Now, having the canonical jobspec was very helpful for making progress at different paces between flux-core and -sched, and it feels like we can benefit from a similar arrangement: have a full-blown target representation first and slowly build up partial implementations.
Also, we have lots of experience with JGF from multiple efforts around it. It felt like it makes sense to leverage those as well.

@grondo
Contributor

grondo commented Apr 24, 2020

As a start, let's propose a very simple JGF version of R and do a straw man integration with something like simple-sched. Since R would no longer be human readable, we'd need to develop some tools to display and operate on R as well...

@dongahn
Member Author

dongahn commented Apr 24, 2020

My initial proposal would be just to use the format under scheduling key of https://github.com/flux-framework/rfc/blob/master/spec_20.rst as our canonical format.

I will post a simple example that contains cluster, node and cores as currently used by flux-core. This limited form can serve as V1 of this canonical format.

flux-sched also has the code to parse it so we may borrow from that code for the straw man.

@dongahn
Member Author

dongahn commented May 7, 2020

We discussed this a bit at today's meeting:

Is the canonical representation of R sufficiently expressive to be targeted by various writers/generators? A writer/generator will produce an R and expect that the rest of the system will more or less just work:

  • hwloc -> R
  • configuration file -> R
  • external advanced auto-discovery (like Cray end points) -> R

Then, other services will be driven off of the R. Many will use manipulation libraries, though.

A comment was made that JGF is likely to work, but a harder part will be how to map execution targets to R.

@grondo
Contributor

grondo commented May 17, 2020

@dongahn, are there instructions for having flux-sched generate R for jobs that contains the scheduling key and a JGF representation of resources? I'd like to generate samples of that format for study.

@dongahn
Member Author

dongahn commented May 17, 2020

@grondo:

It would have been a bit nicer since it has some fixes, but please look at: https://github.com/flux-framework/flux-sched/blob/2c3b9ec75139f408f75ac3963b77c087598c27d6/t/t1006-recovery-full.t#L28

The load option (match-format=rv1) should allow fluxion-resource to generate the full rv1 instead of rv1_nosched, which omits the JGF key.

I was planning to spend some time on this as well next week. So this is great timing.

@dongahn
Member Author

dongahn commented May 17, 2020

You should also be able to change the match emit format through resource's rc1 script:

FLUXION_RESOURCE_OPTIONS="match-format=rv1 load-whitelist=node,core,gpu"

@dongahn
Member Author

dongahn commented May 17, 2020

If you want to look at this for more advanced graph representations, please consider using resource-query as well. It has the same emit options as a CLI option.

    -F, --match-format=<simple|pretty_simple|jgf|rlite|rv1|rv1_nosched>
            Specify the emit format of the matched resource set.
            (default=simple).

Example GRUG files including things like multi-tiered storage configurations:

https://github.com/flux-framework/flux-sched/blob/master/t/t3020-resource-mtl2.t#L9

@grondo
Contributor

grondo commented May 18, 2020

Thanks! I was able to do:

$ flux module reload resource match-format=rv1

For my own benefit, here's an example rv1 for a 2-core allocation in a docker container

ƒ(s=1,d=0) fluxuser@428d6d454f60:~$ flux job info 646258360320 R | jq
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0",
        "node": "428d6d454f60",
        "children": {
          "core": "2-3"
        }
      }
    ],
    "starttime": 1589816042,
    "expiration": 1590420842
  },
  "scheduling": {
    "graph": {
      "nodes": [
        {
          "id": "7",
          "metadata": {
            "type": "core",
            "basename": "core",
            "name": "core2",
            "id": 2,
            "uniq_id": 7,
            "rank": 0,
            "exclusive": true,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60/socket0/core2"
            }
          }
        },
        {
          "id": "9",
          "metadata": {
            "type": "core",
            "basename": "core",
            "name": "core3",
            "id": 3,
            "uniq_id": 9,
            "rank": 0,
            "exclusive": true,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60/socket0/core3"
            }
          }
        },
        {
          "id": "2",
          "metadata": {
            "type": "socket",
            "basename": "socket",
            "name": "socket0",
            "id": 0,
            "uniq_id": 2,
            "rank": 0,
            "exclusive": false,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60/socket0"
            }
          }
        },
        {
          "id": "1",
          "metadata": {
            "type": "node",
            "basename": "428d6d454f60",
            "name": "428d6d454f60",
            "id": -1,
            "uniq_id": 1,
            "rank": 0,
            "exclusive": false,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0/428d6d454f60"
            }
          }
        },
        {
          "id": "0",
          "metadata": {
            "type": "cluster",
            "basename": "cluster",
            "name": "cluster0",
            "id": 0,
            "uniq_id": 0,
            "rank": -1,
            "exclusive": false,
            "unit": "",
            "size": 1,
            "paths": {
              "containment": "/cluster0"
            }
          }
        }
      ],
      "edges": [
        {
          "source": "2",
          "target": "7",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        },
        {
          "source": "2",
          "target": "9",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        },
        {
          "source": "1",
          "target": "2",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        },
        {
          "source": "0",
          "target": "1",
          "metadata": {
            "name": {
              "containment": "contains"
            }
          }
        }
      ]
    }
  }
}

@dongahn
Member Author

dongahn commented May 18, 2020

Great!

Note that the scheduling key has much more detailed information than the R_lite key, even for a two-core allocation. So for things like the high-throughput case, I still want to specialize the scheduler behavior and omit JGF. In general, though, compared to how hwloc represents its resources (i.e., exportable XML), this is lighter.

As a start, let's propose a very simple JGF version of R and do a straw man integration with something like simple-sched.

How does sched-simple use hwloc data? Would it be straightforward to create an interface such that it can turn this form into what sched-simple requires?

@grondo
Contributor

grondo commented May 18, 2020

How does sched-simple use hwloc data? Would it be straightforward to create an interface such that it can turn this form into what sched-simple requires?

sched-simple does not use hwloc data directly, but instead reads the aggregated information from resource.hwloc.by_rank, which is a flattened and very condensed list of resources (especially when all ranks are the same)

ƒ(s=64,d=0) fluxuser@16ea7ed726d5:~$ flux kvs get resource.hwloc.by_rank
{"[0-63]": {"Package": 1, "Core": 4, "PU": 4, "cpuset": "0-3"}}

Of course JGF has more than enough information in it to be used by the simple scheduler.

So things like high throughput case, I still want to specialize the scheduler behavior and omit JGF.

I thought we were proposing an Rv2 where the format was JGF?

@dongahn
Member Author

dongahn commented May 18, 2020

FYI --

The JGF reader code in flux-sched is https://github.com/flux-framework/flux-sched/blob/master/resource/readers/resource_reader_jgf.hpp, which reads this and updates the graph data store. It updates not only the spatial schema of vertices and edges but also the scheduler metadata, though.

@milroy has algorithms and code that can also grow the graph data store using a new JGF, which is the current topic for our cluster submission.

The emitted JGF can be fed into resource-query and used for further scheduling as well.

Taking the JGF portion from your example and storing it into ./resource.json:

ahn1@49674596c035:/usr/src/resource/utilities$ flux mini run --dry-run -n 1 hostname > jobspec.json
ahn1@49674596c035:/usr/src/resource/utilities$ ./resource-query -L resource.json -f jgf -F pretty_simple
INFO: Loading a matcher: CA
resource-query> match allocate jobspec.json
      ---cluster0[1:shared]
      ------428d6d454f60[1:shared]
      ---------socket0[1:shared]
      ------------core3[1:exclusive]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=ALLOCATED
INFO: SCHEDULED AT=Now
INFO: =============================

@dongahn
Member Author

dongahn commented May 18, 2020

ƒ(s=64,d=0) fluxuser@16ea7ed726d5:~$ flux kvs get resource.hwloc.by_rank
{"[0-63]": {"Package": 1, "Core": 4, "PU": 4, "cpuset": "0-3"}}

I see.

Is Package used by sched-simple though?
Is cpuset the set of core IDs or PU IDs?

What does this look like when each rank's resource set is different? We may not have this case yet though.

@dongahn
Member Author

dongahn commented May 18, 2020

I thought we were proposing an Rv2 where the format was JGF?

Where will Rv2 be used? Will the resource module use the resource set of its instance from it to satisfy its query? If that's the case, does the R of each job include the full JGF or only those jobs that will spawn a new flux instance need it?

@grondo
Contributor

grondo commented May 18, 2020

Is Package used by sched-simple though?

No, but by_rank is an aggregate of hwloc data, and other hwloc objects are summarized for informational purposes. I don't think this format was ever meant to be used long-term, though.

What does this look like when each rank's resource set is different? We may not have this case yet though.

There is an idset entry for each rank or set of ranks that has different summary information.
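For concreteness, here is a minimal Python sketch of how such a by_rank object might be expanded into a per-rank view. The idset parser is a toy stand-in for flux-core's libidset (simple ranges only, no stride), so treat this as illustrative rather than the real implementation:

```python
def parse_idset(s):
    """Parse a simple idset string like "[0-63]" or "0-3,7" into a list of
    ints. A minimal stand-in for flux-core's libidset (ranges only)."""
    ids = []
    for part in s.strip("[]").split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids

# The by_rank object from the comment above: 64 identical ranks share
# one summary entry keyed by an idset.
by_rank = {"[0-63]": {"Package": 1, "Core": 4, "PU": 4, "cpuset": "0-3"}}

# Expand into a per-rank view: every rank in the idset key gets the summary.
per_rank = {}
for idset, summary in by_rank.items():
    for rank in parse_idset(idset):
        per_rank[rank] = summary

print(len(per_rank), per_rank[0]["Core"])  # 64 4
```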

@grondo
Contributor

grondo commented May 18, 2020

Where will Rv2 be used? Will the resource module use the resource set of its instance from it to satisfy its query? If that's the case, does the R of each job include the full JGF or only those jobs that will spawn a new flux instance need it?

I thought Rv2 was going to be our next step towards a "canonical" resource set representation. I think you summarized it well in the comment above.

As the canonical representation, R would be the common resource set serialization used by all Flux components that transmit and share resource sets:

  • flux-core resource module would use R during "discovery", e.g. fetch R from parent or use hwloc data to generate an R.
  • Front end tools would ingest this R to e.g. display resources used by running or past jobs, query and display which resources are up/down/ and allocated/free.
  • All Flux schedulers would emit and ingest this R as a common interchange format.

Sorry if the above is obvious...

@grondo
Contributor

grondo commented May 18, 2020

One simple idea would be to allow something between R_lite and full JGF by allowing JGF "nodes" to represent multiple identical resources. E.g. something along the lines of:

{
  "nodes": [
    {
      "id": "0",
      "metadata": {
        "basename": "fluke",
        "exclusive": false,
        "ids": "60-63",
        "ranks": "[60-63]",
        "type": "node"
      }
    },
    {
      "id": "1",
      "metadata": {
        "basename": "core",
        "exclusive": true,
        "ids": "[0-3]",
        "size": 4,
        "type": "core"
      }
    }
  ],
  "edges": [
    {
      "metadata": {
        "name": {
          "containment": "contains"
        }
      },
      "source": "0",
      "target": "1"
    }
  ]
}

Would something like this be feasible I wonder?

Edit: I left out "cluster" and "socket" resources just to make the example readable. Also, I removed the "paths" from the serialization, because it seems like these can be computed when unpacking the serialized graph, so is it really necessary to duplicate them in the serialization format?
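To check feasibility, here is a hedged Python sketch of how a condensed node like the one above might fan out into concrete JGF vertices; the `"{id}.{i}"` unique-id scheme and the name derivation are my assumptions, not part of the proposal:

```python
def expand_ids(s):
    # Minimal range expansion for strings like "60-63" or "[0-3]" (no stride).
    out = []
    for part in s.strip("[]").split(","):
        lo, _, hi = part.partition("-")
        out.extend(range(int(lo), int(hi or lo) + 1))
    return out

# The condensed node from the example above, as a Python dict.
condensed = {
    "id": "0",
    "metadata": {"basename": "fluke", "exclusive": False,
                 "ids": "60-63", "ranks": "[60-63]", "type": "node"},
}

# One condensed node expands into one concrete vertex per id; ranks are
# popped in parallel, so both idsets must have the same cardinality.
ids = expand_ids(condensed["metadata"]["ids"])
ranks = expand_ids(condensed["metadata"]["ranks"])
assert len(ids) == len(ranks)

vertices = [
    {"id": f"{condensed['id']}.{i}",   # hypothetical unique-id scheme
     "metadata": {"type": "node", "basename": "fluke",
                  "name": f"fluke{i}", "id": i, "rank": r,
                  "exclusive": False}}
    for i, r in zip(ids, ranks)
]
print(len(vertices), vertices[0]["metadata"]["name"])  # 4 fluke60
```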

@dongahn
Member Author

dongahn commented May 18, 2020

Yes, that's one example of a compression scheme: flux-framework/flux-sched#526

One thing that I don't know is whether applying ad-hoc compression to the representation itself would condense RV2 better, or whether applying general compression to the 'canonical' representation would give better results.

@grondo
Contributor

grondo commented May 18, 2020

I would say both would probably be best. Even if you use general compression (by this I assume you mean something like gzip), there is some benefit to ad-hoc compression by decreasing the size of the JSON object ingested into the JSON parser...

@garlick
Member

garlick commented May 18, 2020

This kind of "modified JGF" sounds pretty appealing to me.

We do already lz4 compress KVS data on the back end, so at least the KVS growth would be mitigated somewhat.

@dongahn
Member Author

dongahn commented May 18, 2020

I like this direction. Just as food for thought:

In terms of serialization and deserialization needs:

1) RV2 <--> JGF (un-condensed) <--> graph-like object to query and modify

Depending on the implementation, the uncondensed JGF can be omitted, which could very well be what resource would do.

2) RV2 <--> graph-like object to query and modify

I think the trade-off space is

  • functionality: can we do serialization/deserialization across all these transformations without loss of information, and how easy or difficult is it to do this for all use cases?
  • overheads: savings in KVS storage and communication payloads vs. serialization/deserialization costs.

Probably the serialization/deserialization costs would be small, so IMHO this comes down to KVS storage and communication payloads vs. software complexity.

Maybe we should play with 1) and 2) a bit to make more progress. I have to think that for simple cases this transformation would be straightforward (as I already did something similar for R_lite), but I don't know whether it will be straightforward for more complex cases.

In terms of loss of information, the compressed RV2 won't give the uncondensed JGF's unique vertex/edge IDs. I don't know if that's detrimental or not. I need to think some more about whether there could be some critical information that cannot be captured in the condensed form.

@grondo
Contributor

grondo commented May 18, 2020

In terms of loss of information, the compressed RV2 won't give the uncondensed JGF's unique vertex/edge IDs. I don't know if that's detrimental or not. I need to think some more about whether there could be some critical information that cannot be captured in the condensed form.

I had thought about this, but since JGF node "id" is also a string, could this be replaced with an idset as well?

Depending on the implementation, the uncondensed JGF can be omitted, which could very well be what resource would do.

I think I'm still a bit lost. If the implementation of RV2 is JGF, I'm not sure what you mean by omitting it. Are you considering one option where JGF remains an optional part of R?

@dongahn
Member Author

dongahn commented May 18, 2020

I had thought about this, but since JGF node "id" is also a string, could this be replaced with an idset as well?

Yes, we can do this. But because the id sequence will not be the same as the resource id sequence (e.g., core[0-35]), the idset will not compress well.

@grondo
Contributor

grondo commented May 18, 2020

Yes, we can do this. But because the id sequence will not be the same as the resource id sequence (e.g., core[0-35]), the idset will not compress well.

Oh yeah, and this would only work for resources at the highest level in the tree, for nodes [0-15] sharing child sockets [0-1] there are actually 32 unique socket resources, not just 2.

@grondo
Contributor

grondo commented May 18, 2020

Could the containment "path" be used as a stand-in for a unique identifier for all resources? This could be computed after a compressed JGF is expanded.

One of the benefits of the unique identifier is so that an R used in a sub-instance several levels deep within the Flux instance hierarchy can relate its resources directly to any of its parents, including the original system instance. At first we had assigned uuids to each resource to enable this, but it seems like the containment path like /cluster0/node8/socket0/core1 uniquely identifies resources, as long as interior resource nodes are never pruned when creating R for jobs.
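As a quick illustration of that idea, here is a Python sketch that recomputes containment paths by walking "contains" edges from the root after expansion, so paths need not be stored in the serialized form. The graph content is illustrative, shaped after the rv1 example earlier in the thread:

```python
# Toy graph: node metadata and (parent, child) containment edges, following
# the shape of the rv1 JGF example above (content is illustrative).
nodes = {
    "0": {"type": "cluster", "name": "cluster0"},
    "1": {"type": "node", "name": "node8"},
    "2": {"type": "socket", "name": "socket0"},
    "3": {"type": "core", "name": "core1"},
}
edges = [("0", "1"), ("1", "2"), ("2", "3")]

# Build an adjacency map of containment children.
children = {}
for parent, child in edges:
    children.setdefault(parent, []).append(child)

# Walk down from the root, deriving each vertex's containment path from
# its parent's path plus its own name.
paths = {"0": "/" + nodes["0"]["name"]}
stack = ["0"]
while stack:
    v = stack.pop()
    for c in children.get(v, []):
        paths[c] = paths[v] + "/" + nodes[c]["name"]
        stack.append(c)

print(paths["3"])  # /cluster0/node8/socket0/core1
```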

@dongahn
Member Author

dongahn commented May 18, 2020

Oh yeah, and this would only work for resources at the highest level in the tree, for nodes [0-15] sharing child sockets [0-1] there are actually 32 unique socket resources, not just 2.

I think, in general, you can choose only a single compression criterion (like a resource's local id, e.g., core[0-35]) at each level of the resource hierarchy, and if a resource has a per-resource field that cannot be compressed with that same criterion (e.g., uniq_id, uuid, properties, whatever), you can't include it in the condensed JGF (or you have to make the condensed node more fine-grained).

So we have to think about the loss of information and see if that's okay or not...

@dongahn
Member Author

dongahn commented May 18, 2020

Could the containment "path" be used as a stand-in for a unique identifier for all resources? This could be computed after a compressed JGF is expanded.

Oh yeah, this should be possible!

One of the benefits of the unique identifier is so that an R used in a sub-instance several levels deep within the Flux instance hierarchy can relate its resources directly to any of its parents, including the original system instance. At first we had assigned uuids to each resource to enable this, but it seems like the containment path like /cluster0/node8/socket0/core1 uniquely identifies resources, as long as interior resource nodes are never pruned when creating R for jobs.

Agreed.

@dongahn
Member Author

dongahn commented May 18, 2020

(or make the condensed node more fine-grained)

One example where this makes sense is like Corona that will have two different types of nodes (one with 4 GPUs vs. 8 GPUs).

@dongahn
Member Author

dongahn commented May 18, 2020

I think I'm still a bit lost. If the implementation of RV2 is JGF, I'm not sure what you mean by omitting it. Are you considering one option where JGF remains an optional part of R?

I am talking about a phase where the proposed condensed JGF will be translated into the original JGF and vice versa.

For Fluxion, that may be the first step I may want to take.

Another example could be creating RV2 from an external source like Cray endpoints.

You may first want to collect the individual resource info from the external source, dump it into uncondensed JGF, and then process it into the proposed "condensed" RV2.

@dongahn
Member Author

dongahn commented May 18, 2020

I think, in general, you can choose only a single compression criterion (like a resource's local id, e.g., core[0-35]) at each level of the resource hierarchy, and if a resource has a per-resource field that cannot be compressed with that same criterion (e.g., uniq_id, uuid, properties, whatever), you can't include it in the condensed JGF (or you have to make the condensed node more fine-grained).

Similarly,

{
  "id": "0",
  "metadata": {
    "basename": "fluke",
    "exclusive": false,
    "ids": "60-63",
    "ranks": "[60-63]",
    "type": "node"
  }
}

I think ids and ranks in general cannot be condensed cleanly this way?

@grondo
Contributor

grondo commented May 18, 2020

I think ids and ranks in general cannot be condensed cleanly this way?

If there are the same number of values for each key, then you can condense, I would assume, though perhaps not cleanly. You would have to "condense" on a primary key, say "ids", then have some standard way of generating the other condensed keys based on either the index or the value of the primary key.

For your example above, the idsets for ids and ranks would be required to have the same size, and during expansion, as you "pop" each id you would pop its rank from the ranks set.

That reminds me that idsets can't actually be used here, since we'd need a list.

@grondo
Contributor

grondo commented May 18, 2020

FWIW, when I gave some thoughts to it (flux-framework/flux-sched#526 (comment)), an insight I got was -- it would be best if other keys can be expressed as some regular function of the primary key...

Oh, that would be ideal!
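A toy Python sketch of that ideal case, where the secondary keys are a regular function of the primary key, so only the primary idset and the function parameters would need to be serialized. The `cores_per_node` parameterization is purely illustrative:

```python
# Assumption for illustration: 16 cores spread evenly over 4 nodes, so
# rank and per-node local id are regular functions of the global core id.
cores_per_node = 4
core_ids = range(0, 16)          # primary key: global core ids 0-15

expanded = [
    {"type": "core", "id": cid,
     "rank": cid // cores_per_node,        # rank derived, not stored
     "local_id": cid % cores_per_node}     # per-node core index, derived
    for cid in core_ids
]
print(expanded[5])  # {'type': 'core', 'id': 5, 'rank': 1, 'local_id': 1}
```

Only when the mapping is irregular (e.g., heterogeneous nodes) would an explicit parallel list still be required, as noted above.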

@dongahn
Member Author

dongahn commented May 18, 2020

One thing I'm clear on after this discussion today though,

Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...

@garlick
Member

garlick commented May 18, 2020

Just a reminder that flux-core already depends on liblz4. I'm not sure it's clear it will be a win to trade computation/extra complexity for message size, but if we do go that way, I prefer we not take on another compression library dependency. lz4 does pretty well anyway:

$ lz4c jtest.json.txt
Compressed filename will be : jtest.json.txt.lz4 
Compressed 1587 bytes into 427 bytes ==> 26.91%                                
$ lz4c prop.json.txt
Compressed filename will be : prop.json.txt.lz4 
Compressed 692 bytes into 341 bytes ==> 49.28%   

(3.71x and 2.02x respectively; with lz4c -9, I get 4.42x and 2.26x)
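In the same spirit, a small Python experiment using stdlib zlib as a stand-in for lz4: the repetitive fully expanded form compresses much harder than the already-condensed form, narrowing the raw-size gap. The JSON payloads here are made up for illustration:

```python
import json
import zlib

# Illustrative payloads: "full" mimics a repetitive expanded-JGF node list,
# "condensed" mimics a single idset-compressed node.
full = json.dumps([
    {"id": str(i), "metadata": {"type": "core", "basename": "core",
                                "name": f"core{i}", "rank": 0}}
    for i in range(64)
])
condensed = json.dumps(
    {"id": "0", "metadata": {"type": "core", "ids": "0-63", "rank": 0}}
)

# Compare raw vs. compressed sizes (zlib level 9 as a rough proxy for lz4c -9).
for label, text in (("full", full), ("condensed", condensed)):
    raw = len(text.encode())
    packed = len(zlib.compress(text.encode(), 9))
    print(f"{label}: {raw} -> {packed} bytes")
```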

@garlick
Member

garlick commented May 18, 2020

Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...

@dongahn could you recompile this sentence with different optimization please? :-)

@grondo
Contributor

grondo commented May 18, 2020

Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...

Here's my initial thought, though I don't claim to have the right answer. Fully specified JGF should be the default canonical representation. However, a simplified, condensed version (also valid JGF) should be allowed where there is no information loss. (Something simple and obvious like above).

@dongahn
Member Author

dongahn commented May 18, 2020

Should our canonical resource set representation would be the original uncondensed JGF and the various condensing optimization should be a raw storage or data layout or the condensed representation itself should our canonical representation...

@dongahn could you recompile this sentence with different optimization please? :-)

Sorry. I guess the point I was trying to make is: what should our canonical representation be -- the condensed form or the un-condensed form? A compiler analogy: compilers have a canonical intermediate representation (IR), which then gets compiled down to machine code (the actual storage format).

@dongahn
Member Author

dongahn commented May 18, 2020

Just a reminder that flux-core already depends on liblz4. I'm not sure it's clear it will be a win to trade computation/extra complexity for message size, but if we do go that way, I prefer we not take on another compression library dependency. lz4 does pretty well anyway:

That's fine. The reason for the testing was just to see the relative advantages of the two forms when "compressed".

Using your example:

The condensed form is 2.29x better in raw size (1587/692). But when compressed with lz4c, the condensed form is only 1.25x better. Since we will likely keep the object compressed, I wasn't sure if this 25% was worth the extra complexity. But like I said, the relative advantages could change at larger scale, hence my comment:

We may want to continue to test our proposed scheme to check the gains.

Hope this makes better sense.

@dongahn
Member Author

dongahn commented May 18, 2020

Here's my initial thought, though I don't claim to have the right answer. Fully specified JGF should be the default canonical representation. However, a simplified, condensed version (also valid JGF) should be allowed where there is no information loss. (Something simple and obvious like above).

I don't have the right answer here either. BTW, don't get me wrong though. I'm asking all these questions to think this through. Hopefully we can settle on something really cool in the end :-).

@dongahn
Member Author

dongahn commented May 24, 2020

Here's my initial thought, though I don't claim to have the right answer. Fully specified JGF should be the default canonical representation. However, a simplified, condensed version (also valid JGF) should be allowed where there is no information loss. (Something simple and obvious like above).

@grondo: Just to confirm, I like the hybrid approach, as I said in the last meeting. In the compiler world, there is a difference between canonical and non-canonical representations, but we don't have to be too pedantic here.

In particular at the system instance, it should be straightforward to emit the "condensed" form either from a resource configuration spec (or other external sources).

At this point, I am unclear how easy or difficult it will be for Fluxion to emit the condensed form instead of the fully concretized JGF. But compression and such was a task we needed to do anyway, so having a target should be helpful. By making fully specified JGF the default representation, we will be able to take a phased approach and learn how to do this properly.

Two things:

  1. We may need to specify edge types. The containment edge in #237 (comment) is a multiplicative edge, which means each specified parent vertex has an edge to each specified child vertex. But there are also cases where we need an associative edge: via such an edge, a specified vertex is associated with another specified vertex.

The proposed form is very similar to GRUG (https://github.com/flux-framework/flux-sched/blob/master/resource/utilities/README.md#recipe-graph-definition). I used that format to specify a recipe to generate a fully concretized JGF. In fact, the first way to support RV2 from the system instance could be to use the new format as another generation recipe.

  2. This issue isn't specific to RV2. But did you think about how to remap RV1 to the execution targets in a nested instance namespace?

@grondo
Contributor

grondo commented May 26, 2020

This issue isn't specific to RV2. But did you think about how to remap RV1 to the execution targets in a nested instance name space?

My only idea here is to have the exec system emit an R that is annotated with the assigned task slots. A child instance can reasonably assume that task slot ids directly map to broker ranks. (Actually writing that, maybe it is the job shell that would need to annotate R?)

@dongahn
Member Author

dongahn commented May 27, 2020

My only idea here is to have the exec system emit an R that is annotated with the assigned task slots. A child instance can reasonably assume that task slot ids directly map to broker ranks. (Actually writing that, maybe it is the job shell that would need to annotate R?)

Great idea.

If we were to go this route, I think we should consider explicitly formalizing the relationship between the task slot ID space of the parent instance and the execution target ID space of a nested instance (augmenting some RFC).

The other idea I was thinking about was for the nested instance to go through a "remap" step by comparing its overall RV2 with per execution target hwloc info. This would be similar to what you might do at the system instance.

But if the relationship between the parent's task slot ID space and the nested instance's execution target ID space can be made explicit and formalized, that would lead to a much more efficient implementation, I think.

@dongahn
Member Author

dongahn commented May 27, 2020

How easy or difficult would it be to do this annotation directly in the condensed format?

@dongahn
Member Author

dongahn commented May 27, 2020

@grondo:

It feels like we have had some good verbal exchanges so far; maybe we can start a (simplified) strawman RFC for RV2 and do some prototyping to test its viability.

My take away so far:

  1. The fully concretized JGF is our default canonical resource set, and we extend it to support a condensed form (like you proposed up there) as well.
  2. Investigate ways to emit hwloc info into RV2: our scheduler already knows how to do this for the fully concretized JGF, so we only need to establish feasibility for the condensed form.
  3. Investigate ways for Fluxion to emit the condensed format (this would require multiple steps).
  4. Investigate ways to rewrite an RV2 object with slot IDs for nested instance support.
  5. @SteVwonder may want to test whether we can emit externally gathered, multi-tiered-storage-enabled system configurations into the condensed RV2 (I have some ideas about how to formulate some good tests).

@grondo
Copy link
Contributor

grondo commented May 27, 2020

The other idea I was thinking about was for the nested instance to go through a "remap" step by comparing its overall RV2 with per execution target hwloc info. This would be similar to what you might do at the system instance.

This might be best, but I wasn't sure if it was a tractable problem! If we have a way to do it, then like you said the core resource module could annotate execution targets at instance startup in either the case of a system instance or child instance.

TBH, I'm not sure exactly the best way to add the execution target annotation to R yet. Would it be best to add a property to an existing resource (vertex), or would it make more sense to treat execution target as a "grouping" vertex (i.e. a non-resource vertex)?

@grondo
Copy link
Contributor

grondo commented May 27, 2020

It feels like we have some good verbal exchanges so far, and maybe we can start a (simplified) strawman RFC for RV2 and doing some prototyping to test its viability.

Great. Now that the most recent sched-simple PR is in I'm going to try to make some progress on Rv2.

In flux-core, we have a lot of users of the R_lite format that will need to transition to Rv2. My idea is to prototype a C API reader of some form of Rv2, and then add a function that can convert to R_lite as a transition tool.

Then we can begin to add functionality required by flux-core components (resource, job-exec, job-info and sched-simple modules, as well as the job shell), and transition these components to the new library, allowing the underlying R format to change or be updated without breaking core.

Once this is working we can then update the resource module to use Rv2 in the acquire protocol, which would allow us to break our dependence on all ranks being "online" before the acquire first response.
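For reference, the transitional target of that converter would be something along these lines (a sketch based on the current R version 1 / R_lite form; exact fields may differ as the format evolves):

```json
{
  "version": 1,
  "execution": {
    "R_lite": [
      { "rank": "0-3", "children": { "core": "0-7" } }
    ]
  }
}
```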

@grondo
Copy link
Contributor

grondo commented May 27, 2020

@dongahn, do you have any suggestions on how to do a task slot/execution target annotation to a resource set? It seems like you need some way to group resources: either a tag on every resource in the slot, or would it be better to allow some kind of virtual resource group vertex (similar to how slot is specified in jobspec)?
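To make the two options concrete, here are hedged JGF-style sketches (the metadata keys are hypothetical, not settled format):

```json
{
  "option_a_tag_per_vertex": {
    "id": "core3",
    "metadata": { "type": "core", "id": 3, "slot": 0 }
  },
  "option_b_group_vertex": {
    "id": "slot0",
    "metadata": { "type": "slot", "id": 0 }
  }
}
```

In option A every resource vertex in the slot carries a `slot` tag; in option B the virtual `slot0` vertex would gain containment edges to `core3` etc., grouping them without tagging each one.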

@dongahn
Copy link
Member Author

dongahn commented May 27, 2020

This might be best, but I wasn't sure if it was a tractable problem!

Functionality-wise this is tractable. Scalability-wise, we may need more cleverness, I think.

We have a functionality proof of concept in our old scheduler (version 0.7). I called it link since we link a rankless RDL-generated resource object to a rank by matching the resource signature between RDL and hwloc objects. (I used simple match criteria, but this can be improved.) But considering nested systems, this should be called a map or remap operation.

This isn't that scalable because only one process does this operation.
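A minimal sketch of that "link"/remap idea (all names hypothetical): match each rankless resource vertex against per-rank hwloc summaries by comparing a simple resource signature, here just the counts of child resource types.

```python
def signature(node):
    """Simple resource signature: sorted (type, count) pairs of children."""
    counts = {}
    for child in node.get("children", []):
        counts[child["type"]] = counts.get(child["type"], 0) + 1
    return tuple(sorted(counts.items()))


def link(rankless_nodes, hwloc_by_rank):
    """Assign a rank to each rankless node whose signature matches an
    unused per-rank hwloc summary.  Returns {node id: rank}."""
    mapping = {}
    unused = dict(hwloc_by_rank)
    for node in rankless_nodes:
        sig = signature(node)
        for rank, hw in list(unused.items()):
            if signature(hw) == sig:
                mapping[node["id"]] = rank
                del unused[rank]
                break
    return mapping
```

A richer signature (memory, GPUs, NUMA layout) would improve match quality, but as noted below the single-process nature is the real scalability limit.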

@dongahn
Copy link
Member Author

dongahn commented May 27, 2020

TBH, I'm not sure exactly the best way to add the execution target annotation to R yet. Would it be best to add a property to an existing resource (vertex), or would it make more sense to treat execution target as a "grouping" vertex (i.e. non-resource vertex).

I was imagining rewriting the "rank" fields for every resource vertex managed by the rank. I'm a bit hesitant to introduce non-physical vertices into the graph as I was burned once by dealing with matching a non-physical "slot" against a graph. I know you are talking about a different thing though.

@grondo
Copy link
Contributor

grondo commented May 27, 2020

This isn't that scalable because only one process does this operation.

Job shell has to map tasks onto task slots anyway, since it is going to start the tasks, so maybe it makes sense to do this mapping there.

Perhaps the annotation could be done in its own graph (other than the containment or default graph in the JGF). As such, it could actually be kept as a separate document, or at least easily added/extracted from an existing R. These annotation graphs might just consist of new edges. I guess the drawback would be a potentially large number of duplicated edges.
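One hedged sketch of such a standalone annotation graph, using JGF's multi-graph support (edge targets reference vertex ids in the existing containment graph; the field names are illustrative):

```json
{
  "graph": {
    "id": "task-slots",
    "nodes": [
      { "id": "slot0", "metadata": { "type": "slot" } }
    ],
    "edges": [
      { "source": "slot0", "target": "core3" },
      { "source": "slot0", "target": "core4" }
    ]
  }
}
```

Kept as a separate document, this could be merged into or stripped from an existing R without touching the containment graph itself.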

@dongahn
Copy link
Member Author

dongahn commented May 27, 2020

In flux-core, we have a lot of users of the R_lite format that will need to transition to Rv2. My idea is to prototype a C API reader of some form of Rv2, and then add a function that can convert to R_lite as a transition tool.

FWIW, Fluxion has its R_lite emitter, going from a fully concretized graph object to R_lite.

Are you thinking about unpacking the condensed JGF format into a graph object and emitting R_lite? Or not introducing graph code at all?

There has got to be a good lightweight C-based graph library as well.

@grondo
Copy link
Contributor

grondo commented May 27, 2020

I was imagining rewriting the "rank" fields for every resource vertex managed by the rank.

Yes, it is probably most straightforward to just have a numeric identifier propagated to all vertices that are managed by that execution target. We may want to update terminology to refer to exec_target instead of rank.

I wonder if it makes sense for a resource vertex to list multiple exec_target IDs. E.g. the root vertex of a hierarchy would list all execution target IDs, the next level (say "switch") would list all execution target IDs below the switch, and so on. This might make searching for a certain execution target easier, as well as fetching something like the R_local for a given exec ID.
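A minimal sketch of that aggregation (the `exec_targets` key and tree shape are assumptions): propagate exec_target IDs bottom-up so each vertex lists every ID beneath it.

```python
def annotate(node):
    """Annotate `node` (and its subtree) with the sorted list of all
    exec_target IDs found at or below it; return them as a set."""
    ids = set(node.get("exec_targets", []))
    for child in node.get("children", []):
        ids |= annotate(child)
    node["exec_targets"] = sorted(ids)
    return ids
```

With this in place, finding the subgraph for a given exec ID is a simple guided descent: at each vertex, recurse only into children whose `exec_targets` list contains the ID.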

@grondo
Copy link
Contributor

grondo commented May 27, 2020

Are you thinking about unpacking the condensed JGF format into a graph object and emit R_Lite? Or not introducing a graph code at all?

I think in flux-core we would like to ignore graph code and focus only on what you term the "containment" hierarchy. So we would unpack the JGF containment hierarchy into a simple tree and then emit R_lite from that.

This is meant to be a bridge approach to get us one step forward.
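A hedged sketch of that bridge: walk a JGF containment graph as a tree and emit an R_lite-style list of `{rank, children}` entries. The vertex/edge shape follows JGF (`id`, `metadata`, `source`, `target`), but the metadata keys (`type`, `rank`, `id`) are assumptions, and core ids are kept as plain lists rather than the idset strings real R_lite uses.

```python
from collections import defaultdict

def jgf_to_rlite(jgf):
    """Depth-first walk of a JGF containment tree, grouping cores by rank."""
    nodes = {n["id"]: n for n in jgf["graph"]["nodes"]}
    children = defaultdict(list)
    for e in jgf["graph"]["edges"]:
        children[e["source"]].append(e["target"])

    by_rank = defaultdict(lambda: defaultdict(list))

    def visit(vid):
        md = nodes[vid]["metadata"]
        if md["type"] == "core":
            by_rank[md["rank"]]["core"].append(md["id"])
        for c in children[vid]:
            visit(c)

    # Roots are vertices that are never an edge target.
    roots = set(nodes) - {e["target"] for e in jgf["graph"]["edges"]}
    for r in roots:
        visit(r)

    return [{"rank": rank, "children": dict(kids)}
            for rank, kids in sorted(by_rank.items())]
```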

@dongahn
Copy link
Member Author

dongahn commented May 27, 2020

Job shell has to map tasks onto task slots anyway, since it is going to start the tasks, so maybe it makes sense to do this mapping there.

Yes, I thought this was the main merit (scalability) and I like this direction!

Perhaps the annotation could be done in its own graph (other than the containment or default graph in the JGF). As such, it could actually be kept as a separate document, or at least easily added/extracted from an existing R. These annotation graphs might just consist of new edges. I guess the drawback would be a potentially large number of duplicated edges.

Very interesting idea.

Maybe one way to view this is:

Each rank creates a mini hwloc-like document per task slot so that a nested rank can find its resource portion within its instance-wide RV2 object. If we can do this, we will be able to wean off of having to recursively fetch hwloc at every instance.

@dongahn
Copy link
Member Author

dongahn commented May 27, 2020

Each rank creates a mini hwloc-like document per task slot so that a nested rank can find its resource portion within its instance-wide RV2 object. If we can do this, we will be able to wean off of having to recursively fetch hwloc at every instance.

This also goes back somewhat to our original question of why a scheduler needs to know about execution targets. Ideally, the scheduler would determine the overall resource shape, and each executor would take the intersection between that shape and the resources it locally manages to decide whether it should execute tasks.

In short, we created the R_lite structure as a bridge so the scheduler can tell certain executors, "hey, run this."

But if we make schedulers execution-target agnostic, we may need some other module or library to do this intersection.

@dongahn
Copy link
Member Author

dongahn commented May 27, 2020

I think in flux-core we would like to ignore graph code and focus only on what you term the "containment" hierarchy. So we would unpack the JGF containment hierarchy into a simple tree and then emit R_lite from that.

This is meant to be a bridge approach to get us one step forward.

This is reasonable and should be able to support short- to mid-term use cases. The emitter logic within Fluxion does a depth-first visit on a graph, which is equivalent to a tree depth-first visit when the graph topology is a tree.

Just think about how flux-core will detect and handle the case where the topology of the JGF is more than a tree (e.g., a DAG), which can be generated by Fluxion or manually crafted.
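One hedged sketch of such a check (the JGF field names are standard; everything else is illustrative): a containment graph is a simple tree only if no vertex has more than one parent, there is exactly one root, and the edge count is |V| - 1, which together also rule out cycles.

```python
from collections import Counter

def is_tree(jgf):
    """Return True iff the JGF graph is a single rooted tree."""
    nodes = {n["id"] for n in jgf["graph"]["nodes"]}
    edges = jgf["graph"]["edges"]
    indeg = Counter(e["target"] for e in edges)
    if any(d > 1 for d in indeg.values()):   # a DAG "diamond" fails here
        return False
    roots = nodes - set(indeg)
    return len(roots) == 1 and len(edges) == len(nodes) - 1
```

flux-core could run a check like this up front and reject (or fall back on a different reader for) anything that is not a plain tree.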

@dongahn
Copy link
Member Author

dongahn commented May 27, 2020

a mini hwloc-like documentation

(Or maybe this is R_local).

Maybe one way to make progress is to flesh out some details about this mini document so that it can be "the key" to locate the corresponding subgraph within the overall instance RV2 graph.

@grondo
Copy link
Contributor

grondo commented May 28, 2020

But if we make schedulers execution-target agnostic, we may need some other module or library to do this intersection.

I don't think schedulers can be execution target agnostic at this point. This is because the flux-core resource module only monitors at the granularity of execution target, and thus reports resources "up" and "down" by execution target id.

(Or maybe this is R_local).

Maybe one way to make progress is to flesh out some details about this mini document so that it can be "the key" to locate the corresponding subgraph within the overall instance RV2 graph.

R_local is the subset of R owned by an execution target (or a job-shell or IMP process). As such, shouldn't it be in valid R format?

Because the flux-core resource module monitors at the execution target level, any scheduler internal representation of R will need to be execution target aware. I'm not sure there is a way to get around that at present.

Perhaps when resource presents acquired resources to schedulers, the execution targets are already labeled as described above, and the labeling is encoded in the R object read by the scheduler. In that case "rank remapping" is in the domain of the resource module itself. This is required anyway since resource will need to be able to compare configured resources to actual resources as an execution target is brought online.

(corollary: if a misconfigured execution target is brought online, the matching scheme won't work since by definition the R_local won't match any configured resource subset)

@SteVwonder
Copy link
Member

Lots of good conversation in here. I attempted to skim, and the highlights I caught are:

There still seems to be an open question on how to handle execution targets. One suggestion was to add a property to each vertex called "exec_target" (similar to "rank") that would be rewritten by a job-shell plugin before being passed to child instances: #237 (comment) and #237 (comment). There was also a suggestion that the execution target info could be its own graph: #237 (comment).

One conclusion did seem to be that schedulers cannot be execution target agnostic: #237 (comment)

At this point, my suggestion would be that we close this issue and spin out the remaining TODOs into separate issues:

  • draft RV2 format RFC
  • create transitional RV2->R_lite reader for flux-core
  • hwloc -> RV2 reader (for the resource module?)
  • close the loop on the execution target property discussion
  • continue developing the condensed format as needed
  • formulate tests for gathering multi-tiered storage system info into RV2

Any other TODOs or thoughts?
