-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRA in Kubevirt #331
base: main
Are you sure you want to change the base?
DRA in Kubevirt #331
Conversation
Signed-off-by: Ryan Hallisey <[email protected]>
Signed-off-by: Alay Patel <[email protected]>
Signed-off-by: Alay Patel <[email protected]>
Skipping CI for Draft Pull Request. |
14acb8b
to
bf19e3b
Compare
Signed-off-by: Alay Patel <[email protected]>
design-proposals/dra.md
Outdated
// Name of the GPU device as exposed by a device plugin | ||
Name string `json:"name"` | ||
// DeviceName is the name of the device provisioned by device-plugins | ||
DeviceName string `json:"deviceName,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between the 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alicefr This is not a new API being added, it is how the existing device-plugins work, see example here. The idea is that if DeviceName is populated, then its a device provisioned by device-plugins, it will be passed to pod.spec.container.requests
. Alternatively the Claim field below is not nil then its a DRA device, it will be passed to pod.spec.container.claims
.
We have an alternate API suggestion that will make this choice to the user much more obvious.
design-proposals/dra.md
Outdated
Name string `json:"name"` | ||
// DeviceName is the name of the device provisioned by device-plugins | ||
DeviceName string `json:"deviceName,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here
This design also assumes that the deviceName will be provided in the ClaimParameters, which requires the DRA drivers | ||
to have a ClaimParameters.spec.deviceName in their spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From this description you are distinguishing the name
and the deviceName
which also refer to my previous questions. Maybe it will be clearer to add the difference between the 2 at the beginning
1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml | ||
instead of environment variables as in the case of device-plugins | ||
|
||
# Alternate Designs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to prefer the first design since it follow the same schema as the volumes/disks and the network/interfaces where we have the section for the pod resource and how they are mapped to the VM
### Virt launcher changes | ||
|
||
1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml | ||
instead of environment variables as in the case of device-plugins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for avoiding the env variables, I like the status!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective, there is no need to store driver internal data in the public, user facing, VMI status.
This information is only required by the virt-launcher internals, no other kubevirt components benefit from this.
I fail to understand how polluting the VMI status is superior to providing specific data to the virt-launcher's compute container directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same way as we don't keep the PVC status in VMI we should keep DRA specific data out of the VMI status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective, there is no need to store driver internal data in the public, user facing, VMI status.
This information is only required by the virt-launcher internals, no other kubevirt components benefit from this.
I fail to understand how polluting the VMI status is superior to providing specific data to the virt-launcher's compute container directly.
This is in part to address an architectural difference between DPs and DRA:
- DPs pre-define a static resource which is usually identified by a device model: Eg:
nvidia.com/GP102GL_Tesla_P40
. A VMI spec would reference this name directly for its device and it gives easily inferrable information to the user that the VMI will run on a nvidia Tesla P40 GPU. - In DRA, this information is lost as the resource claim that we provide in our spec masks all details about what kind of device that claim is allocating. A resourceclaim is a dynamic object where you would need to look into its spec to know what piece of hardware you're getting. For the user, this means that they can no longer look into the vmi spec to immediately tell what their VMI is running on.
Given this regression in the user expectation (conscious or unconscious), we should provide atleast the bare-minimum identifiable information about a device like the hardware model etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same way as we don't keep the PVC status in VMI we should keep DRA specific data out of the VMI status.
In DRA, once a resource claim is allocated, you'll need to take the following steps to identify what device is allocated:
- look at the vmi spec to see the referenced claim (template or name) and request.
- look at virt-launcher pod status to match the claim from vmi spec and infer the resource claim k8s object
- look at the resource claim k8s object to find out the driver and device names for each request.
- look at the resource slice object for the VMI's node published by the corresponding driver and know all the relevant attributes needed
A PVC needs similar hoops to find the volume information but is a lot less complex than DRA. To tackle this problem, PVC information is currently posted onto the VMI status so consumers (users & kubevirt) can easily find the info about a volume. (See https://pkg.go.dev/kubevirt.io/api/core/v1#PersistentVolumeClaimInfo)
Given that DRA ResourceClaims are modeled after PVCs and we have precedence for publishing volume information in the VMI status to make visilibity/access easier, I believe we should follow the same approach for DRA devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vladikr the particular case that @varunrsekar mentioned in the comment was discussed in the meeting on 10/15 (recording) and documented in detail here: https://docs.google.com/document/d/1bQdLoxwSC1ILvyIb4ljSm5RZmsVpvqfWKIbN0KOBKsw/edit#heading=h.wjmanqk7qzd9
Additionally, we came across another use-case of why the information is needed in vmi-status:
Usecase Title: Debuggability of VMI in failed state with DRA device
- K8s has a feature of releasing the resource claims once the pod has gone to a terminal state(Failed or Succeeded). This is to allow for expensive resources to be used for other pods
- in case where VMI has failed and the virt-launcher has gone to a Failed state, what hardware was attempted to be provisioned for this VMI is lost
It is documented in depth here: https://docs.google.com/document/d/1bQdLoxwSC1ILvyIb4ljSm5RZmsVpvqfWKIbN0KOBKsw/edit#heading=h.3z09ita4dzuz
Please let use know if this answers the question about why extending the VMI status with this information is useful to the user
@alaypatel07 the proposal looks great. |
### Virt launcher changes | ||
|
||
1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml | ||
instead of environment variables as in the case of device-plugins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we populate the deviceStatus inormation with an external DRA driver?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alicefr as far as this proposal concerns, the assumption is that drivers publish information needed to generate the right domxml for the device. For instance in case of a gpu, it could either be pciAddress(pGPU), or mdev uuid (vGPU). The drivers need to publish this information in ResourceSlice as attributes https://pkg.go.dev/k8s.io/[email protected]/resource/v1alpha3#BasicDevice.
Virt-controller will find the device for it and put this attribute in the vmi status using the steps mentioned here: https://github.com/kubevirt/community/blob/30a2005d8ef72cbae0d8fb1d818861b14d8a5d88/design-proposals/dra.md#dra-api-for-reading-device-related-information
Virt-launcher will then read the attribute to generate the correct domxml
So there could be two cases here:
- either the device has standard attributes, like pciAddress and uuid which will be supported in tree or
- the device needs additional attributes, for which a sidecar will be needed to read that attribute from vmi status and define it in the domxml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned above, please provide more usecases for the sidecar scenario. I've provided an alternative take.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said in one of the above comments, will be updating the proposal to remove sidecar scenario due to lack of usecases, updated my previous comment to reflect it.
/cc Please add to the PR description that it's a continuation of #293. |
Answered it inline, the proposal will implement the conversion logic of standard attributes in vmi status (pciAddress and uuid) for other attributes, a extension in form of side car is needed. |
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Signed-off-by: Alay Patel <[email protected]>
design-proposals/dra.md
Outdated
The virt-launcher will have the logic of converting a GPU device into its corresponding domxml. For use-cases that are | ||
not handled in-tree, a sidecar container could be envisioned which will convert the information available in status to | ||
the corresponding domxml. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Could you please provide a specific use case for the proposed use of a sidecar? It would also help to understand why this configuration cannot be handled directly in the virt-launcher itself.
- Currently, a sidecar can't access the VMI object without changes to the virt-controller. How do you plan to differentiate between DRA-related sidecars and other sidecars?
- I assume that the proposed sidecar would rely on KubeVirt's existing hooking mechanism. It's worth noting that this mechanism is provided on a best-effort basis and is not fully supported. It uses gRPC with a defined API, and rather than adding DRA-specific data to the VMI status, a new API call could be integrated into the existing gRPC used by virt-launcher. This way, the DRA-specific data could be handled only between virt-launcher and the sidecar and avoid replication of the DRA data across multiple components...
- Building a solution based on the current hooking mechanism, or modifying the libvirt XML outside of virt-launcher, is not recommended. The libvirt XML is not a supported external API—only the KubeVirt API is. Please carefully consider how to maintain API compatibility between DRA, libvirt, and KubeVirt. If supporting a sidecar is necessary, we should develop a stable interface between compute and the DRA sidecar containers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After last week’s discussion and another offline conversation with @xpivarc, its looks like currently we do not have enough use-cases for allowing users to configure these devices in sidecar. It seems like we would like encourage building the support for these devices in the core, rather than allowing them to be configured by a sidecar, I will modify this proposal to change the language to reflect this
### Virt launcher changes | ||
|
||
1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml | ||
instead of environment variables as in the case of device-plugins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective, there is no need to store driver internal data in the public, user facing, VMI status.
This information is only required by the virt-launcher internals, no other kubevirt components benefit from this.
I fail to understand how polluting the VMI status is superior to providing specific data to the virt-launcher's compute container directly.
### Virt launcher changes | ||
|
||
1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml | ||
instead of environment variables as in the case of device-plugins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same way as we don't keep the PVC status in VMI we should keep DRA specific data out of the VMI status.
### Virt launcher changes | ||
|
||
1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml | ||
instead of environment variables as in the case of device-plugins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned above, please provide more usecases for the sidecar scenario. I've provided an alternative take.
design-proposals/dra.md
Outdated
deviceStatus: | ||
gpuStatuses: | ||
- deviceResourceClaimStatus: | ||
deviceAttributes: | ||
driverVersion: | ||
version: 1.0.0 | ||
index: | ||
int: 0 | ||
model: | ||
string: LATEST-GPU-MODEL | ||
uuid: | ||
string: gpu-8e942949-f10b-d871-09b0-ee0657e28f90 | ||
pciAddress: | ||
string: 0000:01:00.0 | ||
deviceName: gpu-0 | ||
resourceClaimName: virt-launcher-vmi-fedora-9bjwb-gpu-resource-claim-m4k28 | ||
name: pgpu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment, I am not convinced that storing this information in the VMI status is needed.
What other Kubevirt components require this information?
Why CDI cannot provide this info directly to the consumer container?
Signed-off-by: Alay Patel <[email protected]>
…lternatives Signed-off-by: Alay Patel <[email protected]>
@alicefr From our discussion in the meeting on 10/15 on lifecycle of resourceclaim and VM/VMI @varunrsekar and I looked at the nvme implementation and it seems that it was developed on an older version of the DRA apis (v1alpha2). With the newer version (k8s 1.31 and v1alpha3 of the ResourceClaim api), if the pod goes to failed state:
Here is the running demo from 1.31 cluster
This is further seen in the KEP: https://github.com/kubernetes/enhancements/pull/4709/files#diff-8fa237d276346416c2aafa209f721e43c9762f59cabec234eafafe01694de3eeR1221 I wonder if the use-cases that require the lifecycle of the ResourceClaim not attached to the lifecycle of the pod, be deferred to the future versions of DRA where it is possible for the allocation to persist after pod is deleted. |
/sig compute |
@alaypatel07 I think this can be taken out of draft/WIP |
@rthallisey done, I have couple of updates to be pushed from our last unconference sync up. Will plan on pushing those soon. |
@alaypatel07 sorry I missed the ping. I'm fine to tackle all the use case gradually. It needs just to be properly documented |
update apis based on unconference discussions with doc cleanups Signed-off-by: Varun Ramachandra Sekar <[email protected]> Co-authored-by: Varun Ramachandra Sekar <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
- name: pci-nvme-request-name | ||
deviceClassName: pci-nvme.kubevirt.io | ||
allocationMode: ExactCount | ||
count: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alaypatel07 why do we want to have a count for the device? Doesn't a claim represent a single device? Or do we want to model a group of devices a well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alicefr with the alpha3 API the count field has to be set since the default allocation mode is ExactCount
, ref: https://github.com/kubernetes/api/blob/b7783abfc99c11f3b56fee4e5cf99023fcc8120a/resource/v1alpha3/types.go#L446
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it has carried over to the beta APIs too: https://github.com/kubernetes/api/blob/master/resource/v1beta1/types.go#L457 :)
kubevirt.io/vm: vm-cirros | ||
name: vm-cirros | ||
spec: | ||
running: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please use the new API: runStrategy: Halted
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
Special notes for your reviewer:
Checklist
This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.
Release note: