Device plugin does not detect new or updated VF resources, must kill Pod or redeploy #276
I would say there are cases that need to be handled:
Today, in both cases, a restart of the device plugin is required to properly report updated resources to kubelet. @zshi-redhat having a periodic monitor for 1 and 2 and updating sriov device plugin resources when needed would also benefit sriov-network-operator, which today restarts (deletes the pods of) the sriov device plugin on configuration change. WDYT?
How would we handle already allocated resources?
@amorenoz can you elaborate?
@adrianchiris I mean that the Device Plugin does not know which of its resources have already been allocated to pods. It is called on Allocate() but not on pod teardown. This may make handling some changes challenging. E.g.: maybe if a change in the node/configMap is detected, the DP could query kubelet and only proceed with the resource pool reconfiguration if none of its resources are currently being used (and during the reconfiguration process, it must return an error on Allocate()).
I understand your point. But should it be the device plugin's responsibility to handle that? Now, assuming the operator (human or otherwise) knows what they are doing, does it make sense for the sriov device plugin to report and allocate resources based on the current configuration and node state instead of what it discovered on startup?
I think there are two cases wrt node state changes:
For 1), it needs to first drain the node and then conduct the change, otherwise it may impact running workloads. So dynamic monitoring won't be useful here, as we would need to restart the device plugin pod or daemon anyway. We would want to see 2) supported, but without breaking the 1) scenario.
The same goes for configMap changes: we need to consider adding dynamic monitoring without breaking the use of existing resources. Another aspect is that if the configMap change splits existing resource pools (whose resources have been allocated to pods) into different pools, we might need to consider adding a device health check and reporting the allocated device as unhealthy in the new resource pool via the DP API. That way, it won't break the running workload, and it allows the allocated resource to be returned to the new pool once the previous pod is deleted.
If we can have the device plugin keep track of allocated vs. un-allocated devices, then the DP can intelligently decide whether it needs to restart or can dynamically expose newly created resources. Meanwhile, the device plugin could provide a callback function for the operator to get the decision on restarting, since restarting is triggered by the operator.
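The bookkeeping suggested above could look roughly like the following sketch. This is purely illustrative (the real plugin is written in Go, and the function names here are invented): devices are marked on Allocate() and cleared when the pod is observed gone, so the plugin can tell whether a pool is safe to rebuild.

```shell
#!/usr/bin/env bash
# Hypothetical allocation bookkeeping: device IDs are marked when
# handed to a pod and cleared on teardown. Names are invented for
# illustration only.
declare -A ALLOCATED   # device id -> 1 while handed out to a pod

mark_allocated() { ALLOCATED["$1"]=1; }
mark_released()  { unset "ALLOCATED[$1]"; }

# A pool is safe to reconfigure only when none of its devices are in use.
pool_idle() {
  local dev
  for dev in "$@"; do
    if [ -n "${ALLOCATED[$dev]:-}" ]; then
      return 1
    fi
  done
  return 0
}
```

With state like this, the plugin could defer a pool rebuild until `pool_idle` holds for every device in the affected pool.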
I agree with that. We understand that it is disruptive to reconfigure VFs under a NIC. For example, to change the number of VFs on a NIC, you have to set it to 0 first; you can't just write a new non-zero number to the sriov_numvfs file. At least for us, we make it clear that if you do this, you had better have already scheduled Pods/VMs off this node, or expect to lose connectivity. Is there even any other way to reconfigure VFs without disrupting already allocated resources?
My understanding is no; @martinkennelly @adrianchiris may have more input.
For Intel NICs (X700 & E800 series) right now, you would have to reset the VF number to zero and then set it to your desired amount, so it disrupts allocated resources.
Same for Nvidia (Mellanox) NICs. However, it is possible to modify a VF's "capabilities": e.g. when you load rdma drivers, a VF will now have an RDMA device associated with it. There are plans in the kernel to create "top level" devices for a PF/VF with devlink (e.g. rdma, sf).
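For reference, the reset-to-zero sequence described in the comments above looks roughly like this. It is a hedged sketch: the helper name is invented, and the file path is a parameter so the write sequence itself can be exercised against any path.

```shell
# Sketch of the disruptive VF reconfiguration described above. The
# kernel rejects writing a non-zero count to sriov_numvfs while VFs
# already exist, so the count must be cleared to 0 first.
set_numvfs() {
  local numvfs_file="$1" count="$2"
  echo 0 > "$numvfs_file"        # destroy existing VFs (disruptive!)
  echo "$count" > "$numvfs_file" # create the desired number of VFs
}

# Typical invocation on a real node (requires root), e.g.:
#   set_numvfs /sys/class/net/ens1f0/device/sriov_numvfs 8
```

The intermediate write of 0 is exactly why this operation cannot be done without disrupting resources already allocated from that PF.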
As it is not really a solution for the original issue (which can be really annoying, I can confirm), I think it might be beneficial to create something like
This would also be somewhat similar to the approach that might soon be implemented in multus, according to this issue: k8snetworkplumbingwg/multus-cni#488. If so, maybe some of the codebase for this could be reused for similar tools of the NPWG, and maybe it could be a part of the common But of course, this is just an idea, so I would love to hear your thoughts.
Adding a kubectl plugin would definitely be nice IMO! Regarding the rescan, I think the sriov dp should be reactive (to some extent) to both the system state and the configMap state. I'll bring this up in Tuesday's meeting; let's see if we can get additional input.
+1
One question is how we can get the state of a VF that is attached to an sriov pod.
This has been discussed in today's community meeting.
Until k8snetworkplumbingwg/sriov-network-device-plugin#276 is fixed, we need to restart the device plugin pod each time after an SR-IOV Network Operator plugin is applied. This is needed because a plugin could change the number of VF resources even if the config has not changed.
We need to restart the device plugin pod after a node policy is applied and all SR-IOV Network Config Daemon plugins have finished. E.g.: the Generic plugin applies an SR-IOV Network Node Policy and creates VFs. That means we need to restart the device plugin pod to discover the newly created devices. Related: k8snetworkplumbingwg/sriov-network-device-plugin#276
The device plugin only detects device info as a one-shot when it starts up. VFs need to be created and drivers bound (if using something other than the default kernel driver) BEFORE we create the sriov device plugin's ConfigMap and spin up the device plugin daemonset. I accidentally did it the other way around and did not see the device plugin discover the newly created resource; the capacity/allocatable was reported as zero.
The only way to fix this was to kill each pod (it is easier to just kubectl delete the daemonset and then re-apply it).
IMO, it should periodically monitor the devices under its ConfigMap.
There may be cases where the VFs created under a NIC change.
Also, it wouldn't place a strict requirement on the workflow of node configuration. We could spin up the DevicePlugin and its configMap and configure the host networking later, as in my case where I deployed multus/sriov/deviceplugin and then realized I had forgotten to actually write to the "sriov_numvfs" file.
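The periodic monitoring suggested here could be sketched as follows. This is hypothetical polling logic only (the real plugin would implement it in Go); the function name and loop are invented for illustration.

```shell
# Hypothetical check a periodic monitor could run: re-read the PF's
# sriov_numvfs file and report whether the VF count changed since the
# last observation.
numvfs_changed() {
  local last="$1" numvfs_file="$2"
  [ "$(cat "$numvfs_file")" != "$last" ]
}

# A monitor loop would then look roughly like:
#   while sleep 30; do
#     if numvfs_changed "$last" "$pf/sriov_numvfs"; then
#       rescan_and_update_pools   # invented name for illustration
#       last=$(cat "$pf/sriov_numvfs")
#     fi
#   done
```

On a change, the plugin could rescan and re-advertise resources via ListAndWatch instead of requiring a pod restart.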