Device plugin does not detect new or updated VF resources, must kill Pod or redeploy #276
I would say there are cases that need to be handled:
Today, in both cases, a restart of the device plugin is required to properly report updated resources to kubelet. @zshi-redhat having a periodic monitor for 1 and 2 and updating sriov device plugin resources when needed would also benefit sriov-network-operator, which today restarts (deletes the pods of) the sriov device plugin on configuration change. WDYT?
How would we handle already allocated resources?
@amorenoz can you elaborate?
@adrianchiris I mean that the Device Plugin does not know which of its resources have already been allocated to pods. It is called on Allocate() but not on pod teardown. This may make handling some changes challenging. E.g.: maybe if a change in the node/configMap is detected, the DP could query kubelet and only proceed with the resource pool reconfiguration if none of its resources are currently being used (and during the reconfiguration process, it must return an error on Allocate()).
I understand your point. But should it be the device plugin's responsibility to handle that? Now, assuming the operator (human or otherwise) knows what they are doing, does it make sense for the sriov device plugin to report and allocate resources based on the current configuration and node state instead of what it discovered on startup?
I think there are two cases wrt node state changes:
For 1), it needs to first drain the node and then conduct the change, otherwise it may impact running workloads. So dynamic monitoring won't be useful here, as we would need to restart the device plugin pod or daemon anyway. We would want to see 2) supported, but without breaking the 1) scenario.
The same goes for configMap changes: we need to consider adding dynamic monitoring without breaking the use of existing resources. Another aspect is that if the configMap change splits existing resource pools (whose resources have been allocated to pods) into different pools, we might need to consider adding a device health check and reporting the allocated device as unhealthy in the new resource pool via the DP API. That way, it won't break the running workload, and it allows the allocated resource to be returned to the new pool once the previous pod is deleted.
If we can have the device plugin keep track of allocated vs. un-allocated devices, then the DP can intelligently decide whether it needs to restart or can dynamically expose newly created resources. Meanwhile, the device plugin could provide a callback function for the operator to get the decision on restarting, since restarting is triggered by the operator.
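The bookkeeping suggested above could look roughly like the following sketch. This is purely illustrative (the real plugin is written in Go, and the function names here are invented): devices are marked on Allocate() and cleared when the pod is observed gone, so the plugin can tell whether a pool is safe to rebuild.

```shell
#!/usr/bin/env bash
# Hypothetical allocation bookkeeping: device IDs are marked when
# handed to a pod and cleared on teardown. Names are invented for
# illustration only.
declare -A ALLOCATED   # device id -> 1 while handed out to a pod

mark_allocated() { ALLOCATED["$1"]=1; }
mark_released()  { unset "ALLOCATED[$1]"; }

# A pool is safe to reconfigure only when none of its devices are in use.
pool_idle() {
  local dev
  for dev in "$@"; do
    if [ -n "${ALLOCATED[$dev]:-}" ]; then
      return 1
    fi
  done
  return 0
}
```

With state like this, the plugin could defer a pool rebuild until `pool_idle` holds for every device in the affected pool.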
I agree with that. We understand that it is disruptive to reconfigure VFs under a NIC. For example, to change the number of VFs on a NIC, you have to set it to 0 first; you can't just write a new non-zero number to the sriov_numvfs file. At least for us, we make it clear that if you do this, you had better have already scheduled Pods/VMs off this node, or expect to lose connectivity. Is there even any other way to reconfigure VFs without disrupting already allocated resources?
My understanding is no; @martinkennelly @adrianchiris may have more input.
For Intel NICs (X700 & E800 series) right now, you would have to reset the VF number to zero and then set it to your desired amount, so it disrupts allocated resources.
Same for Nvidia (Mellanox) NICs. However, it is possible to modify a VF's "capabilities": e.g. when you load rdma drivers, a VF will now have an RDMA device associated with it. There are plans in the kernel to create "top level" devices for a PF/VF with devlink (e.g. rdma, sf).
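For reference, the reset-to-zero sequence described in the comments above looks roughly like this. It is a hedged sketch: the helper name is invented, and the file path is a parameter so the write sequence itself can be exercised against any path.

```shell
# Sketch of the disruptive VF reconfiguration described above. The
# kernel rejects writing a non-zero count to sriov_numvfs while VFs
# already exist, so the count must be cleared to 0 first.
set_numvfs() {
  local numvfs_file="$1" count="$2"
  echo 0 > "$numvfs_file"        # destroy existing VFs (disruptive!)
  echo "$count" > "$numvfs_file" # create the desired number of VFs
}

# Typical invocation on a real node (requires root), e.g.:
#   set_numvfs /sys/class/net/ens1f0/device/sriov_numvfs 8
```

The intermediate write of 0 is exactly why this operation cannot be done without disrupting resources already allocated from that PF.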
As it is not really a solution for the original issue (which can be really annoying, I can confirm), I think it might be beneficial to create something like
This would also be somewhat similar to the approach that might soon be implemented in multus, according to this issue: k8snetworkplumbingwg/multus-cni#488. If so, maybe some of the codebase for this could be reused for similar tools of the NPWG, and maybe it could be a part of the common But of course, this is just an idea, so I would love to hear your thoughts.
Adding a kubectl plugin would definitely be nice IMO! Regarding the rescan, I think the sriov dp should be reactive (to some extent) to both the system state and the configMap state. I'll bring this up in Tuesday's meeting; let's see if we can get additional input.
+1
One question is how we can get the state of a VF that is attached to an sriov pod.
This has been discussed in today's community meeting.
Until k8snetworkplumbingwg/sriov-network-device-plugin#276 is fixed, we need to restart the device plugin pod each time after an SR-IOV Network Operator plugin is applied. This is needed because a plugin could change the number of VF resources even if the config has not changed.
We need to restart the device plugin pod after a node policy is applied and all SR-IOV Network Config Daemon plugins have finished. E.g.: the Generic plugin applies an SR-IOV Network Node Policy and creates VFs. That means we need to restart the device plugin pod to discover the newly created devices. Related: k8snetworkplumbingwg/sriov-network-device-plugin#276
The device plugin only detects device info as a one-shot when it starts up. VFs need to be created and drivers bound (if using something other than the default kernel driver) BEFORE we create the sriov device plugin's ConfigMap and spin up the device plugin daemonset. I accidentally did it the other way around and did not see the device plugin discover the newly created resource; the capacity/allocatable was reported as zero.
The only way to fix this was to kill each pod (it is easier to just kubectl delete the daemonset and then re-apply it).
IMO, it should periodically monitor the devices under its ConfigMap.
There may be cases where the VFs created under a NIC change.
Also, it wouldn't place a strict requirement on the workflow of node configuration. We could spin up the DevicePlugin and its configMap and configure the host networking later, as in my case where I deployed multus/sriov/deviceplugin and then realized I had forgotten to actually write to the "sriov_numvfs" file.
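The periodic monitoring suggested here could be sketched as follows. This is hypothetical polling logic only (the real plugin would implement it in Go); the function name and loop are invented for illustration.

```shell
# Hypothetical check a periodic monitor could run: re-read the PF's
# sriov_numvfs file and report whether the VF count changed since the
# last observation.
numvfs_changed() {
  local last="$1" numvfs_file="$2"
  [ "$(cat "$numvfs_file")" != "$last" ]
}

# A monitor loop would then look roughly like:
#   while sleep 30; do
#     if numvfs_changed "$last" "$pf/sriov_numvfs"; then
#       rescan_and_update_pools   # invented name for illustration
#       last=$(cat "$pf/sriov_numvfs")
#     fi
#   done
```

On a change, the plugin could rescan and re-advertise resources via ListAndWatch instead of requiring a pod restart.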