Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device plugin does not detect new or updated VF resources, must kill Pod or redeploy #276

Open
xagent003 opened this issue Oct 20, 2020 · 14 comments

Comments

@xagent003
Copy link

The device plugin only detects device info as a one shot when it starts up. VFs need to be created and drivers bound (if using something besides default kernel driver) BEFORE we create the sriov device-plugin's ConfigMap and spin up the device-plugin daemonset. I accidentally did it other way around, and did not see the device plugin discovering the newly created resource. The capacity/allocatable was reported as zero.

Only way to fix this was to kill each pod (easier to just kubectl delete the daemonset, then re-apply)

IMO, it should periodically monitor devices under its ConfigMap.

There may be cases where the VFs created under a NIC change.

Also, it doesn't place a strict requirement on the workflow of node configuration. We can spin up the DevicePlugin and its configMap, and configure the host networking later, as in my case where I deployed multus/sriov/deviceplugin, then realized I forgot to actually write to the "sriov_numvfs" file.

@adrianchiris
Copy link
Contributor

adrianchiris commented Oct 21, 2020

I would say there are cases that need to be handled:

  1. node state has changed (driver bound, sriov created) that caused change to a resource
  2. config map changed

Today in both cases a restart to device plugin is required to properly report updated resources to kubelet.
im +1 on adding support for it.

@zshi-redhat having a periodic monitor for 1 and 2 and updating sriov device plugin resources when needed would also benefit sriov-network-operator which today restarts (delete pods of) sriov device plugin on configuration change WDTY ?

@amorenoz
Copy link
Contributor

How would we handle already allocated resources?

@adrianchiris
Copy link
Contributor

adrianchiris commented Oct 21, 2020

@amorenoz can you elaborate ?
you mean keep the IDs of existing devices?

@amorenoz
Copy link
Contributor

@adrianchiris I mean the Device Plugin does not know which of its resources has already been allocated to pods. It is called on Allocate() but not on pod teardown. This may make handling of some changes challenging, e.g:
A configMap change removes an allocated resource
A configMap change modifies the way an allocated device is exposed (e.g: rdma, vdpa, vhost-net, etc), but the device (and it's information file) is already being used.
A configMap change moves a device to a different pool which the pod that is currently using the device is not entitled to use.

Maybe if a change in the node/configMap is detected, the DP could query kubelet and only proceed with the resource pools reconfiguration if none of it's resources are currently being used (also during the reconfiguration process, it must return an error on Allocate())

@adrianchiris
Copy link
Contributor

I understand your point. But should it be the device plugin responsibility to handle that ?
You will essentially end up the same if you do the changes and restart device plugin without draining the node.

Now, assuming the Operator (human or otherwise) knows what he/it is doing, does it make sense that sriov device plugin reports and allocates resources based on the current configuration and node state instead of what it discovered on startup?
In case both configMap and node state remain un-changed it would be the same.

@zshi-redhat
Copy link
Collaborator

I would say there are cases that need to be handled:

1. node state has changed (driver bound, sriov created) that caused change to a resource

I think there are two cases wrt node state changes:

  1. changes are applied to existing VFs
  2. changes are not applied to existing VFs

For 1), it needs to first drain the node, then conduct the change, otherwise may impact the running workload. so dynamic monitoring won't be useful here as we would anyway need to restart device plugin pod or daemon.
For 2), dynamic monitoring in device plugin would allow discovery of a new resource pool.

We would want to see 2) be supported, but not break 1) scenario.

2. config map changed

Same for configmap change, we need to consider adding dynamic monitoring without breaking the use of existing resource.

Another aspect is if the change of configmap splits existing resource pools (whose resource has been allocated to pod) into different pools, we might need to consider adding device health check and reporting the allocated device as unhealthy in the new resource pool via DP API (this way, it wont break the running workload and allow the allocated resource be returned to the new pool once previous pod is deleted)

Today in both cases a restart to device plugin is required to properly report updated resources to kubelet.
im +1 on adding support for it.

@zshi-redhat having a periodic monitor for 1 and 2 and updating sriov device plugin resources when needed would also benefit sriov-network-operator which today restarts (delete pods of) sriov device plugin on configuration change WDTY ?

If we can have device plugin keep tracking allocated devcies vs un-allocated devices, then DP can intelligently decide whether it needs to restart or dynamically expose newly created resources. Meanwhile, device plugin may provide a callback function for operator to get the decision on restarting because restarting is triggered by operator.

@xagent003
Copy link
Author

You will essentially end up the same if you do the changes and restart device plugin without draining the node.

I agree with that. We understand that it is disruptive to reconfigure VFs under a NIC. For example to change the # of VFs on a NIC, you have to clear and set it to 0 first. You can't just write a new non-zero number to sriov_numvfs file. At least for us, we make it clear that if you do this you better have already schedule Pods/VMs off this node or expect to lose connectivity. Is there even any other way to reconfigure VFs without it already disrupting allocated resources?

@zshi-redhat
Copy link
Collaborator

Is there even any other way to reconfigure VFs without it already disrupting allocated resources?

My understanding is no, @martinkennelly @adrianchiris may have more input.

@martinkennelly
Copy link
Member

For Intel NICs (X700 & E800 series) right now, you would have to reset the VF number to zero, and then set it to your desired amount - therefore its disrupting allocated resources.

@adrianchiris
Copy link
Contributor

For Intel NICs (X700 & E800 series) right now, you would have to reset the VF number to zero, and then set it to your desired amount - therefore its disrupting allocated resources.

Same for Nvidia (Mellanox) NICs

However it is possible to modify a VF's "capabilities" e.g when you load rdma drivers, a VF will now have an RDMA device associated with it.

There are plans in kernel to create "top level" devices for a PF/VF with devlink (e.g rdma, sf)

@ipatrykx
Copy link
Contributor

ipatrykx commented Jul 6, 2021

As it is not really a solution for the original issue (which can be really annoying, can confirm that) I think that it maybe might be beneficial to create something like sriovdp-ctl tool that could initialize some kind of rescan command instead of deleting the pod and recreating it. Main benefits I can think of are:

  1. Improved UX - instead of finding the SRIOV DP pod, deleting it and recreating (which additionally takes some time) user could just type something like kubectl sriovdp rescan <NODE_NAME> to refresh the resource list.
  2. This would not break current approach as this would be still manually executed (that's why it's not really an answer to the original issue here).
  3. Also, if DP will expose some kind of API for this then maybe operators (like sriov-network-operator) could use this approach instead of deleting the pods like @adrianchiris said (although not sure yet how such a communication would look like).
  4. We could add additional commands if necessary. Maybe something like kubectl sriovdp get <NODE_NAME> to get a list of resources that are available on the node, or kubectl sriovdp logs <NODE_NAME> to get logs (getting logs from DP is also a bit annoying for me as you still have to find the name of the pod that runs on particular node, and if you delete the pod the name will change - that can be ofc overcome by defining some bash functions, but still it's inconvenient).
  5. This will utilize the fact that, when in fact DP run's just once, a pod is still a long living process.

This would also be somewhat similar to the approach that might be soon implemented in multus according to this issue here: k8snetworkplumbingwg/multus-cni#488

And if so, maybe some of the codebase for this could be reused for similar tools of NPWG and maybe that could be a part of the common utils repository discussed here: k8snetworkplumbingwg/sriov-cni#187

But of course, this is just an idea here, so would love to hear you thoughts.

@adrianchiris
Copy link
Contributor

adding a kubectl plugin would definetly be nice IMO !

regarding the rescan, i think sriov dp should be reactive (to some extent) to both the system state and config map state.

ill bring this up in tue's meeting lets see if we can get additional inputs.

@zshi-redhat
Copy link
Collaborator

adding a kubectl plugin would definetly be nice IMO !

+1

regarding the rescan, i think sriov dp should be reactive (to some extent) to both the system state and config map state.

One question is how can we get the state of VF which is attached to sriov pod.

ill bring this up in tue's meeting lets see if we can get additional inputs.

@adrianchiris
Copy link
Contributor

This has been discussed in today's community meeting.

  • agreement that supporting system resource updates and sriov network device plugin config updates are two different topics
    that should be addressed separately
  • kubelet monitoring endpoint can be used to calculate the full system state as there may be devices that are already assigned to
    pods. see monitoring-device-plugin-resources
  • a periodic routine can be added to scan for devices and update on added/removed devices
  • a separate issue was created for device health checks as it is not directly related to this issue but its implementation may reuse code if this gets implemented.

e0ne added a commit to e0ne/sriov-network-operator that referenced this issue Aug 2, 2021
Until k8snetworkplumbingwg/sriov-network-device-plugin#276
wiill be fixed we need to restart device plugin pod each time after
SR-IOV Network Operator plugin applied. It's needed because plugin
could change a number of VF resources even if config is not changed.
e0ne added a commit to e0ne/sriov-network-operator that referenced this issue Aug 2, 2021
Until k8snetworkplumbingwg/sriov-network-device-plugin#276
wiill be fixed we need to restart device plugin pod each time after
SR-IOV Network Operator plugin applied. It's needed because plugin
could change a number of VF resources even if config is not changed.
e0ne added a commit to e0ne/sriov-network-operator that referenced this issue Sep 7, 2021
We need to restart device plugin pod after node policy applied
and all SR-IOV Network Config Daemon plugins finished. E.g.:

Generic plugin applies SR-IOV Network Node Policy and creates
VFs. That's mean we need to restart device plugin pod to found
newly created devices.

Related k8snetworkplumbingwg/sriov-network-device-plugin#276
pliurh pushed a commit to pliurh/sriov-network-operator-1 that referenced this issue Oct 20, 2021
Until k8snetworkplumbingwg/sriov-network-device-plugin#276
wiill be fixed we need to restart device plugin pod each time after
SR-IOV Network Operator plugin applied. It's needed because plugin
could change a number of VF resources even if config is not changed.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/sriov-network-operator that referenced this issue Dec 1, 2021
Until k8snetworkplumbingwg/sriov-network-device-plugin#276
wiill be fixed we need to restart device plugin pod each time after
SR-IOV Network Operator plugin applied. It's needed because plugin
could change a number of VF resources even if config is not changed.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/sriov-network-operator that referenced this issue Dec 1, 2021
Until k8snetworkplumbingwg/sriov-network-device-plugin#276
wiill be fixed we need to restart device plugin pod each time after
SR-IOV Network Operator plugin applied. It's needed because plugin
could change a number of VF resources even if config is not changed.
wizhaoredhat pushed a commit to wizhaoredhat/sriov-network-operator that referenced this issue Jun 5, 2023
Until k8snetworkplumbingwg/sriov-network-device-plugin#276
wiill be fixed we need to restart device plugin pod each time after
SR-IOV Network Operator plugin applied. It's needed because plugin
could change a number of VF resources even if config is not changed.

(cherry picked from commit 4ebf517)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants