still need help install sriov-network-operator #672
Hello, |
Hey, seems that you don't have the required It can be created with helm using the following parameters: |
@rollandf Thank you. I set sriovOperatorConfig.deploy to true in the default values.yaml and ran helm upgrade; the config daemon is up. Compared to the example in quick-start, we're still missing the service obj. Is that expected? Shall we create the svc manually? `$ kubectl --context dell4 get all -n sriov-network-operator NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE NAME READY UP-TO-DATE AVAILABLE AGE NAME DESIRED CURRENT READY AGE |
I don't think that the service is needed. Seems to be an issue in the docs, actually. |
@rollandf Thank you. Next, with initial sriovnetworknodestates.sriovnetwork.openshift.io as:
I created a SriovNetworkNodePolicy, apiVersion: sriovnetwork.openshift.io/v1 It triggered creation of sriov-device-plugin, but the operator pod went into CrashLoopBackOff state, logs reported "panic: runtime error: invalid memory address or nil pointer dereference" and "[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a7004d]" How to fix this? [mtx@mtx-dell4-bld08 sriov-network-operator]$ kubectl get all -n sriov-network-operator NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE NAME READY UP-TO-DATE AVAILABLE AGE NAME DESIRED CURRENT READY AGE New state of sriovnetworknodestates.sriovnetwork.openshift.io: spec:
sriov-network-operator-845dc5dffc-4hvsb.log Thanks. -Jessica |
@SchSeba The SriovNetworkNodePolicy spec was pasted in my last comment. Thanks. -Jessica |
The yaml you shared is from a local file. I want you to show me the one that is in the k8s API server. Please run |
sriovnetworknodepolicy.yaml.txt nodeSelector is empty in attached output |
Yep, that was my expectation. I think the label you wanted is something like: nodeSelector:
node-role.kubernetes.io/worker: "" |
@SchSeba Thank you. Corrected nodeSelector in sriovnetworknodepolicy, and the operator pod is back in Running state, but the sriovnetworknodestates still report "cannot configure sriov interfaces", no VFs. Anything to check on the hardware side? `apiVersion: sriovnetwork.openshift.io/v1
One of the sriov-device-plugin pod log (the other 2 are similar): I0411 19:59:15.007473 1 manager.go:57] Using Kubelet Plugin Registry Mode |
sriov-network-operator-845dc5dffc-4hvsb (2).log operator pod log. |
Is there a way to debug this issue? failed to configure sriov on interface. Worker nodes are running k8s 1.26, RH8.6 |
Hi, as you can see it's an Intel NIC, and in the status of the sriovNetworkNodeState there is no maxVf, which suggests that you didn't enable SR-IOV in the BIOS of the machine. |
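The diagnosis above can also be checked directly on the node, independently of the operator. What follows is a minimal sketch, not something posted in the thread: it reads the standard `sriov_totalvfs` sysfs attribute of a PCI device. The function name and the optional sysfs-root parameter are assumptions added for illustration. A missing file or a value of 0 would be consistent with SR-IOV not being enabled in the BIOS.

```shell
# Print how many VFs a PCI device advertises; fail if the attribute is absent.
# $1: PCI address, e.g. 0000:3b:00.0
# $2: sysfs root (optional, defaults to /sys/bus/pci/devices);
#     this parameter is hypothetical and exists only to make the sketch testable
sriov_totalvfs() {
    dev_dir="${2:-/sys/bus/pci/devices}/$1"
    if [ ! -r "$dev_dir/sriov_totalvfs" ]; then
        echo "no sriov_totalvfs for $1: SR-IOV is not exposed by this device" >&2
        return 1
    fi
    cat "$dev_dir/sriov_totalvfs"
}
```

On a worker node this could be called as `sriov_totalvfs 0000:3b:00.0`, using the PCI address that appears later in the thread.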
@SchSeba Thanks! checking with lab on this. |
Lab team enabled sriov on the NICs. SriovNetworkNodeState now reports totalvfs: 64, but still "cannot configure sriov interfaces". Tried deleting and re-applying the same SriovNetworkNodePolicy, which didn't help. `$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator mtx-dell4-bld01.dc1.matrixxsw.com -o yaml
|
Anything else we should check? |
Can you provide new logs from config daemon? |
Config daemon says cannot allocate memory.
|
Seems that you need to add the following kernel arg: |
@hymgg can you please check the BIOS? I think there is a setting called 4M memory or something like that. |
@rollandf @SchSeba The VFs showed up after adding pci=realloc to kernel. Thanks! `$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -n sriov-network-operator mtx-dell4-bld01.dc1.matrixxsw.com -o yaml
$ kubectl get no -o json | jq -r '[.items[] | {name:.metadata.name, allocable:.status.allocatable}]' $ kubectl get all -n sriov-network-operator $ kubectl logs sriov-device-plugin-4mdbw -n sriov-network-operator |
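After a reboot it can be worth confirming that the kernel argument actually took effect. The sketch below is an illustration under assumptions, not a command from the thread: it checks whether an exact argument such as `pci=realloc` appears on the running kernel's command line. The function name and the optional file parameter are hypothetical.

```shell
# Succeed if the exact kernel argument $1 is present on the command line.
# $2: file to read (optional, defaults to /proc/cmdline);
#     this parameter is hypothetical and exists only to make the sketch testable
has_kernel_arg() {
    for arg in $(cat "${2:-/proc/cmdline}"); do
        # Compare whole arguments so that pci=realloc does not match pci=realloc=off
        [ "$arg" = "$1" ] && return 0
    done
    return 1
}
```

For example, `has_kernel_arg pci=realloc && echo present`. The same check works for the `intel_iommu=on` and `iommu=pt` arguments discussed further down.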
Forgot to mention, while the device plugin pods terminate/create, the nodes take turns going into SchedulingDisabled state too. $ kubectl get node |
Hello, any other ideas to investigate? |
still looking for remedy to the situation... |
@hymgg I ran into this almost 8 months ago. Almost everything in your post. On a single node, clean test cluster this thing works. But our nodes have hundreds of labels. Are your clusters rke? Generally this project did not seem great. Just even that the labels are hard coded is absolutely terrible. I'm already not looking forward to trudging down this path again. |
@ns-rlewkowicz Thanks for sharing your experience. Is there an alternative that works better? This is a vanilla k8s on bare metal rh8 nodes, installed with kubeadm. Just the essentials, nothing fancy. |
Are the created SR-IOV virtual functions bound to the Intel driver? From the logs it doesn't seem so. |
@hymgg can you please run
and also please check again in the bios configuration about
|
Hi @ns-rlewkowicz any specific issue that the community can help with? |
@SchSeba One of my biggest issues was that labels were hardcoded in the code. It didn't have a configurable option. I can hack around, I'm a good dev when I need to be. It was just a pain. Then I hit the segfaults same as this post and I called it. I just took a look at issues before I started again. For our current configurations we laid down the plugin manually. We have new node configurations coming as well as some other requirements so it was just more manual config or trying this. I should add too, same node, minikube, no segfaults. Production cluster config, segfaults. I'll track it down, I'm just not excited about it. I'm sorry I've not been good with responses. I'm a little selfish in this. I at least do give some break fix edit: Idk why the quick start doesn't lead with this, and instead uses this custom make system. Seems like it's just a docs gap. |
@hymgg You can lay the plugin down manually if you have limited node configurations. |
@adrianchiris autoprobe is enabled, is it possible something with this (version) of driver? is there alternative? [root@mtx-dell4-bld01 ~]# cat /sys/bus/pci/devices/0000:3b:00.0/sriov_drivers_autoprobe |
@SchSeba this is set to enable in the bios:
lspci -v -nn -mm -k -s 0000:3b:00.0
Slot: 3b:00.0
lspci -vvv -s 0000:3b:00.0
3b:00.0 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE backplane (rev 02) |
@ns-rlewkowicz Thank you, how to lay the plugin down manually? |
Any other ideas? |
Hi @hymgg, Sorry for the late response.
If that returns an error, then the problem is something in the operating system/drivers, and the operator will also not be able to configure the virtual functions. You can try to check with dmesg after the echo command to see if the drivers print any error. |
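The manual test described above amounts to writing a VF count into the device's `sriov_numvfs` sysfs attribute and then looking at `dmesg`. Here is a minimal sketch, assuming the device directory is passed in explicitly (the function name is hypothetical). It resets the count to 0 first because the kernel refuses to change one non-zero VF count directly to another.

```shell
# Request $2 VFs on the PCI device directory $1 and print the resulting count.
# $1: device directory, e.g. /sys/bus/pci/devices/0000:3b:00.0 (needs root)
# $2: number of VFs to create
set_numvfs() {
    # Disable existing VFs first; a non-zero to non-zero change is rejected
    echo 0 > "$1/sriov_numvfs" || return 1
    echo "$2" > "$1/sriov_numvfs" || return 1
    cat "$1/sriov_numvfs"
}
```

If either write fails, running `dmesg | tail` right afterwards should show whatever the driver logged, which is the check suggested above.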
Anything else to check / try? |
Hi @hymgg, Looking at the dmesg.log you shared, it looks like the driver correctly creates the VFs, but network devices do not spawn.
I would double check whether following the Intel guide produces a correct system configuration, as the problem seems to be outside the operator's scope. Can you also share the a |
@zeeke after manually echo 5 > .../sriov_numvfs, the operator pod has been crashing,
Attached current ns dump W/O any SriovNetworkNodePolicy |
The operator's pod is crashing due to a missing CRD in the cluster:
It sounds like the installation is not 100% clean. Can you try uninstalling the operator completely and deploying it again? |
@zeeke sure. Tried to create the policy again, the device plugin pod soon went to Terminating state, and the node to SchedulingDisabled.
latest cluster-info dump, |
Hi @hymgg, looking at the archive you shared, I see the config daemon is reporting the error:
And I think the device-plugin failure depends on the Virtual Functions not having the driver installed. So the problem here is in the VF driver. Can you check that when you manually create VF ( If they don't spawn, there is a problem with the drivers and we have to check the kernel journal. |
@zeeke thanks for the followup. Removed SriovNetworkNodePolicy to restore node/pod to healthy state. 5 VF spawned OK.
|
your command Please, try running
If you can't configure SR-IOV devices manually, I'm afraid the problem is out of the operator's scope |
@zeeke Does modinfo output look OK? what else do we need to do to enable sriov?
|
Is it because we didn't add this to kernel? we don't have VM in this lab. |
BTW, as the intel guide [1] suggests to add them, please give it a try. |
intel_iommu=on iommu=pt didn't help, added to kernel and rebooted node
cluster-info-dump uploaded. |
Hi @hymgg can you please run |
@SchSeba thanks for the followup, will reinstall the operator and check with lspci. |
great I will wait for an update :) |
@SchSeba Found iavf in a blacklist.conf, talking to lab team about this.
` grep iavf /etc/modprobe.d/*
/etc/modprobe.d/anaconda-blacklist.conf:blacklist iavf
lspci|grep "Virtual Function"
3b:0a.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
lspci -vv -nn -mm -k -s 3b:0a.0
Slot: 3b:0a.0
lspci -vv -nn -mm -k -s 3b:0a.1
Slot: 3b:0a.1 |
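The blacklist entry found here is easy to miss on freshly installed nodes. As a rough sketch (the function name and the optional directory parameter are assumptions added for illustration), the following lists the modprobe configuration files that blacklist a given module, so it can be run for `iavf` on every worker before applying a SriovNetworkNodePolicy.

```shell
# List modprobe config files that contain a "blacklist <module>" line.
# $1: module name, e.g. iavf
# $2: directory to search (optional, defaults to /etc/modprobe.d);
#     this parameter is hypothetical and exists only to make the sketch testable
find_blacklisted() {
    grep -lE "^[[:space:]]*blacklist[[:space:]]+$1([[:space:]]|\$)" \
        "${2:-/etc/modprobe.d}"/* 2>/dev/null
}
```

`find_blacklisted iavf` prints the matching file names and returns non-zero when no file blacklists the module.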
Removed iavf from blacklist. After reapplying the SriovNetworkNodePolicy, pods/node stay alive, and the node allocatable resource list has "openshift.io/ens1f1": "8", so it's good.
Created a SriovNetwork sriovnetwork-ens1f1 using host-local ipam,
|
Next 2 questions, 1.) do we support whereabouts ipam? or what ipam should we use so pods on the same sriov network can talk to each other? After above success, I deleted test pod, and the SriovNetwork, changed its ipam from host-local to whereabouts, and recreated it. but the pod failed to create, error from describe pod:
2.) how do I create a SriovNetwork in a different namespace? I tried modifying the namespace in the above SriovNetwork yaml and applying it, but found nothing in the new ns. Thanks. -Jessica |
@SchSeba Could you guide us on the 2 questions above? |
Hi @hymgg, sorry for the late reply, I completely missed your message. The error you see is probably because you didn't configure whereabouts correctly; you will need to check the documentation. |
Continuing from issue #584,
@adrianchiris Sorry for the late followup.
Installing with helm was much easier than following the quick start steps. However, it only brought up the sriov-network-operator pod; according to the quick start guide, there should be a sriov-network-config-daemon too?
`$ ls
Chart.yaml crds README.md templates values.yaml
$ helm3 install -n sriov-network-operator --create-namespace --wait sriov-network-operator ./
$ kubectl get all -n sriov-network-operator
NAME READY STATUS RESTARTS AGE
pod/sriov-network-operator-845dc5dffc-4hvsb 1/1 Running 0 20m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/sriov-network-operator 1/1 1 1 20m
NAME DESIRED CURRENT READY AGE
replicaset.apps/sriov-network-operator-845dc5dffc 1 1 1 20m
$ kubectl logs deployment.apps/sriov-network-operator -n sriov-network-operator|tail -5
2024-03-29T05:02:53.668128868Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "ed902977-3a07-4cea-bb20-0cefbff5ea9e"}
2024-03-29T05:02:58.668612364Z INFO controller/controller.go:119 Reconciling {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "98591413-4718-4d3c-abaf-14d3dcf1c43c"}
2024-03-29T05:02:58.668676704Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "98591413-4718-4d3c-abaf-14d3dcf1c43c"}
2024-03-29T05:03:03.669236989Z INFO controller/controller.go:119 Reconciling {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "2a0835ad-a117-4caa-8ace-9afc525b6d70"}
2024-03-29T05:03:03.669309844Z INFO controller/controller.go:119 default SriovOperatorConfig object not found, cannot reconcile SriovNetworkNodePolicies. Requeue. {"controller": "sriovnetworknodepolicy", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovNetworkNodePolicy", "SriovNetworkNodePolicy": {"name":"node-policy-sync-event"}, "namespace": "", "name": "node-policy-sync-event", "reconcileID": "2a0835ad-a117-4caa-8ace-9afc525b6d70"}
Additional info, may not be relevant.
$ kubectl label ns sriov-network-operator pod-security.kubernetes.io/enforce=privileged
$ kubectl get node -l node-role.kubernetes.io/worker
NAME STATUS ROLES AGE VERSION
mtx-dell4-bld01.dc1.matrixxsw.com Ready worker 264d v1.26.6
mtx-dell4-bld02.dc1.matrixxsw.com Ready worker 264d v1.26.6
mtx-dell4-bld03.dc1.matrixxsw.com Ready worker 264d v1.26.6
`
Shall we / how do we get sriov-network-config-daemon installed?
Thanks. -Jessica
Originally posted by @hymgg in #584 (comment)