
ProviderID tolerations to support Sidero, RKE, etc #140

Open · iAdanos opened this issue Sep 3, 2024 · 16 comments

@iAdanos commented Sep 3, 2024

Feature Request

Add an option to tolerate certain ProviderIDs on nodes, so that the app can still label such nodes.

Description

To resolve the same issue as #111 for RKE2 and other Kubernetes engines, it would be great to have a flag or option that permits the Proxmox CCM to label nodes that already have a foreign ProviderID.

For example, RKE labels nodes on bootstrap, but after that it assumes the cluster administrator will label nodes on their own.

An option to tolerate certain provider IDs, so that such nodes are not skipped, would be great for such cases.
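A purely hypothetical sketch of what such an option could look like; the flag name and syntax below are invented for illustration and do not exist in the CCM today:

# Hypothetical flag: treat nodes whose providerID starts with one of these
# prefixes as manageable and still apply the Proxmox labels to them.
proxmox-cloud-controller-manager \
  --cloud-provider=proxmox \
  --provider-id-tolerations=rke2://,sidero://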

@sergelogvinov (Owner)

Can you give me an example, please? I didn't quite get your idea.

The ProviderID magic string is used by many controllers to find the node in the cloud. Do you need to label the nodes according to some special logic/rules?

@fibbs commented Sep 8, 2024

Hi there,

as I am also affected by this (RKE2), let me try to explain what is probably @iAdanos' problem; it is certainly mine:

proxmox-cloud-controller skips setting node labels when a "foreign providerID" is set. Unfortunately, RKE2 sets the ProviderID itself, so when trying to use proxmox-cloud-controller, it never sets the labels we need on the nodes in order to use proxmox-csi-plugin.

I0908 21:32:40.171009       1 instances.go:111] instances.InstanceMetadata() called, node: m-b552ec39-b6a4-44a2-a172-6d19c1cc7471
I0908 21:32:40.171039       1 instances.go:129] instances.InstanceMetadata() node m-b552ec39-b6a4-44a2-a172-6d19c1cc7471 has foreign providerID: rke2://m-b552ec39-b6a4-44a2-a172-6d19c1cc7471, skipped
I0908 21:32:40.171074       1 instances.go:111] instances.InstanceMetadata() called, node: m-c1571c9c-e68c-4f4e-9922-f785eabdaf5a
I0908 21:32:40.171081       1 instances.go:129] instances.InstanceMetadata() node m-c1571c9c-e68c-4f4e-9922-f785eabdaf5a has foreign providerID: rke2://m-c1571c9c-e68c-4f4e-9922-f785eabdaf5a, skipped
I0908 21:32:40.171106       1 instances.go:111] instances.InstanceMetadata() called, node: m-cd52f75d-da48-4913-97bf-676bf61000c6
I0908 21:32:40.171116       1 instances.go:129] instances.InstanceMetadata() node m-cd52f75d-da48-4913-97bf-676bf61000c6 has foreign providerID: rke2://m-cd52f75d-da48-4913-97bf-676bf61000c6, skipped
I0908 21:32:40.171142       1 instances.go:111] instances.InstanceMetadata() called, node: m-f452e8bc-6bbe-4e7b-a518-f3853043c697
I0908 21:32:40.171155       1 instances.go:129] instances.InstanceMetadata() node m-f452e8bc-6bbe-4e7b-a518-f3853043c697 has foreign providerID: rke2://m-f452e8bc-6bbe-4e7b-a518-f3853043c697, skipped
I0908 21:32:40.171182       1 instances.go:111] instances.InstanceMetadata() called, node: m-f4f20ce1-f0ba-4768-a05e-e40bf92229e5
I0908 21:32:40.171189       1 instances.go:129] instances.InstanceMetadata() node m-f4f20ce1-f0ba-4768-a05e-e40bf92229e5 has foreign providerID: rke2://m-f4f20ce1-f0ba-4768-a05e-e40bf92229e5, skipped
I0908 21:32:40.171215       1 instances.go:111] instances.InstanceMetadata() called, node: m-2d191054-a22a-4fb9-b36e-878eeb0eed82
I0908 21:32:40.171223       1 instances.go:129] instances.InstanceMetadata() node m-2d191054-a22a-4fb9-b36e-878eeb0eed82 has foreign providerID: rke2://m-2d191054-a22a-4fb9-b36e-878eeb0eed82, skipped
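For reference, the providerID RKE2 has set can be checked directly on the node object:

kubectl get node m-b552ec39-b6a4-44a2-a172-6d19c1cc7471 -o jsonpath='{.spec.providerID}'
# prints: rke2://m-b552ec39-b6a4-44a2-a172-6d19c1cc7471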

Now that I have looked at it in a bit more detail, it seems that RKE2 doesn't allow changing the providerID, so the approach of this project would not work at all. Just skipping setting the providerID is not a solution either, as proxmox-csi-plugin apparently looks up the (Proxmox) VM ID in there to identify where in the cluster the VM is.

I would like to do some tests disabling the default RKE2 cloud-controller-manager, but I am not sure whether this project will act as a full replacement and remove the taints a host has when no cloud controller manager is running. Will let you know what the result is.

Other than that, would it be viable to store the Proxmox VM ID in a label instead, and use this label in proxmox-csi-plugin as well? This way, we could leave RKE2's own magic as it is and still maintain the important functionality.

Thanks a lot in advance

Christian

@fibbs commented Sep 8, 2024

I did some tests in my RKE2 cluster:

I have set in the cluster configuration (in Rancher):

    machineGlobalConfig:
      disable-cloud-controller: true

This causes the already-existing nodes to simply stop the RKE-native cloud-controller-manager while keeping their providerID as is. The cluster continues working as if nothing happened.

With this configuration, I tried to add a new node to the cluster. The following happens:

No RKE2 cloud-controller-manager -> no providerID gets set by it. Proxmox-cloud-controller tries to do its thing, but:

I0908 22:17:40.181590       1 instances.go:111] instances.InstanceMetadata() called, node: m-fb9c5542-4f4f-4ffb-bc2d-2b04211926d5
I0908 22:17:40.181608       1 instances.go:163] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node m-fb9c5542-4f4f-4ffb-bc2d-2b04211926d5?

So, proxmox-cloud-controller-manager still refuses to set anything for this node, now because the "cloud-provider" name is (apparently automatically) set to "external". Therefore, no providerID gets set at all, and this node never appears in Rancher (I once had a very similar situation with the vSphere cloud provider, for a completely different reason).

Then I tried the following:

    machineGlobalConfig:
      disable-cloud-controller: true
      cloud-provider-name: proxmox

This unfortunately led to an existing node, the bootstrap node, becoming unavailable to Rancher, as apparently only a few "cloud-provider-name" values are allowed by RKE2. So setting it like this is not an option.

I have found an example with the "out-of-tree AWS" cloud-controller-manager here, with the possibility of not setting the RKE2 option cloud-provider-name but adding the kubelet arg --cloud-provider=<something> explicitly. I will probably play around with this tomorrow, but I don't have too much hope.

@sergelogvinov (Owner)

Hi, yep, the ProviderID is a magic string; it can be set only once, during the join process. It is an immutable value.
You can set the ProviderID as a kubelet argument, but with a CCM that is not necessary.
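For reference, the kubelet flags in question look like this; the values are placeholders, and the proxmox://<region>/<vmid> format matches what the CCM itself would set:

# Set once at join time; the providerID is immutable afterwards.
kubelet --cloud-provider=external \
  --provider-id=proxmox://my-region/100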

I've never used RKE, but based on the documentation, you need to:

https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/kubernetes-clusters-in-rancher-setup/set-up-cloud-providers/amazon#using-the-out-of-tree-aws-cloud-provider

  1. Set the RKE config.

For the control plane:

spec:
  rkeConfig:
    machineSelectorConfig:
      - config:
          disable-cloud-controller: true
          kube-apiserver-arg:
            - cloud-provider=external
          kube-controller-manager-arg:
            - cloud-provider=external
          kubelet-arg:
            - cloud-provider=external

For workers:

spec:
  rkeConfig:
    machineSelectorConfig:
      - config:
          disable-cloud-controller: true
          kubelet-arg:
            - cloud-provider=external

  2. The name of the VM has to be the same as the node name. The CCM determines the node ID by hostname.

@sergelogvinov (Owner)

I am curious: who sets the name m-2d191054-a22a-4fb9-b36e-878eeb0eed82?
Do you use a node autoscaler or any similar solution for Proxmox?

@fibbs commented Sep 9, 2024

I am curious: who sets the name m-2d191054-a22a-4fb9-b36e-878eeb0eed82? Do you use a node autoscaler or any similar solution for Proxmox?

Using elementalOS as the base OS for the K8S nodes: you create a registration endpoint in Rancher, put a cloud-config behind it, let Rancher create an ISO image for you, and use this to install VMs (automated, manual, PXE, Pulumi, whatever). With this, you create an "inventory of machines" from which Rancher will then just "grab some" depending on the label selectors you supply when spinning up a new cluster. For these "machines", Rancher creates random names.

In my case, I have VMs called:

  • elemental-s-01
  • elemental-s-02
  • elemental-m-01
  • elemental-m-02
  • ....

where "s" stands for "small" and "m" for "medium". These in the "inventory of machines" appear as follows:

  • m-2d191054-a22a-4fb9-b36e-878eeb0eed82
  • m-b552ec39-b6a4-44a2-a172-6d19c1cc7471
  • ...

...and these are being used as node names for the RKE2 cluster as well.

So the requirement that "the VM name equals the Kubernetes node name" may be another problem for using this CCM with Rancher, as this is definitely only the case if you build your cluster "manually", which (hopefully) nobody does....

@fibbs commented Sep 9, 2024

Thanks for the super quick reply. I don't quite understand some things yet, though...

Hi, yep, the ProviderID is a magic string; it can be set only once, during the join process. It is an immutable value. You can set the ProviderID as a kubelet argument, but with a CCM that is not necessary.

What exactly does proxmox-csi-plugin expect to find in the providerID field? In the code I found something about vmID; is this the Proxmox vmid, like "100" for the first VM and counting up?

I've never used RKE, but based on the documentation, you need to:

https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/kubernetes-clusters-in-rancher-setup/set-up-cloud-providers/amazon#using-the-out-of-tree-aws-cloud-provider

  1. Set the RKE config.

For the control plane:

spec:
  rkeConfig:
    machineSelectorConfig:
      - config:
          disable-cloud-controller: true
          kube-apiserver-arg:
            - cloud-provider=external
          kube-controller-manager-arg:
            - cloud-provider=external
          kubelet-arg:
            - cloud-provider=external

For workers:

spec:
  rkeConfig:
    machineSelectorConfig:
      - config:
          disable-cloud-controller: true
          kubelet-arg:
            - cloud-provider=external

Yeah, that's what I referred to in my post yesterday. BUT you are suggesting cloud-provider=external. Wouldn't that lead to the same error I already saw yesterday:

I0908 22:17:40.181608       1 instances.go:163] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node m-fb9c5542-4f4f-4ffb-bc2d-2b04211926d5?

Did I misinterpret this message? I thought the CCM would skip this node because it has cloud-provider=external set and expected cloud-provider=proxmox instead, but I may have interpreted something wrong here.

  2. The name of the VM has to be the same as the node name. The CCM determines the node ID by hostname.

As mentioned in my other answer from a few minutes ago, this is a problem at least when using any of the Rancher "cloud providers", which should be the default nowadays when managing clusters dynamically. We cannot guarantee that the Kubernetes node name equals the VM name. Any chance to overcome this limitation? I mean, if we have the vmid, wouldn't that be enough to identify the VM?

@sergelogvinov (Owner)

Thanks, now I see what happens.

The value cloud-provider=external is reserved for the kubelet daemon. It is used to indicate that an external cloud provider is responsible for node management.

The providerID is an immutable value that can only be set once, either by the CCM or the kubelet. After it is set, it cannot be modified.

The CCM must determine the virtual machine ID on the Proxmox side for proper operation. This ensures that the node can be correctly identified and managed within the CCM/CSI.

So, we need to somehow pass the vmID to the VM. Do you know how Rancher can help with this?
For example, run kubelet --node-labels=proxmox-vmid=${vmID} or kubelet --provider-id=proxmox://${cluster-region}/${vmID}

@iAdanos (Author) commented Sep 9, 2024

Labels. https://docs.rke2.io/advanced#node-labels-and-taints

As far as I understand, RKE2 sets labels when a node joins the cluster and after that assumes that labels are managed by the cluster admin, so if the Proxmox CCM/CSI sets any label not managed by Rancher, it will just be kept as is.
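For completeness, per the linked doc such a label is set in the RKE2 config; the label name proxmox-vmid here is only illustrative:

# /etc/rancher/rke2/config.yaml (labels are applied when the node registers)
node-label:
  - "proxmox-vmid=100"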

@fibbs commented Sep 9, 2024

The CCM must determine the virtual machine ID on the Proxmox side for proper operation. This ensures that the node can be correctly identified and managed within the CCM/CSI.

That's understandable: since the CCM will have to attach/detach and even move PVs from one VM to another, it needs to know the VMID. Got it, thanks.

So, we need to somehow pass the vmID to the VM. Do you know how Rancher can help with this? For example, run kubelet --node-labels=proxmox-vmid=${vmID} or kubelet --provider-id=proxmox://${cluster-region}/${vmID}

So, do I understand you correctly that the default procedure of the CCM is as follows:

  1. look up the node name of the K8S node
  2. search for that node name in the configured Proxmox clusters
  3. set the providerID of the node using the vmid found in the Proxmox cluster

...and as VMID does not change during VM lifetime, awesome!
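If that is the procedure, the name-to-vmid lookup can also be reproduced by hand as a sanity check, assuming shell access to a PVE node:

# list all cluster VMs with their vmid and name, then look for the node's name
pvesh get /cluster/resources --type vm --output-format json | grep '<node-name>'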

If that's right, I think I have found a solution for that part:

The Elemental installer sets a so-called "MachineName" for each of the VMs that register themselves as available for a K8S deployment. It uses a UUID for these machine names, so the machine names, from which the K8S node names are derived (it's the same name, in fact), have nothing to do with whatever you set as VM names in Proxmox.

BUT: you can use the SMBIOS data of the VM to set "Machine" labels (which will also become K8S node labels later on) and even the "Machine Name". That's what I did: I set the machine name and some labels to reflect the SMBIOS "Product" field, and at VM creation in Proxmox I entered the VM name once in the VM name field and once in the "SMBIOS Options" -> "Product" field. With this I end up with a Kubernetes cluster whose node names equal the VM names in Proxmox, and also some good labels:

> k get nodes --show-labels
NAME                STATUS   ROLES                              AGE   VERSION          LABELS
elemental-test-01   Ready    control-plane,etcd,master,worker   13m   v1.30.4+rke2r1   CPUModel=QEMU-Virtual-CPU-version-2-5,CPUVendor=GenuineIntel,CPUVendorTotalCPUCores=4,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=elemental-test-01,kubernetes.io/os=linux,machineUUID=e39fcd22-be45-4476-bdf0-44be345ba3fc,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node-role.kubernetes.io/worker=true,node.kubernetes.io/instance-type=rke2,plan.upgrade.cattle.io/system-agent-upgrader=073235b34c60b404b6f56e3ca4b54ea713eddb6ec26584f2822cf0f4,proxmoxVMName=elemental-test-01,rke.cattle.io/machine=3f5910ac-881a-4a1a-8651-dfd46a4ce567
elemental-test-02   Ready    control-plane,etcd,master,worker   13m   v1.30.4+rke2r1   CPUModel=QEMU-Virtual-CPU-version-2-5,CPUVendor=GenuineIntel,CPUVendorTotalCPUCores=4,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=elemental-test-02,kubernetes.io/os=linux,machineUUID=b44d7195-e8e9-4c3e-ac58-87d652fe8625,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node-role.kubernetes.io/worker=true,node.kubernetes.io/instance-type=rke2,plan.upgrade.cattle.io/system-agent-upgrader=073235b34c60b404b6f56e3ca4b54ea713eddb6ec26584f2822cf0f4,proxmoxVMName=elemental-test-02,rke.cattle.io/machine=b8fe5cc1-76ef-4b5f-abff-09d0866ee078
elemental-test-03   Ready    control-plane,etcd,master,worker   17m   v1.30.4+rke2r1   CPUModel=QEMU-Virtual-CPU-version-2-5,CPUVendor=GenuineIntel,CPUVendorTotalCPUCores=4,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=elemental-test-03,kubernetes.io/os=linux,machineUUID=4505d678-d7c7-441c-aaf4-34e70086c2ef,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node-role.kubernetes.io/worker=true,node.kubernetes.io/instance-type=rke2,plan.upgrade.cattle.io/system-agent-upgrader=073235b34c60b404b6f56e3ca4b54ea713eddb6ec26584f2822cf0f4,proxmoxVMName=elemental-test-03,rke.cattle.io/machine=5b631e54-3edf-47d5-9639-8997f0ba3936
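For anyone wanting to reproduce that SMBIOS trick from the CLI instead of the Proxmox UI (vmid 100 is a placeholder):

# Caution: --smbios1 replaces the whole string, so carry over the existing uuid
# (visible via: qm config 100 | grep smbios1)
qm set 100 --smbios1 "uuid=<existing-uuid>,product=elemental-test-01"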

As the VMs will in the future surely not be added by hand, as I am doing right now, but via some kind of automation (Ansible, Terraform, Pulumi), I believe it would still be VERY important, depending on the automation stack and its capabilities, to have an alternative way of finding out which VMID a particular VM has inside Proxmox. I would strongly opt for CCM options:

  • vm-name-from-label: if set, the VM name used to search for the vmid in Proxmox is taken from a given label of the K8S node instead of from the K8S node's name
  • vmid-from-label: if set, the VM name would not be taken into account at all; instead, the automation stack would have to obtain the VMID from Proxmox (Terraform and Pulumi can get IDs from resources and use them in subsequent steps) and set it in the SMBIOS VM settings directly

The second, if applicable, seems to be the more reliable method because, as I have experienced by accident, it is possible to create several VMs with the same VM name in Proxmox, which would certainly cause horrible behavior (both options are sketched below).
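Purely as a sketch of those two proposals; neither option exists in the CCM today, and the names and shape are invented for illustration:

# Hypothetical CCM config options:
vm-name-from-label: "proxmoxVMName"   # resolve the vmid via this node label instead of the node name
vmid-from-label: "proxmox-vmid"       # or skip name resolution and read the vmid straight from this label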

Regarding cloud-provider-name and the settings necessary to even get an RKE2 cluster up and running with this CCM, I still have to read the posts above a few more times and do more tests. I will post again on that topic.

@fibbs commented Sep 10, 2024

I have not yet been able to get it to work, even with the "new" cluster where node names equal the VM names in Proxmox.

Actually, I strongly believe the machineSelectorConfig settings as shown in https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/kubernetes-clusters-in-rancher-setup/set-up-cloud-providers/amazon#using-the-out-of-tree-aws-cloud-provider shouldn't be necessary, as machineSelectorConfig is meant for cases where you need different settings on different nodes based on their labels. Otherwise, there is machineGlobalConfig, which applies to all machines.

The fact that my last test actually made the CCM try to do something led me to try it this way again. So, I have set the following:

    machineGlobalConfig:
      cloud-provider-name: external
      disable-cloud-controller: true

This disables RKE2's own cloud controller manager (the one that sets the providerID to rke2://<hostname>). The RKE2 documentation doesn't say much about the cloud-provider-name setting, but it seems to set the cloud-provider=external flag on all components.

With these settings, I killed one of my three nodes and let Rancher deploy a new one on a machine from the machine inventory. This node gets integrated into the K8S cluster and doesn't have a providerID at the beginning:

kubectl get node elemental-test-04 -o yaml |grep providerID

But unfortunately the Proxmox CCM does not really want to set one either:

I0910 08:01:07.940126       1 instances.go:48] instances.InstanceExists() called node: elemental-test-04
I0910 08:01:07.940154       1 instances.go:51] instances.InstanceExists() node elemental-test-04 has providerID: , omitting unmanaged node
I0910 08:01:07.940164       1 instances.go:73] instances.InstanceShutdown() called, node: elemental-test-04
I0910 08:01:07.940174       1 instances.go:76] instances.InstanceShutdown() node elemental-test-04 has foreign providerID: , skipped
I0910 08:01:12.940347       1 instances.go:48] instances.InstanceExists() called node: elemental-test-04
I0910 08:01:12.940385       1 instances.go:51] instances.InstanceExists() node elemental-test-04 has providerID: , omitting unmanaged node
I0910 08:01:12.940394       1 instances.go:73] instances.InstanceShutdown() called, node: elemental-test-04
I0910 08:01:12.940403       1 instances.go:76] instances.InstanceShutdown() node elemental-test-04 has foreign providerID: , skipped

Looks like it erroneously interprets the non-existent providerID as a foreign one? Is that a bug in the CCM?

@sergelogvinov (Owner)

Thank you, @fibbs!

...and as VMID does not change during VM lifetime, awesome!

Yep, changing the vmID is very hard, and it is possible only manually in the Proxmox shell. So that is not our case.

So, do I understand you correctly that the default procedure of the CCM is as follows:

Yep, you are right.

  1. The kubelet joins the cluster and creates a node resource with minimal information (if cloud-provider=external)
  2. The kubelet taints the node with node.cloudprovider.kubernetes.io/uninitialized (if cloud-provider=external)
  3. The CCM has the Kubernetes values nodeName and provided-node-ip. It tries to find the vmID by name in the Proxmox cloud.
  4. If the node is found, the CCM labels the node, sets the ProviderID, and removes the taint
  5. All daemon sets and deployments can now be scheduled on the node.

The kubelet can label the node and set the ProviderID during the join process via the --node-labels and --provider-id arguments.
A shell script (OS side) can add these arguments to the kubelet at startup, so we can pass the vmID through this mechanism.

In my case, I am using Terraform, and it creates/updates the meta-data before starting the VM. The user-data is the same for all worker nodes.
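A rough sketch of such an OS-side startup script, assuming (as discussed above) that the automation stored the vmid in the SMBIOS "Product" field; the region is a placeholder:

#!/bin/sh
# Read the vmid that the automation stored in SMBIOS "Product"
# and hand it to the kubelet as an immutable providerID.
VMID="$(dmidecode -s system-product-name)"
exec kubelet --cloud-provider=external \
  --provider-id="proxmox://my-region/${VMID}"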

But your last comment gave me an idea:
we probably already have the machineUUID in the node resource, so we can find the VM by UUID.
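A manual version of that UUID lookup could look like this; node and vmid are placeholders, and the path follows the Proxmox API:

# SMBIOS uuid of a given VM, as seen by Proxmox
pvesh get /nodes/<pve-node>/qemu/<vmid>/config | grep smbios1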

I plan to make a release soon; it would be great to implement this feature before then.
I'll try to implement it.

@fibbs commented Sep 10, 2024

Your idea sounds great!

Have you seen, we wrote our messages almost at the same time, so I suppose you haven't seen my newer message, just as I did not see yours. Do you have any idea why, in my setup, the CCM still doesn't set anything, now that I have successfully started it with a node that has no providerID and whose K8S node name equals the Proxmox VM name?

I0910 08:01:07.940126       1 instances.go:48] instances.InstanceExists() called node: elemental-test-04
I0910 08:01:07.940154       1 instances.go:51] instances.InstanceExists() node elemental-test-04 has providerID: , omitting unmanaged node
I0910 08:01:07.940164       1 instances.go:73] instances.InstanceShutdown() called, node: elemental-test-04
I0910 08:01:07.940174       1 instances.go:76] instances.InstanceShutdown() node elemental-test-04 has foreign providerID: , skipped
I0910 08:01:12.940347       1 instances.go:48] instances.InstanceExists() called node: elemental-test-04
I0910 08:01:12.940385       1 instances.go:51] instances.InstanceExists() node elemental-test-04 has providerID: , omitting unmanaged node
I0910 08:01:12.940394       1 instances.go:73] instances.InstanceShutdown() called, node: elemental-test-04
I0910 08:01:12.940403       1 instances.go:76] instances.InstanceShutdown() node elemental-test-04 has foreign providerID: , skipped

really scratching my head here

@sergelogvinov (Owner)

The instances.InstanceExists and instances.InstanceShutdown methods are called when a node is unhealthy or shut down; since the providerID is empty, the CCM cannot determine the status of the VM on the Proxmox side ("omitting unmanaged node").

Try removing the node resource and rebooting the worker node. It seems to me the node was not properly initialized.

PS: The error message is not very obvious; I've already fixed it in my branch.

@fibbs commented Sep 15, 2024

I wanted to let you know that I finally got the CCM running in a Rancher-managed RKE2 cluster.

The process was:

  1. Set up RKE2 with the following settings:
    machineGlobalConfig:
      cloud-provider-name: external
      disable-cloud-controller: true
  2. Wait until the "bootstrap node" has been initialized by the rancher-system-agent.

The node will be in "Ready" state, but Rancher will not let other nodes join the cluster because the node has the node.cloudprovider.kubernetes.io/uninitialized taint. I then installed the CCM via Helm, directly on the node. This process will have to be automated in the future, but I believe RKE2 has a section in its cluster.yaml for that.
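One possible way to automate that step: RKE2 deploys any manifests placed in its server manifests directory, so a HelmChart resource could install the CCM at cluster build time. The chart repo below is a placeholder to be taken from the project README:

# /var/lib/rancher/rke2/server/manifests/proxmox-ccm.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: proxmox-cloud-controller-manager
  namespace: kube-system
spec:
  repo: <chart-repo-url>                    # from the project README
  chart: proxmox-cloud-controller-manager
  targetNamespace: kube-system
  # valuesContent would carry the Proxmox cluster config and credentials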

The problem was that I had the fully qualified domain name of the Proxmox cluster in the settings. I am using an existing secret for the CCM to access the Proxmox cluster:

clusters:
  - url: https://pve1.int.inqbeo.org:8006/api2/json
    insecure: true
    token_id: "kubernetes@pve!ccm"
    token_secret: "<secret>"
    region: inqbeo-ca-home

The CCM was not able to reach the Proxmox cluster and timed out after some 5 minutes. The reason: CoreDNS is not yet running inside the cluster while the node still has the node.cloudprovider.kubernetes.io/uninitialized taint.

I now have changed the URL to contain the IP address, and suddenly it works as you described above:

  • CCM starts, finds the K8S node name in the cluster
  • sets the providerID and the labels and removes the taint
  • Rancher notices the node no longer has the taint and continues setting up the cluster
  • other nodes join, and all nodes get the labels

Nevertheless, especially in combination with RKE2 etc., I would love to see a "non-CCM" mode of this software. For just setting some labels (yes, the providerID would have to stay untouched in such a mode, and another node label would have to be used for storing the VMID), the fact that this software is a CCM is actually a bit counter-productive: it needs network communication (at a stage when no DNS is working because no cloud provider is active yet), and it needs a secret (which should be deployed with a GitOps tool like Flux, which is also not working yet at this stage)...

Even if the "no DNS" issue can be worked around, I would have to put the password/token for a user with quite high permissions on the Proxmox API in clear text in my cluster automation stack, in the cluster definition YAML.

What is your opinion on that?

@sergelogvinov (Owner) commented Sep 16, 2024

Thank you for the feedback.

DNS issue:

  1. You can set tolerations for CoreDNS. The node.cloudprovider.kubernetes.io/uninitialized taint is relevant only if you have pod IPAM managed outside the cluster, either by the CNI or a cloud provider. Since Proxmox doesn't have this, it's not applicable (see the toleration snippet after this list).
  2. You can use .Values.useDaemonSet=true in Helm. This deploys the CCM as a DaemonSet with dnsPolicy: ClusterFirstWithHostNet, which attempts to resolve DNS using the host's resolv.conf.
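For option 1, the toleration in question is the standard one for that taint:

tolerations:
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
    effect: NoSchedule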

providerID:

Am I understanding correctly that you want to automatically label the node and remove the Kubernetes node resource when nodes are scaled down or deleted on the Proxmox side, while keeping the providerID unchanged (non-Proxmox style)?

If yes: a CCM is designed to work with a single type of cluster and to manage all the nodes within that cluster. If a node does not belong to the cloud, the CCM must also remove it from the Kubernetes cluster.

I believe in hybrid cluster setups, so I implemented a CCM that supports multiple CCMs within a single cluster. It handles only the nodes that belong to Proxmox VE and skips others by checking the providerID for a specific magic string.
So the providerID is very important for the CCMs.

There are already similar ideas, kubernetes/cloud-provider#63 and kubernetes/kubernetes#124885, but implementation hasn't started yet.
