
Convert K8s-prod nodes k8s-node-7 and k8s-node-8 from VMs to bare-metal #47

Open
nickatnceas opened this issue Aug 15, 2024 · 5 comments

nickatnceas (Contributor) commented Aug 15, 2024

K8s-prod nodes k8s-node-7 and k8s-node-8 currently run as VMs on physical hosts host-ucsb-24 and host-ucsb-25. Deleting the node VMs and redeploying each node directly on its host will let us use the memory previously reserved for the host, and should provide a small performance boost (roughly 5%?) from removing the virtualization layer.

Since these nodes do not benefit from live migration (i.e., they can be drained at any time without major service interruptions), and since the physical hosts will not be sharing resources with any other VMs, there is no benefit to running them as VMs in this case.
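
For reference, draining and removing one of these node VMs before redeploying on bare metal would look roughly like this (a sketch using standard kubectl commands; node name taken from this issue):

# Evict workloads from the VM node (DaemonSet pods stay; emptyDir data is discarded)
kubectl drain k8s-node-7 --ignore-daemonsets --delete-emptydir-data

# After the VM is decommissioned, remove the node object from the cluster
kubectl delete node k8s-node-7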

Dev nodes will move from hosts 24 and 25 to hosts 9 and 10, and will increase from 16 to 32 vCPUs; the resulting capacity can be verified as sketched after the lists below.

Current:

  • host-ucsb-24
    • k8s-node-7 VM
      • 256 vCPUs
      • 352 GB memory
    • k8s-dev-node-4 VM
      • 16 vCPUs
      • 128 GB memory
  • host-ucsb-25
    • k8s-node-8 VM
      • 256 vCPUs
      • 352 GB memory
    • k8s-dev-node-5 VM
      • 16 vCPUs
      • 128 GB memory

Planned:

  • host-ucsb-24 (bare-metal k8s-node-7)
    • 256 vCPUs
    • 512 GB memory
  • host-ucsb-25 (bare-metal k8s-node-8)
    • 256 vCPUs
    • 512 GB memory
  • host-ucsb-9
    • k8s-dev-node-4 VM
      • 32 vCPUs
      • 128 GB memory
  • host-ucsb-10
    • k8s-dev-node-5 VM
      • 32 vCPUs
      • 128 GB memory
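
Once the migration is done, the new capacity can be confirmed from the cluster side, e.g. (a sketch using standard kubectl output formatting; column names are arbitrary):

kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory
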
nickatnceas added the enhancement label Aug 15, 2024
nickatnceas self-assigned this Aug 15, 2024
nickatnceas (Contributor, Author) commented:

I attempted deploying the k8s software onto host-ucsb-24 to run a bare-metal node, but hit some issues:

  • K8s 1.23.4 packages are no longer being distributed by the K8s project
  • The oldest packages still being distributed, 1.24.17, do not work with our cluster: the node software successfully connects to the controller nodes, but never starts any containers

Instead of troubleshooting this old version, I'm going to move back to using the VMs for now. Once we have successfully upgraded K8s (#35), we can try again.
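
For reference, installing a pinned 1.24.17 from the community package repo looks roughly like this (a sketch assuming a Debian/Ubuntu host; pkgs.k8s.io only carries v1.24 and newer, which is why 1.23.4 is unavailable, and the -* wildcard covers whatever the exact package revision is):

# Add the community-owned repo for the 1.24 minor release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.24/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.24/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update

# Pin the 1.24.17 patch release and hold it so unattended upgrades don't move it
sudo apt-get install -y kubelet='1.24.17-*' kubeadm='1.24.17-*' kubectl='1.24.17-*'
sudo apt-mark hold kubelet kubeadm kubectl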

nickatnceas (Contributor, Author) commented:

Quick view of the issue:

outin@bluey:~/.kube$ kubectl get pods -A -o wide | grep host-ucsb-24
ceph-csi-cephfs   ceph-csi-cephfs-csi-cephfsplugin-jc89x         0/3     CrashLoopBackOff   18 (4m39s ago)    16m      128.111.85.154    host-ucsb-24    <none>           <none>
ceph-csi-rbd      ceph-csi-rbd-csi-cephrbdplugin-q9wn8           3/3     Running            18 (3m32s ago)    16m      128.111.85.154    host-ucsb-24    <none>           <none>
kube-system       calico-node-hdpx4                              0/1     CrashLoopBackOff   6 (112s ago)      17m      128.111.85.154    host-ucsb-24    <none>           <none>
kube-system       kube-proxy-mqdd8                               0/1     CrashLoopBackOff   5 (113s ago)      17m      128.111.85.154    host-ucsb-24    <none>           <none>
velero            node-agent-dwwp2                               0/1     CrashLoopBackOff   6 (2m26s ago)     16m      192.168.99.136    host-ucsb-24    <none>           <none>
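
The kubelet and container-runtime logs on the host itself are probably the most direct place to see why containers never start (a sketch assuming a systemd-managed kubelet and containerd; unit names may differ on this host):

# On host-ucsb-24: recent kubelet errors
sudo journalctl -u kubelet --no-pager --since "1 hour ago" | grep -i error | tail -n 50

# Container runtime side (assuming containerd)
sudo journalctl -u containerd --no-pager --since "1 hour ago" | tail -n 50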

Here is the k8s-node-7 VM after about the same amount of startup time:

outin@bluey:~/.kube$ kubectl get pods -A -o wide | grep k8s-node-7
ceph-csi-cephfs   ceph-csi-cephfs-csi-cephfsplugin-c78rc         3/3     Running      3                 16m      128.111.85.146    k8s-node-7      <none>           <none>
ceph-csi-rbd      ceph-csi-rbd-csi-cephrbdplugin-jr8c2           3/3     Running      3                 16m      128.111.85.146    k8s-node-7      <none>           <none>
kube-system       calico-node-pbchl                              1/1     Running      1                 16m      128.111.85.146    k8s-node-7      <none>           <none>
kube-system       kube-proxy-6kbc5                               1/1     Running      1                 16m      128.111.85.146    k8s-node-7      <none>           <none>
velero            node-agent-wqwn6                               1/1     Running      3 (11m ago)       16m      192.168.197.192   k8s-node-7      <none>           <none>

mbjones (Member) commented Oct 18, 2024

For the pods in CrashLoopBackOff, you should get some helpful troubleshooting info by describing the pod status (e.g., kubectl describe -n kube-system pod kube-proxy-mqdd8).
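
A minimal version of that, plus pulling the logs of the previous (crashed) container instance (pod name taken from the output above):

kubectl describe -n kube-system pod kube-proxy-mqdd8
kubectl logs -n kube-system kube-proxy-mqdd8 --previous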

nickatnceas (Contributor, Author) commented:

> For the pods in CrashLoopBackOff, you should get some helpful troubleshooting info by describing the pod status (e.g., kubectl describe -n kube-system pod kube-proxy-mqdd8).

I don't think this is worth troubleshooting, mainly because these two versions are so old (1.23 and 1.24); the time would be better spent upgrading to the latest version and troubleshooting those issues (#35), then fixing any issues that arise from this migration.

mbjones (Member) commented Oct 18, 2024

Yep, totally agree on the version/upgrade stuff. Sorry for the diversion.
