ClusterAPI Machine stuck in "Pending" indefinitely #1222

Open
lieberlois opened this issue Oct 16, 2023 · 5 comments


lieberlois commented Oct 16, 2023

After a full reinstallation of Sidero Metal with CAPI etc., I have the problem that my machine doesn't boot into Talos.

The first PXE boot worked perfectly, discovery etc. worked, and the BMC entry is present in the server (and works, tested with ipmitool). However, now that I have applied the cluster manifests, my machine doesn't boot (neither when I boot it manually nor via IPMI). It seems like Sidero doesn't use the (available) server (see output below), since it's not allocated.

Any ideas?

$ kubectl get machine
NAME                 CLUSTER     NODENAME   PROVIDERID   PHASE     AGE     VERSION
cluster-0-cp-t6zhc   cluster-0                           Pending   6m38s   v1.28.1

$ kubectl get server
NAME                                   HOSTNAME   ACCEPTED   CORDONED   ALLOCATED   CLEAN   POWER   AGE
00000000-0000-0000-0000-000000000000   (none)     true                              true    off     21m

$ kubectl get serverclass
NAME   AVAILABLE                                  IN USE   AGE
any    ["00000000-0000-0000-0000-000000000000"]   []       30m

$ kubectl get serverbindings
No resources found

I also found this log line:

2023-10-16T09:41:41Z    INFO    controllers.MetalMachine.machine=cluster-0-cp-t6zhc.cluster=cluster-0   Bootstrap secret is not available yet   {"metalmachine": {"name":"cluster-0-cp-95bmt","namespace":"default"}}

Member

smira commented Oct 16, 2023

There's no way we can guess this. As always in CAPI, it makes sense to inspect all states of all resources.

clusterctl describe cluster (or something like that) provides a nice overview.

@lieberlois
Author

cabpt-controller-manager-5687d76d6f-55xg6 manager 2023-10-16T10:36:15Z  INFO    Starting Controller     {"controller": "talosconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "TalosConfig"}
cabpt-controller-manager-5687d76d6f-55xg6 manager W1016 10:36:15.626507       1 reflector.go:533] /.cache/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: failed to list *v1beta1.MachinePool: machinepools.cluster.x-k8s.io is forbidden: User "system:serviceaccount:cabpt-system:default" cannot list resource "machinepools" in API group "cluster.x-k8s.io" at the cluster scope

After looking through the logs, this looks like it might be an RBAC issue?

clusterctl describe cluster cluster-0 outputs the following:

NAME                                                           READY  SEVERITY  REASON                          SINCE  MESSAGE                                                                                                  
Cluster/cluster-0                                              False  Error     BootstrapTemplateCloningFailed  16m    Failed to create bootstrap configuration: Internal error occurred: failed calling webhook "vtaloscon ...  
├─ClusterInfrastructure - MetalCluster/cluster-0                                                                                                                                                                                 
└─ControlPlane - TalosControlPlane/cluster-0-cp                False  Error     BootstrapTemplateCloningFailed  16m    Failed to create bootstrap configuration: Internal error occurred: failed calling webhook "vtaloscon ...  
  └─Machine/cluster-0-cp-szn6c                                 False  Info      WaitingForInfrastructure        15m    0 of 2 completed                                                                                          
    ├─BootstrapConfig - TalosConfig/cluster-0-cp-f5jj9                                                                                                                                                                           
    └─MachineInfrastructure - MetalMachine/cluster-0-cp-r8df7 

Member

smira commented Oct 16, 2023

Looks like the failure is in calling a webhook; the MachinePools error is probably a different issue.

Either way, since this works in the Sidero integration tests, something is up with your setup (?)

@lieberlois
Author

Hmm, it's a fresh setup, and what I believe to be the same setup worked with 0.5.8 and 0.6.0 (with the firewall configured to block ports 67 and 68) 🤔 I'm running clusterctl version 1.5.2 and Kubernetes version 1.27.6.

I was just looking at the cabpt-controller-manager because it appears to crash every few minutes:

E1016 11:00:17.100532       1 reflector.go:148] /.cache/mod/k8s.io/[email protected]/tools/cache/reflector.go:231: Failed to watch *v1beta1.MachinePool: failed to list *v1beta1.MachinePool: machinepools.cluster.x-k8s.io is forbidden: User "system:serviceaccount:cabpt-system:default" cannot list resource "machinepools" in API group "cluster.x-k8s.io" at the cluster scope
2023-10-16T11:01:08Z    ERROR   Could not wait for Cache to sync        {"controller": "talosconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "TalosConfig", "error": "failed to wait for talosconfig caches to sync: timed out waiting for cache to be synced for Kind *v1alpha3.TalosConfig"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1
        /.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:207
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
        /.cache/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:233
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
        /.cache/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:219

Author

lieberlois commented Oct 16, 2023

Info: I added this rule to the cabpt-manager-role ClusterRole, and now it appears to work:

rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'

To me it seems like an RBAC issue, though it's not yet clear why. One possibility is that ClusterAPI changed the apiGroups for some resources, but I guess that would be noted as a breaking change.

Just from looking at kubectl api-resources, though, MachinePools is in cluster.x-k8s.io/v1beta1, but the ClusterRole says exp.cluster.x-k8s.io 🤔

Note: I also have the ClusterAPI Provider for Azure installed, maybe there is a conflict in apiGroups? 🤔
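
The wildcard rule above grants the controller every verb on every resource. As a narrower sketch, assuming the only missing permission is the one named in the log (listing machinepools in the cluster.x-k8s.io API group at cluster scope), an entry like the following could be appended to the rules of the cabpt-manager-role ClusterRole instead. The get and watch verbs are an assumption, added because the reflector in the log both lists and watches. This has not been checked against the CABPT release manifests.

```yaml
# Sketch only: a narrower alternative to the '*' rule, appended under "rules:"
# of the cabpt-manager-role ClusterRole.
# Assumption: read-only access to MachinePools is sufficient. The log only
# shows a denied "list"; "get" and "watch" are added for the informer.
- apiGroups:
  - cluster.x-k8s.io   # group reported in the "forbidden" error, not exp.cluster.x-k8s.io
  resources:
  - machinepools
  verbs:
  - get
  - list
  - watch
```

Whether the service account from the error message can then list the resource can be checked with kubectl auth can-i list machinepools.cluster.x-k8s.io --all-namespaces --as=system:serviceaccount:cabpt-system:default, which should print yes once the rule is in place.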
