Calico Networking Fails To Initialize on New or Upgraded AWS-based FedoraCoreOS Next/Testing Typhoon Clusters #1546

Open
fifofonix opened this issue Dec 9, 2024 · 1 comment

fifofonix commented Dec 9, 2024

Description

Calico networking pods fail to initialize on Typhoon clusters in AWS running the Fedora CoreOS next or testing streams.

Steps to Reproduce

Provision clusters by following the website tutorial for AWS/Fedora CoreOS, except:

  • Using a private hosted zone and networking = "calico"

Then:

  • Follow the tutorial with `os_stream = "stable"` (default): cluster members reach Ready status. Success.
  • Follow the tutorial with `os_stream = "testing"`: cluster members never reach Ready status. Calico pods fail to initialize.

Logs captured from the failing cluster:

kubectl logs pod/calico-node-xxxx -c install-cni -n kube-system

2024-12-09 23:02:50.331 [INFO][1] cni-installer/<nil> <nil>: Running as a Kubernetes pod
2024-12-09 23:02:50.346 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/bandwidth"
2024-12-09 23:02:50.347 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/bandwidth
2024-12-09 23:02:50.481 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/calico"
2024-12-09 23:02:50.481 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico
2024-12-09 23:02:50.624 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/calico-ipam"
2024-12-09 23:02:50.624 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico-ipam
2024-12-09 23:02:50.629 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/flannel"
2024-12-09 23:02:50.630 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/flannel
2024-12-09 23:02:50.637 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/host-local"
2024-12-09 23:02:50.637 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/host-local
2024-12-09 23:02:50.648 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/loopback"
2024-12-09 23:02:50.648 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/loopback
2024-12-09 23:02:50.656 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/portmap"
2024-12-09 23:02:50.656 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/portmap
2024-12-09 23:02:50.664 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/tuning"
2024-12-09 23:02:50.664 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/tuning
2024-12-09 23:02:50.664 [INFO][1] cni-installer/<nil> <nil>: Wrote Calico CNI binaries to /host/opt/cni/bin

2024-12-09 23:02:50.779 [INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.27.3
2024-12-09 23:02:50.779 [INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
2024-12-09 23:02:50.780 [WARNING][1] cni-installer/<nil> <nil>: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-12-09 23:03:20.781 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://10.3.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/calico-cni-plugin/token": dial tcp 10.3.0.1:443: i/o timeout
2024-12-09 23:03:20.781 [FATAL][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://10.3.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/calico-cni-plugin/token": dial tcp 10.3.0.1:443: i/o timeout
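
For anyone trying to narrow this down, here is a minimal sketch of two helpers that may be useful. Neither is part of the original report: `filter_cni_failures` and `probe_tcp` are hypothetical names. The address `10.3.0.1:443` is taken from the log above. Running the probe on an affected node is an assumption about where the check is most informative.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the report): read a Calico log on stdin and
# keep only WARNING, ERROR, and FATAL lines.
filter_cni_failures() {
  grep -E '\[(WARNING|ERROR|FATAL)\]' || true
}

# Hypothetical helper: attempt a TCP connection to HOST PORT, giving up after
# 5 seconds. Returns 0 if the connection opens, non-zero otherwise.
probe_tcp() {
  local host=$1 port=$2
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example usage (pod name elided as in the report):
#   kubectl logs pod/calico-node-xxxx -c install-cni -n kube-system | filter_cni_failures
#   probe_tcp 10.3.0.1 443 && echo reachable || echo unreachable
```

If the probe also times out on a testing-stream node but succeeds on a stable-stream node, that would suggest the problem is reachability of the service IP rather than the token request itself.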

Additional information:

Testing 41.20241122.2.0 (kernel 6.11.8-300.fc41.x86_64) is the first testing release to exhibit this issue. The current next release, 41.20241122.1.0, also exhibits it. I suspect the prior next release does too, but confirming that would require further tests.

This issue first occurred in our heavily customized non-prod Typhoon clusters, which are built from a fork of Typhoon v1.30, so I don't think it is related to the most recent version of Kubernetes and/or Typhoon. We have recreated these clusters using "cilium" networking for the time being, and this has resolved the problem.

Expected behavior

Cluster members reach Ready status and Calico pods initialize, as they do on the stable stream.

Environment

  • Platform: aws
  • OS: fedora-coreos
  • Release: v1.31.2
  • Terraform: v1.5.7
  • Plugins:
      - hashicorp/local v2.5.2 (signed by HashiCorp)
      - hashicorp/null v3.2.3 (signed by HashiCorp)
      - hashicorp/random v3.6.3 (signed by HashiCorp)
      - hashicorp/tls v4.0.6 (signed by HashiCorp)
      - poseidon/ct v0.13.0 (self-signed, key ID 8F515AD1602065C8)
      - hashicorp/aws v4.61.0 (signed by HashiCorp)

fifofonix (Author) commented

This sounds similar: projectcalico/calico#8368
