Cilium Hubble Server not initialized after restoring etcd snapshot #5122

Closed
lethehoa opened this issue Dec 9, 2023 · 7 comments

lethehoa commented Dec 9, 2023

Environmental Info:
RKE2 Version:
rke2 version v1.26.9+rke2r1

Node(s) CPU architecture, OS, and Version:
5.4.0-167-generic #184-Ubuntu SMP Tue Oct 31 09:21:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
1 master - 8 workers

Describe the bug:
The Cilium Hubble Server does not initialize after I restore an etcd snapshot.
[screenshot]

Steps To Reproduce:

  • I had a problem with the master node, so I restored etcd from an etcd snapshot. I ran the following commands:
    systemctl stop rke2-server
    rke2 server --cluster-reset
    systemctl start rke2-server

  • After that, I checked the cilium status and encountered these warnings (see the sketch after this list):
    [screenshot]
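
    For reference, the status check behind that screenshot can be reproduced from the command line; this is only a sketch (any Cilium agent pod works, and the standalone cilium CLI is optional):

    # run the agent's own status command through kubectl
    kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium status
    # or, with the cilium CLI installed on a workstation
    cilium status --wait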

Expected behavior:
Everything works as expected, especially the CNI.

Actual behavior:
The Cilium Hubble Server does not initialize.

Additional context / logs:

  • I installed the cluster with the Cilium CNI and then overrode some of its configuration using a HelmChartConfig (an example of that pattern is sketched below).
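
For context, overrides for RKE2's bundled Cilium chart are normally expressed as a HelmChartConfig manifest placed under /var/lib/rancher/rke2/server/manifests/. The values below (enabling Hubble and its relay) are only an assumed illustration, since the actual overrides are not part of this report:

# hypothetical override for the packaged rke2-cilium chart
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    hubble:
      enabled: true
      relay:
        enabled: true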

Checking the Hubble status inside the Cilium pod:

root@worker-101-2:/home/cilium# hubble status
failed to connect to 'unix:///var/run/cilium/hubble.sock': connection error: desc = "transport: error while dialing: dial unix /var/run/cilium/hubble.sock: connect: no such file or directory"

Cilium pod's log:

config Running
config level=info msg=Invoked duration=8.526806ms function="cmd.glob..func36 (build-config.go:32)" subsys=hive
config level=info msg=Starting subsys=hive
config level=info msg="Establishing connection to apiserver" host="https://10.171.0.1:443" subsys=k8s-client
apply-sysctl-overwrites sysctl config up-to-date, nothing to do
config level=info msg="Connected to apiserver" subsys=k8s-client
config level=info msg="Start hook executed" duration=21.159579ms function="client.(*compositeClientset).onStart" subsys=hive
config level=info msg="Reading configuration from config-map:kube-system/cilium-config" configSource="config-map:kube-system/cilium-config" subsys=option-resolver
Stream closed EOF for kube-system/cilium-hx8mg (apply-sysctl-overwrites)
config level=info msg="Got 111 config pairs from source" configSource="config-map:kube-system/cilium-config" subsys=option-resolver
config level=info msg="Reading configuration from cilium-node-config:kube-system/" configSource="cilium-node-config:kube-system/" subsys=option-resolver
config level=info msg="Got 0 config pairs from source" configSource="cilium-node-config:kube-system/" subsys=option-resolver
config level=info msg="Start hook executed" duration=55.605325ms function="cmd.(*buildConfig).onStart" subsys=hive
config level=info msg=Stopping subsys=hive
config level=info msg="Stop hook executed" duration="186.079┬╡s" function="client.(*compositeClientset).onStop" subsys=hive
Stream closed EOF for kube-system/cilium-hx8mg (config)
mount-cgroup level=info msg="Mounted cgroupv2 filesystem at /run/cilium/cgroupv2" subsys=cgroups
Stream closed EOF for kube-system/cilium-hx8mg (mount-cgroup)
Stream closed EOF for kube-system/cilium-hx8mg (clean-cilium-state)
mount-bpf-fs none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
Stream closed EOF for kube-system/cilium-hx8mg (mount-bpf-fs)
install-cni-binaries Installing cilium-cni to /host/opt/cni/bin/ ...
install-cni-binaries wrote /host/opt/cni/bin/cilium-cni
Stream closed EOF for kube-system/cilium-hx8mg (install-cni-binaries)
install-portmap-cni-plugin bandwidth is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin bridge is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin dhcp is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin copied /opt/cni/bin/dummy to /host/opt/cni/bin correctly
install-portmap-cni-plugin firewall is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin flannel is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin host-device is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin host-local is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin ipvlan is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin loopback is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin macvlan is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin copied /opt/cni/bin/portmap to /host/opt/cni/bin correctly
install-portmap-cni-plugin ptp is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin sbr is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin static is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin tuning is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin vlan is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin vrf is in SKIP_CNI_BINARIES, skipping
Stream closed EOF for kube-system/cilium-hx8mg (install-portmap-cni-plugin)
cilium-agent level=info msg="Memory available for map entries (0.003% of 18861256704B): 47153141B" subsys=config
cilium-agent level=info msg="option bpf-ct-global-tcp-max set by dynamic sizing to 165449" subsys=config
cilium-agent level=info msg="option bpf-ct-global-any-max set by dynamic sizing to 82724" subsys=config
cilium-agent level=info msg="option bpf-nat-global-max set by dynamic sizing to 165449" subsys=config
cilium-agent level=info msg="option bpf-neigh-global-max set by dynamic sizing to 165449" subsys=config
cilium-agent level=info msg="option bpf-sock-rev-map-max set by dynamic sizing to 82724" subsys=config
cilium-agent level=info msg="  --agent-health-port='9879'" subsys=daemon
cilium-agent level=info msg="  --agent-labels=''" subsys=daemon
cilium-agent level=info msg="  --agent-liveness-update-interval='1s'" subsys=daemon
cilium-agent level=info msg="  --agent-not-ready-taint-key='node.cilium.io/agent-not-ready'" subsys=daemon
cilium-agent level=info msg="  --allocator-list-timeout='3m0s'" subsys=daemon
cilium-agent level=info msg="  --allow-icmp-frag-needed='true'" subsys=daemon
cilium-agent level=info msg="  --allow-localhost='auto'" subsys=daemon
cilium-agent level=info msg="  --annotate-k8s-node='false'" subsys=daemon
cilium-agent level=info msg="  --api-rate-limit=''" subsys=daemon
cilium-agent level=info msg="  --arping-refresh-period='30s'" subsys=daemon
cilium-agent level=info msg="  --auto-create-cilium-node-resource='true'" subsys=daemon
cilium-agent level=info msg="  --auto-direct-node-routes='true'" subsys=daemon
cilium-agent level=info msg="  --bgp-announce-lb-ip='false'" subsys=daemon
cilium-agent level=info msg="  --bgp-announce-pod-cidr='false'" subsys=daemon
cilium-agent level=info msg="  --bgp-config-path='/var/lib/cilium/bgp/config.yaml'" subsys=daemon
cilium-agent level=info msg="  --bpf-auth-map-max='524288'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-global-any-max='262144'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-global-tcp-max='524288'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-regular-any='1m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-regular-tcp='6h0m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-regular-tcp-fin='10s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-regular-tcp-syn='1m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-service-any='1m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-service-tcp='6h0m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-service-tcp-grace='1m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-filter-priority='1'" subsys=daemon
cilium-agent level=info msg="  --bpf-fragments-map-max='8192'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-acceleration='disabled'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-affinity-map-max='0'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-algorithm='random'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-dev-ip-addr-inherit=''" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-dsr-dispatch='opt'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-dsr-l4-xlate='frontend'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-external-clusterip='false'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-maglev-hash-seed='JLfvgnHc2kaSUFaI'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-maglev-map-max='0'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-maglev-table-size='16381'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-map-max='65536'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-mode='snat'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-rev-nat-map-max='0'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-rss-ipv4-src-cidr=''" subsys=daemon
Stream closed EOF for kube-system/cilium-hx8mg (cilium-agent)
brandond (Member) commented Dec 9, 2023

I had a problem with the master node, so I restored etcd from an etcd snapshot. I ran the following commands:
systemctl stop rke2-server
rke2 server --cluster-reset
systemctl start rke2-server

That's not a restore from snapshot; all you did was reset the etcd cluster membership to a single node. Did you want to actually restore from a snapshot?

lethehoa (Author) commented:

I had a problem with the master node, so I restored etcd from an etcd snapshot. I ran the following commands:
systemctl stop rke2-server
rke2 server --cluster-reset
systemctl start rke2-server

That's not a restore from snapshot; all you did was reset the etcd cluster membership to a single node. Did you want to actually restore from a snapshot?

rke2 server
--cluster-reset
--cluster-reset-restore-path=

I also ran the command above; is that the right way to restore the cluster from an etcd snapshot?

brandond (Member) commented Dec 11, 2023

Yes, restoring from a snapshot requires passing the path to the snapshot to restore, or the filename if using s3. Once it finishes, you should get additional instructions on what to do on the other servers to rejoin them.
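
For readers following along, a complete restore invocation looks roughly like the sketch below; the snapshot path is an assumed example using RKE2's default snapshot directory, not a value taken from this issue:

systemctl stop rke2-server
# restore from a local snapshot file; the filename here is hypothetical
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>
# start the service again and follow the instructions printed for the other server nodes
systemctl start rke2-server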

lethehoa (Author) commented:

Yes, restoring from a snapshot requires passing the path to the snapshot to restore, or the filename if using s3. Once it finishes, you should get additional instructions on what to do on the other servers to rejoin them.

Thanks for your answer. I followed these steps but still got the error related to Cilium Hubble. The last option would be to reinstall the whole cluster, right?

brandond (Member) commented:

That seems like overkill... have you looked at logs for all the containers in that pod? The error indicates that there is another prior failure that you need to resolve. Something else is failing to create that socket file.
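
A quick sketch of that check, reusing the pod name from the logs above and standard kubectl flags:

# list the init and regular containers in the affected cilium pod
kubectl -n kube-system get pod cilium-hx8mg \
  -o jsonpath='{.spec.initContainers[*].name} {.spec.containers[*].name}{"\n"}'
# then read each container's logs, e.g. the config init container and the agent
kubectl -n kube-system logs cilium-hx8mg -c config
kubectl -n kube-system logs cilium-hx8mg -c cilium-agent --previous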

lethehoa (Author) commented:

That seems like overkill... have you looked at logs for all the containers in that pod? The error indicates that there is another prior failure that you need to resolve. Something else is failing to create that socket file.

Thanks for your response. I reinstalled Cilium using Helm, and it worked.
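
For anyone else in the same spot, a rough sketch of such a reinstall with the upstream chart is below; the release name and Hubble values are assumptions and should mirror whatever the previous HelmChartConfig set (on RKE2 the bundled rke2-cilium chart is the usual management path):

helm repo add cilium https://helm.cilium.io/
helm repo update
# assumed values; adjust to match the earlier overrides
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true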

kode15333 commented:

I encountered a similar issue after rejoining a worker node to the Kubernetes cluster.

error="failed to apply option: listen tcp :4244: bind: address already in use" subsys=hubble

sudo netstat -tulnp | grep :4244
tcp6       0      0 :::4244                 :::*                    LISTEN      ****/cilium-agent

sudo fuser -k 4244/tcp

kubectl rollout restart ds/cilium -n kube-system

After that, my cilium status is okay.
Leaving it here in case it helps someone else.
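
To double-check a fix like this, Hubble can be queried again from inside an agent pod, mirroring the failing check earlier in the thread (selecting a pod through the DaemonSet is just one convenient option):

kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium status --brief
kubectl -n kube-system exec ds/cilium -c cilium-agent -- hubble status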
