Cilium Hubble Server not initialized after restoring etcd snapshot #5122

Closed
lethehoa opened this issue Dec 9, 2023 · 7 comments

lethehoa commented Dec 9, 2023

Environmental Info:
RKE2 Version:
rke2 version v1.26.9+rke2r1

Node(s) CPU architecture, OS, and Version:
5.4.0-167-generic #184-Ubuntu SMP Tue Oct 31 09:21:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
1 master - 8 workers

Describe the bug:
The Cilium Hubble Server does not initialize after I restore an etcd snapshot.
[screenshot]

Steps To Reproduce:

  • I had a problem with the master node, so I restored etcd from an etcd snapshot. I ran the following commands:
    systemctl stop rke2-server
    rke2 server --cluster-reset
    systemctl start rke2-server

  • After that, I checked the cilium status and encountered these warnings (see the sketch after this list):
    [screenshot]
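
    For reference, the status check behind that screenshot can be reproduced from the command line; this is only a sketch (any Cilium agent pod works, and the standalone cilium CLI is optional):

    # run the agent's own status command through kubectl
    kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium status
    # or, with the cilium CLI installed on a workstation
    cilium status --wait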

Expected behavior:
Everything works as expected, especially the CNI.

Actual behavior:
The Cilium Hubble Server does not initialize.

Additional context / logs:

  • I installed the cluster with the Cilium CNI and then overrode some of its configuration using a HelmChartConfig (an example of that pattern is sketched below).
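
For context, overrides for RKE2's bundled Cilium chart are normally expressed as a HelmChartConfig manifest placed under /var/lib/rancher/rke2/server/manifests/. The values below (enabling Hubble and its relay) are only an assumed illustration, since the actual overrides are not part of this report:

# hypothetical override for the packaged rke2-cilium chart
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    hubble:
      enabled: true
      relay:
        enabled: true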

Checking the Hubble status inside the Cilium pod:

root@worker-101-2:/home/cilium# hubble status
failed to connect to 'unix:///var/run/cilium/hubble.sock': connection error: desc = "transport: error while dialing: dial unix /var/run/cilium/hubble.sock: connect: no such file or directory"

Cilium pod's log:

config Running
config level=info msg=Invoked duration=8.526806ms function="cmd.glob..func36 (build-config.go:32)" subsys=hive
config level=info msg=Starting subsys=hive
config level=info msg="Establishing connection to apiserver" host="https://10.171.0.1:443" subsys=k8s-client
apply-sysctl-overwrites sysctl config up-to-date, nothing to do
config level=info msg="Connected to apiserver" subsys=k8s-client
config level=info msg="Start hook executed" duration=21.159579ms function="client.(*compositeClientset).onStart" subsys=hive
config level=info msg="Reading configuration from config-map:kube-system/cilium-config" configSource="config-map:kube-system/cilium-config" subsys=option-resolver
Stream closed EOF for kube-system/cilium-hx8mg (apply-sysctl-overwrites)
config level=info msg="Got 111 config pairs from source" configSource="config-map:kube-system/cilium-config" subsys=option-resolver
config level=info msg="Reading configuration from cilium-node-config:kube-system/" configSource="cilium-node-config:kube-system/" subsys=option-resolver
config level=info msg="Got 0 config pairs from source" configSource="cilium-node-config:kube-system/" subsys=option-resolver
config level=info msg="Start hook executed" duration=55.605325ms function="cmd.(*buildConfig).onStart" subsys=hive
config level=info msg=Stopping subsys=hive
config level=info msg="Stop hook executed" duration="186.079┬╡s" function="client.(*compositeClientset).onStop" subsys=hive
Stream closed EOF for kube-system/cilium-hx8mg (config)
mount-cgroup level=info msg="Mounted cgroupv2 filesystem at /run/cilium/cgroupv2" subsys=cgroups
Stream closed EOF for kube-system/cilium-hx8mg (mount-cgroup)
Stream closed EOF for kube-system/cilium-hx8mg (clean-cilium-state)
mount-bpf-fs none on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
Stream closed EOF for kube-system/cilium-hx8mg (mount-bpf-fs)
install-cni-binaries Installing cilium-cni to /host/opt/cni/bin/ ...
install-cni-binaries wrote /host/opt/cni/bin/cilium-cni
Stream closed EOF for kube-system/cilium-hx8mg (install-cni-binaries)
install-portmap-cni-plugin bandwidth is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin bridge is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin dhcp is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin copied /opt/cni/bin/dummy to /host/opt/cni/bin correctly
install-portmap-cni-plugin firewall is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin flannel is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin host-device is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin host-local is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin ipvlan is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin loopback is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin macvlan is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin copied /opt/cni/bin/portmap to /host/opt/cni/bin correctly
install-portmap-cni-plugin ptp is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin sbr is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin static is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin tuning is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin vlan is in SKIP_CNI_BINARIES, skipping
install-portmap-cni-plugin vrf is in SKIP_CNI_BINARIES, skipping
Stream closed EOF for kube-system/cilium-hx8mg (install-portmap-cni-plugin)
cilium-agent level=info msg="Memory available for map entries (0.003% of 18861256704B): 47153141B" subsys=config
cilium-agent level=info msg="option bpf-ct-global-tcp-max set by dynamic sizing to 165449" subsys=config
cilium-agent level=info msg="option bpf-ct-global-any-max set by dynamic sizing to 82724" subsys=config
cilium-agent level=info msg="option bpf-nat-global-max set by dynamic sizing to 165449" subsys=config
cilium-agent level=info msg="option bpf-neigh-global-max set by dynamic sizing to 165449" subsys=config
cilium-agent level=info msg="option bpf-sock-rev-map-max set by dynamic sizing to 82724" subsys=config
cilium-agent level=info msg="  --agent-health-port='9879'" subsys=daemon
cilium-agent level=info msg="  --agent-labels=''" subsys=daemon
cilium-agent level=info msg="  --agent-liveness-update-interval='1s'" subsys=daemon
cilium-agent level=info msg="  --agent-not-ready-taint-key='node.cilium.io/agent-not-ready'" subsys=daemon
cilium-agent level=info msg="  --allocator-list-timeout='3m0s'" subsys=daemon
cilium-agent level=info msg="  --allow-icmp-frag-needed='true'" subsys=daemon
cilium-agent level=info msg="  --allow-localhost='auto'" subsys=daemon
cilium-agent level=info msg="  --annotate-k8s-node='false'" subsys=daemon
cilium-agent level=info msg="  --api-rate-limit=''" subsys=daemon
cilium-agent level=info msg="  --arping-refresh-period='30s'" subsys=daemon
cilium-agent level=info msg="  --auto-create-cilium-node-resource='true'" subsys=daemon
cilium-agent level=info msg="  --auto-direct-node-routes='true'" subsys=daemon
cilium-agent level=info msg="  --bgp-announce-lb-ip='false'" subsys=daemon
cilium-agent level=info msg="  --bgp-announce-pod-cidr='false'" subsys=daemon
cilium-agent level=info msg="  --bgp-config-path='/var/lib/cilium/bgp/config.yaml'" subsys=daemon
cilium-agent level=info msg="  --bpf-auth-map-max='524288'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-global-any-max='262144'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-global-tcp-max='524288'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-regular-any='1m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-regular-tcp='6h0m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-regular-tcp-fin='10s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-regular-tcp-syn='1m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-service-any='1m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-service-tcp='6h0m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-ct-timeout-service-tcp-grace='1m0s'" subsys=daemon
cilium-agent level=info msg="  --bpf-filter-priority='1'" subsys=daemon
cilium-agent level=info msg="  --bpf-fragments-map-max='8192'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-acceleration='disabled'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-affinity-map-max='0'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-algorithm='random'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-dev-ip-addr-inherit=''" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-dsr-dispatch='opt'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-dsr-l4-xlate='frontend'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-external-clusterip='false'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-maglev-hash-seed='JLfvgnHc2kaSUFaI'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-maglev-map-max='0'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-maglev-table-size='16381'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-map-max='65536'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-mode='snat'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-rev-nat-map-max='0'" subsys=daemon
cilium-agent level=info msg="  --bpf-lb-rss-ipv4-src-cidr=''" subsys=daemon
Stream closed EOF for kube-system/cilium-hx8mg (cilium-agent)
brandond (Member) commented Dec 9, 2023

I had a problem with the master node, so I restored etcd from an etcd snapshot. I ran the following commands:
systemctl stop rke2-server
rke2 server --cluster-reset
systemctl start rke2-server

That's not a restore from snapshot; all you did was reset the etcd cluster membership to a single node. Did you want to actually restore from a snapshot?

lethehoa (Author) commented:

I had a problem with the master node, so I restored etcd from an etcd snapshot. I ran the following commands:
systemctl stop rke2-server
rke2 server --cluster-reset
systemctl start rke2-server

That's not a restore from snapshot; all you did was reset the etcd cluster membership to a single node. Did you want to actually restore from a snapshot?

rke2 server
--cluster-reset
--cluster-reset-restore-path=

I also ran the command above; is that the right way to restore the cluster from an etcd snapshot?

brandond (Member) commented Dec 11, 2023

Yes, restoring from a snapshot requires passing the path to the snapshot to restore, or the filename if using s3. Once it finishes, you should get additional instructions on what to do on the other servers to rejoin them.
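
For readers following along, a complete restore invocation looks roughly like the sketch below; the snapshot path is an assumed example using RKE2's default snapshot directory, not a value taken from this issue:

systemctl stop rke2-server
# restore from a local snapshot file; the filename here is hypothetical
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>
# start the service again and follow the instructions printed for the other server nodes
systemctl start rke2-server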

lethehoa (Author) commented:

Yes, restoring from a snapshot requires passing the path to the snapshot to restore, or the filename if using s3. Once it finishes, you should get additional instructions on what to do on the other servers to rejoin them.

Thanks for your answer. I followed these steps but still got the error related to Cilium Hubble. The last option would be to reinstall the whole cluster, right?

brandond (Member) commented:

That seems like overkill... have you looked at logs for all the containers in that pod? The error indicates that there is another prior failure that you need to resolve. Something else is failing to create that socket file.
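
A quick sketch of that check, reusing the pod name from the logs above and standard kubectl flags:

# list the init and regular containers in the affected cilium pod
kubectl -n kube-system get pod cilium-hx8mg \
  -o jsonpath='{.spec.initContainers[*].name} {.spec.containers[*].name}{"\n"}'
# then read each container's logs, e.g. the config init container and the agent
kubectl -n kube-system logs cilium-hx8mg -c config
kubectl -n kube-system logs cilium-hx8mg -c cilium-agent --previous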

lethehoa (Author) commented:

That seems like overkill... have you looked at logs for all the containers in that pod? The error indicates that there is another prior failure that you need to resolve. Something else is failing to create that socket file.

Thanks for your response. I reinstalled Cilium using Helm, and it worked.
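
For anyone else in the same spot, a rough sketch of such a reinstall with the upstream chart is below; the release name and Hubble values are assumptions and should mirror whatever the previous HelmChartConfig set (on RKE2 the bundled rke2-cilium chart is the usual management path):

helm repo add cilium https://helm.cilium.io/
helm repo update
# assumed values; adjust to match the earlier overrides
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true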

kode15333 commented:

I encountered a similar issue after rejoining a worker node to the Kubernetes cluster.

error="failed to apply option: listen tcp :4244: bind: address already in use" subsys=hubble

sudo netstat -tulnp | grep :4244
tcp6       0      0 :::4244                 :::*                    LISTEN      ****/cilium-agent

sudo fuser -k 4244/tcp

kubectl rollout restart ds/cilium -n kube-system

After that, my cilium status is okay.
Leaving it here in case it helps someone else.
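
To double-check a fix like this, Hubble can be queried again from inside an agent pod, mirroring the failing check earlier in the thread (selecting a pod through the DaemonSet is just one convenient option):

kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium status --brief
kubectl -n kube-system exec ds/cilium -c cilium-agent -- hubble status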
