
Agent does not rejoin the cluster after restart #7653

Closed
c-datculescu opened this issue Jan 31, 2025 · 5 comments

Comments


c-datculescu commented Jan 31, 2025

Environmental Info:
RKE2 Version: v1.28.3-rke2r1-d81df4077773

Node(s) CPU architecture, OS, and Version: Linux prod-rke2-agent-node04 5.4.0-205-generic #225-Ubuntu SMP Fri Jan 10 22:23:35 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 servers, 11 agents

Describe the bug:
After restarting the rke2-agent for certificate rotation, the agent does not start, and logs the following message:

Jan 31 10:45:43 prod-rke2-agent-node04 rke2[938]: time="2025-01-31T10:45:43+01:00" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

The node was removed from the cluster for the operation, the password secret is no longer present, and the password has additionally been removed from /etc/rancher/node/password. This does not happen for all nodes in the cluster, only for a few of them.

Steps To Reproduce:
Unfortunately we cannot reproduce this in any of the other clusters that we are running (5 additional clusters).

  • Installed RKE2: RKE2 is installed from the binaries, with no configuration beyond the defaults.

Expected behavior:
The agent should rejoin the cluster with no problems.

Actual behavior:
The node hangs and never rejoins the cluster.

Additional context / logs:
No additional logs in any of the server/agent components related to this.

brandond (Member) commented Jan 31, 2025

> The node is removed from the cluster for the operation, the password secret is not there and additionally the password has been removed from /etc/rancher/node/password.

Something else is doing this. There is no logic in RKE2 to delete nodes from the cluster, nor to delete the node password file off the disk. Some other system management tool or automated process perhaps?

c-datculescu (Author) commented Jan 31, 2025

@brandond: sorry, I was unclear about this. The node initially does not rejoin the cluster on a simple restart.

We then manually remove the node from the cluster (deleting the node, which also removes the secret in k8s), and we remove the password file from disk, which should allow it to start as a completely new node. It still fails with the same error.

Our assumption is that some sort of cache is involved that prevents the node from rejoining.

manuelbuil (Contributor) commented:
> @brandond: sorry, I was unclear about this. The node initially does not rejoin the cluster on a simple restart.
>
> We then manually remove the node from the cluster (deleting the node, which also removes the secret in k8s), and we remove the password file from disk, which should allow it to start as a completely new node. It still fails with the same error.
>
> Our assumption is that some sort of cache is involved that prevents the node from rejoining.

If you didn't remove the node using kubectl delete node, the password for that node is still kept as a secret inside the cluster. When the node tries to rejoin, the server expects it to present the same password; if it does not have it, you'll see problems. You can remove the secret manually, and then the new node should be able to join the cluster. Run kubectl get secrets -n kube-system, look for xxxxx.node-password.rke2 (where xxxxx is the node name), and remove it.
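The check described above can be sketched as a simplified Python model of the server-side logic. This is an illustration only, not RKE2's actual implementation: the secret-name pattern comes from the comment above, while the sha256 hashing and the in-memory dict standing in for kube-system secrets are assumptions made for the sketch.

```python
import hashlib

# Simplified stand-in for the kube-system secret store.
# RKE2/K3s keeps one secret per node, named "<node>.node-password.rke2".
secrets = {}

def secret_name(node):
    return f"{node}.node-password.rke2"

def verify_node_password(node, password):
    """Model of the server-side check: the first join stores a hash,
    and later joins must present the same password."""
    name = secret_name(node)
    digest = hashlib.sha256(password.encode()).hexdigest()
    if name not in secrets:
        secrets[name] = digest        # first registration: record the password
        return True
    return secrets[name] == digest    # rejoin: must match the stored hash

# First join succeeds and records the password.
assert verify_node_password("node04", "original-pass")
# A rejoin with a fresh /etc/rancher/node/password is rejected while the
# stale secret still exists -- the "Node password rejected" situation.
assert not verify_node_password("node04", "new-pass")
# Deleting the secret (kubectl delete secret ...) lets the node register anew.
del secrets[secret_name("node04")]
assert verify_node_password("node04", "new-pass")
```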

c-datculescu (Author) commented:

The node was removed with kubectl delete node. The secret was removed as well. The error persists even though the node is no longer registered in the cluster and everything seems clean.

I did some more tests today. After a number of restarts (sometimes 1, sometimes 10+) and removing /etc/rancher/node/password, it eventually manages to connect to the cluster. I am very puzzled by this behavior, as I cannot replicate it anywhere except in this cluster.

On Monday I have a server restart planned to see if that helps the situation at all.

brandond (Member) commented:

I don't understand. Why are you deleting the node and node password file as part of agent certificate rotation? This is in no way necessary.

I have seen a couple of reports of the node password secret cache holding on to stale entries, but I have not been able to reproduce it. If this is what's occurring, restarting the server nodes should resolve it.

I'd recommend you stop unnecessarily deleting the nodes though.
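The stale-cache hypothesis above can be illustrated with a small sketch. This is hypothetical, not RKE2's actual code: it assumes the server answers password checks from an in-memory cache that is not invalidated when the underlying secret is deleted, which would explain why the rejection persists until a server restart drops the cache.

```python
# Hypothetical illustration of the stale-cache failure mode; the Server
# class, its cache, and the dict-backed "store" are assumptions for the
# sketch, not RKE2 internals.
class Server:
    def __init__(self, store):
        self.store = store      # shared secret store (stand-in for etcd)
        self.cache = {}         # in-memory cache, filled on first lookup

    def check(self, node, password):
        if node not in self.cache:
            self.cache[node] = self.store.get(node)  # cache miss: read store
        expected = self.cache[node]
        if expected is None:              # no secret known: register the node
            self.store[node] = password
            self.cache[node] = password
            return True
        return expected == password       # rejoin: must match cached entry

store = {"node04": "old-pass"}
server = Server(store)
server.cache["node04"] = "old-pass"      # server has already cached the entry

# Admin deletes the node's password secret...
del store["node04"]
# ...but the stale cache entry still rejects the fresh password.
assert not server.check("node04", "new-pass")

# "Restarting" the server drops the in-memory cache; the node can rejoin.
server = Server(store)
assert server.check("node04", "new-pass")
```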

3 participants