
Agent does not rejoin the cluster after restart #7653

Closed
c-datculescu opened this issue Jan 31, 2025 · 5 comments

Comments


c-datculescu commented Jan 31, 2025

Environmental Info:
RKE2 Version: v1.28.3-rke2r1-d81df4077773

Node(s) CPU architecture, OS, and Version: Linux prod-rke2-agent-node04 5.4.0-205-generic #225-Ubuntu SMP Fri Jan 10 22:23:35 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 servers, 11 agents

Describe the bug:
After restarting the rke2-agent for certificate rotation, the agent does not start, and logs the following message:

Jan 31 10:45:43 prod-rke2-agent-node04 rke2[938]: time="2025-01-31T10:45:43+01:00" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

The node was removed from the cluster for the operation, the password secret is no longer present, and the password has additionally been removed from /etc/rancher/node/password. This does not happen for all nodes in the cluster, only for a few of them.

Steps To Reproduce:
Unfortunately we cannot reproduce this in any of the other clusters that we are running (5 additional clusters).

  • Installed RKE2: RKE2 is installed from the binaries, with no configuration beyond the defaults.

Expected behavior:
The agent should rejoin the cluster with no problems.

Actual behavior:
The node hangs and never rejoins the cluster.

Additional context / logs:
No additional logs in any of the server/agent components related to this.

brandond (Member) commented Jan 31, 2025

> The node is removed from the cluster for the operation, the password secret is not there and additionally the password has been removed from /etc/rancher/node/password.

Something else is doing this. There is no logic in RKE2 to delete nodes from the cluster, nor to delete the node password file off the disk. Some other system management tool or automated process perhaps?

c-datculescu (Author) commented Jan 31, 2025

@brandond: sorry, I was unclear about this. The node initially does not rejoin the cluster on a simple restart.

We then manually remove the node from the cluster (deleting the node, which also removes the secret in k8s), and we remove the password file from disk, which should allow it to start as a completely new node. It still fails with the same error.

Our assumption is that some sort of cache is involved that prevents the node from rejoining.

manuelbuil (Contributor) commented:
> @brandond: sorry, I was unclear about this. The node initially does not rejoin the cluster on a simple restart.
>
> We then manually remove the node from the cluster (deleting the node, which also removes the secret in k8s), and we remove the password file from disk, which should allow it to start as a completely new node. It still fails with the same error.
>
> Our assumption is that some sort of cache is involved that prevents the node from rejoining.

If you didn't remove the node using kubectl delete node, the password for that node is still kept as a secret inside the cluster. When the node tries to rejoin, the server expects it to present the same password; if it does not have it, you'll see problems. You can remove the secret manually, and then the new node should be able to join the cluster. Run kubectl get secrets -n kube-system, look for xxxxx.node-password.rke2 (where xxxxx is the node name), and remove it.
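The check described above can be sketched as a simplified Python model of the server-side logic. This is an illustration only, not RKE2's actual implementation: the secret-name pattern comes from the comment above, while the sha256 hashing and the in-memory dict standing in for kube-system secrets are assumptions made for the sketch.

```python
import hashlib

# Simplified stand-in for the kube-system secret store.
# RKE2/K3s keeps one secret per node, named "<node>.node-password.rke2".
secrets = {}

def secret_name(node):
    return f"{node}.node-password.rke2"

def verify_node_password(node, password):
    """Model of the server-side check: the first join stores a hash,
    and later joins must present the same password."""
    name = secret_name(node)
    digest = hashlib.sha256(password.encode()).hexdigest()
    if name not in secrets:
        secrets[name] = digest        # first registration: record the password
        return True
    return secrets[name] == digest    # rejoin: must match the stored hash

# First join succeeds and records the password.
assert verify_node_password("node04", "original-pass")
# A rejoin with a fresh /etc/rancher/node/password is rejected while the
# stale secret still exists -- the "Node password rejected" situation.
assert not verify_node_password("node04", "new-pass")
# Deleting the secret (kubectl delete secret ...) lets the node register anew.
del secrets[secret_name("node04")]
assert verify_node_password("node04", "new-pass")
```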

c-datculescu (Author) commented:

The node was removed with kubectl delete node. The secret was removed as well. The error persists even though the node is no longer registered in the cluster and everything seems clean.

I did some more tests today. After a number of restarts (sometimes 1, sometimes 10+) and removing /etc/rancher/node/password, it eventually manages to connect to the cluster. I am very puzzled by this behavior, as I cannot replicate it anywhere except in this cluster.

On Monday I have a server restart planned to see if that helps the situation at all.

brandond (Member) commented:

I don't understand. Why are you deleting the node and node password file as part of agent certificate rotation? This is in no way necessary.

I have seen a couple of reports of the node password secret cache holding on to stale entries, but I have not been able to reproduce it. If this is what's occurring, restarting the server nodes should resolve it.

I'd recommend you stop unnecessarily deleting the nodes though.
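The stale-cache hypothesis above can be illustrated with a small sketch. This is hypothetical, not RKE2's actual code: it assumes the server answers password checks from an in-memory cache that is not invalidated when the underlying secret is deleted, which would explain why the rejection persists until a server restart drops the cache.

```python
# Hypothetical illustration of the stale-cache failure mode; the Server
# class, its cache, and the dict-backed "store" are assumptions for the
# sketch, not RKE2 internals.
class Server:
    def __init__(self, store):
        self.store = store      # shared secret store (stand-in for etcd)
        self.cache = {}         # in-memory cache, filled on first lookup

    def check(self, node, password):
        if node not in self.cache:
            self.cache[node] = self.store.get(node)  # cache miss: read store
        expected = self.cache[node]
        if expected is None:              # no secret known: register the node
            self.store[node] = password
            self.cache[node] = password
            return True
        return expected == password       # rejoin: must match cached entry

store = {"node04": "old-pass"}
server = Server(store)
server.cache["node04"] = "old-pass"      # server has already cached the entry

# Admin deletes the node's password secret...
del store["node04"]
# ...but the stale cache entry still rejects the fresh password.
assert not server.check("node04", "new-pass")

# "Restarting" the server drops the in-memory cache; the node can rejoin.
server = Server(store)
assert server.check("node04", "new-pass")
```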

3 participants