Agent does not rejoin the cluster after restart #7653
Comments
Something else is doing this. There is no logic in RKE2 to delete nodes from the cluster, nor to delete the node password file off the disk. Some other system management tool or automated process perhaps?
@brandond: Sorry, I was unclear about this. The node initially does not rejoin the cluster on a simple restart. We then manually remove the node from the cluster (deleting the node, which removes the secret in k8s as well) and remove the password file from disk, which should allow it to start like a completely new node. It still fails with the same error. Our assumption is that some sort of cache is involved that prevents the node from rejoining.
If you didn't remove the node using |
The node was removed with I did some more tests today. After a number of restarts (sometimes 1, sometimes 10+), and removing the On Monday I have planned a server restart to see if that helps the situation at all.
I don't understand. Why are you deleting the node and node password file as part of agent certificate rotation? This is in no way necessary. I have seen a couple of reports of the node password secret cache holding on to stale entries, but I have not been able to reproduce it. If this is what's occurring, restarting the server nodes should resolve that. I'd recommend you stop unnecessarily deleting the nodes though.
Environmental Info:
RKE2 Version: v1.28.3-rke2r1-d81df4077773
Node(s) CPU architecture, OS, and Version: Linux prod-rke2-agent-node04 5.4.0-205-generic #225-Ubuntu SMP Fri Jan 10 22:23:35 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration: 3 servers, 11 agents
Describe the bug:
After restarting the rke2-agent for certificate rotation, the agent does not start with the log message being:
The node is removed from the cluster for the operation, the node password secret is no longer present, and the password file at /etc/rancher/node/password has also been removed. This does not happen for all nodes in the cluster, only for a few of them.
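For anyone trying to confirm the same state on their own cluster, the three things mentioned above (node object, node password secret, password file) can be checked with a small script. This is only a rough sketch, not something from the reporter or the RKE2 maintainers. It assumes `kubectl` is on the PATH with access to the cluster, and it assumes the node password secret lives in `kube-system` under the name `<node>.node-password.rke2`; verify that naming against your RKE2 version. The helper names (`secret_name`, `kubectl_checks`, `report`) are made up for illustration.

```python
import os
import subprocess
import sys

# Assumption: RKE2 keeps node passwords as kube-system secrets named
# "<node>.node-password.rke2". Verify this against your RKE2 version.
SECRET_SUFFIX = ".node-password.rke2"
# Path taken from the report above; it lives on the agent node itself.
PASSWORD_FILE = "/etc/rancher/node/password"


def secret_name(node: str) -> str:
    """Build the assumed node password secret name for a node."""
    return f"{node}{SECRET_SUFFIX}"


def kubectl_checks(node: str) -> dict:
    """Return the kubectl commands used to look for leftover cluster state."""
    return {
        "node": ["kubectl", "get", "node", node,
                 "--ignore-not-found", "-o", "name"],
        "secret": ["kubectl", "-n", "kube-system", "get", "secret",
                   secret_name(node), "--ignore-not-found", "-o", "name"],
    }


def exists(cmd: list) -> bool:
    """Run a kubectl get command; non-empty output means the object exists."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return bool(result.stdout.strip())


def report(node: str) -> dict:
    """Collect the three pieces of state discussed in this issue."""
    state = {name: exists(cmd) for name, cmd in kubectl_checks(node).items()}
    # Only meaningful when the script runs on the affected agent node.
    state["password_file"] = os.path.exists(PASSWORD_FILE)
    return state


if __name__ == "__main__" and len(sys.argv) > 1:
    print(report(sys.argv[1]))
```

Run it on the affected agent with the node name as the only argument. If all three values come back `False` and the agent still refuses to join, that would be consistent with stale state held somewhere other than the secret or the file, but the script itself cannot confirm that.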
Steps To Reproduce:
Unfortunately we cannot reproduce this in any of the other clusters that we are running (5 additional clusters).
Expected behavior:
The agent should rejoin the cluster with no problems.
Actual behavior:
The node remains hanging and never rejoins the cluster.
Additional context / logs:
No additional logs in any of the server/agent components related to this.