
Extreme CPU Usage on small cluster #10396

Closed
Bonn93 opened this issue Jun 24, 2024 · 3 comments

Bonn93 commented Jun 24, 2024

Environmental Info:
K3s Version:

k3s version v1.30.1+k3s1 (80978b5b)
go version go1.22.2

Node(s) CPU architecture, OS, and Version:

Linux k3s-server.internal.self-hosted.io 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Apr 4 18:13:02 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
1 server, 2 agents

Describe the bug:
The k3s server has extremely high CPU usage; even scaling pods back to 0 makes no change. A 4-core/16 GB RAM machine sits at a high load average and often spikes to 50-60!

load average: 13.42, 20.74, 22.08

Steps To Reproduce:
k3s has been running for a while with the upgrade controller/operator.

Expected behavior:
CPU usage stays within sane limits.

Actual behavior:
Extremely high CPU usage, slow API response times and often timeouts. Appears to spam /var/log/messages with trace logging.

Additional context / logs:
Nodes have local NVMe drives. The k3s-server process is using 400% CPU, and there are no other large processes on the system. It shows high %usr time with about 20% sys/kernel time. The state.db is also ~12 GB and SQLite fails to vacuum. Cluster age is 270d.
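
For reference, roughly how the vacuum was attempted (a sketch assuming the default k3s data directory; the state.db path may differ on other installs):

```bash
# Stop k3s first so nothing is writing to the datastore
sudo systemctl stop k3s

# Try to reclaim space in the kine-backed SQLite datastore
# (default location for a standard k3s server install)
sudo sqlite3 /var/lib/rancher/k3s/server/db/state.db 'VACUUM;'

sudo systemctl start k3s
```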

Bonn93 commented Jun 24, 2024

Okay, this is certainly SQLite related. After migrating to etcd by adding --cluster-init to the systemd unit, the cluster has returned to normal levels; however, the Trace: logging is still present.

The server is now at a 0.4 one-minute load average and responding to requests quickly. The SQLite vacuum did not do anything, but migrating to etcd did. Setting -v=0 in the systemd unit doesn't seem to have any effect on the logs either.
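
For anyone doing the same conversion, roughly the steps I followed (a sketch assuming the standard install-script layout; adjust paths and arguments to your setup):

```bash
# Stop the server
sudo systemctl stop k3s

# Add --cluster-init to the server arguments, e.g. in
# /etc/systemd/system/k3s.service:
#   ExecStart=/usr/local/bin/k3s server --cluster-init -v=0
# On restart, k3s migrates the existing SQLite datastore to embedded etcd.
sudo systemctl daemon-reload
sudo systemctl start k3s

# Sanity-check that etcd is now the datastore
journalctl -u k3s | grep -i etcd | tail
sudo k3s kubectl get nodes
```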

Happy to close as the root issue seems clear.

Bonn93 closed this as completed Jun 24, 2024
paketb0te commented

@Bonn93 we are running into a similar issue - I found that reinstalling the cluster immediately improved the situation (it doesn't matter whether I reinstall with or without the --cluster-init flag to enable etcd).
Did using etcd as the datastore fix the issue for you in the long run?

Bonn93 commented Feb 12, 2025

Yeah, I've bootstrapped a few smaller clusters recently, and the ones with a non-HA (SQLite-backed) control plane all end up like this in the long run. Adding the --cluster-init flag and converting to etcd fixes it, and those converted clusters have all stayed healthy long-term.

I just bootstrap with single-node etcd now. I don't think SQLite is the best choice.
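
Roughly what that looks like at install time (a sketch using the standard install script; flags as documented for embedded etcd):

```bash
# Single-server install backed by embedded etcd instead of SQLite
curl -sfL https://get.k3s.io | sh -s - server --cluster-init
```

The same can also be set via `cluster-init: true` in /etc/rancher/k3s/config.yaml before installing.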
