Kubelet memory leak when a plugin is registered twice #124716

Open
Black-max12138 opened this issue May 7, 2024 · 7 comments · May be fixed by #124719
Labels
area/kubelet kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Black-max12138

Black-max12138 commented May 7, 2024

What happened?

We found that kubelet memory usage kept increasing, so we exported a pprof goroutine profile. gRPC goroutines are leaking, which causes the memory leak.
image
We found that the cause was one of the plugins repeatedly registering twice under the same name. The following code causes this situation: one client is lost.

func (s *server) registerClient(name string, c Client) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	s.clients[name] = c
	klog.V(2).InfoS("Registered client", "name", name)
}

When two registration requests with the same name arrive at the same time, the second overwrites the first, so only one client is retained in s.clients.

func (s *server) runClient(name string, c Client) {
	c.Run()
	c = s.getClient(name)
	if c == nil {
		return
	}
	if err := s.disconnectClient(name, c); err != nil {
		klog.V(2).InfoS("Unable to disconnect client", "resource", name, "client", c, "err", err)
	}
}

Therefore, after c.Run() returns in runClient, s.getClient can only return the one client still held in s.clients, so the overwritten client is never passed to s.disconnectClient. As a result, c.grpc.Close() is never invoked for that client, which leaks memory and goroutines.
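
The sequence described above can be reproduced outside the kubelet with a minimal simulation. This is only a sketch: `fakeClient` and `simulate` are hypothetical names, and the body of `disconnectClient` (deregister by name, then disconnect) is an assumption about its shape, not a copy of the kubelet source.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fakeClient is a hypothetical stand-in for the kubelet's device plugin client.
type fakeClient struct {
	done   chan struct{} // closed to make Run return
	closed atomic.Bool   // set once Disconnect has been called
}

func (c *fakeClient) Run()        { <-c.done }
func (c *fakeClient) Disconnect() { c.closed.Store(true) }

type server struct {
	mutex   sync.Mutex
	clients map[string]*fakeClient
}

func (s *server) registerClient(name string, c *fakeClient) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	s.clients[name] = c // silently overwrites an earlier client with the same name
}

func (s *server) getClient(name string) *fakeClient {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	return s.clients[name]
}

// disconnectClient is an assumed shape: deregister by name, then disconnect c.
func (s *server) disconnectClient(name string, c *fakeClient) {
	s.mutex.Lock()
	delete(s.clients, name)
	s.mutex.Unlock()
	c.Disconnect()
}

// runClient follows the logic quoted in the issue.
func (s *server) runClient(name string, c *fakeClient) {
	c.Run()
	c = s.getClient(name) // may be a different client than the one that just ran
	if c == nil {
		return
	}
	s.disconnectClient(name, c)
}

// simulate registers two clients under one name and reports which were closed.
func simulate() (firstClosed, secondClosed bool) {
	s := &server{clients: map[string]*fakeClient{}}
	c1 := &fakeClient{done: make(chan struct{})}
	c2 := &fakeClient{done: make(chan struct{})}
	s.registerClient("fuse", c1)
	s.registerClient("fuse", c2)

	var wg sync.WaitGroup
	for _, c := range []*fakeClient{c1, c2} {
		wg.Add(1)
		go func(c *fakeClient) {
			defer wg.Done()
			s.runClient("fuse", c)
		}(c)
	}
	close(c1.done)
	close(c2.done)
	wg.Wait()
	return c1.closed.Load(), c2.closed.Load()
}

func main() {
	first, second := simulate()
	fmt.Printf("first client closed: %v, second client closed: %v\n", first, second)
}
```

In this simulation the first client is never disconnected, regardless of which goroutine finishes first, because it can no longer be looked up by name.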

What did you expect to happen?

Even in this case, kubelet should not leak memory.
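
One possible way to meet this expectation, shown here only as a sketch, is to have runClient disconnect the client that actually ran and remove the map entry only if it still points at that client. This is not necessarily the approach taken in the linked pull request; `fakeClient`, `deregisterIfCurrent`, and `simulateFixed` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fakeClient is a hypothetical stand-in for the kubelet's device plugin client.
type fakeClient struct {
	done   chan struct{}
	closed atomic.Bool
}

func (c *fakeClient) Run()        { <-c.done }
func (c *fakeClient) Disconnect() { c.closed.Store(true) }

type server struct {
	mutex   sync.Mutex
	clients map[string]*fakeClient
}

func (s *server) registerClient(name string, c *fakeClient) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	s.clients[name] = c
}

// deregisterIfCurrent removes the entry only when it still refers to c,
// so a newer registration under the same name is left untouched.
func (s *server) deregisterIfCurrent(name string, c *fakeClient) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	if s.clients[name] == c {
		delete(s.clients, name)
	}
}

// runClient always disconnects the client that ran, not the one found by name.
func (s *server) runClient(name string, c *fakeClient) {
	c.Run()
	s.deregisterIfCurrent(name, c)
	c.Disconnect()
}

// simulateFixed repeats the double registration and reports the outcome.
func simulateFixed() (firstClosed, secondClosed bool, remaining int) {
	s := &server{clients: map[string]*fakeClient{}}
	c1 := &fakeClient{done: make(chan struct{})}
	c2 := &fakeClient{done: make(chan struct{})}
	s.registerClient("fuse", c1)
	s.registerClient("fuse", c2)

	var wg sync.WaitGroup
	for _, c := range []*fakeClient{c1, c2} {
		wg.Add(1)
		go func(c *fakeClient) {
			defer wg.Done()
			s.runClient("fuse", c)
		}(c)
	}
	close(c1.done)
	close(c2.done)
	wg.Wait()
	return c1.closed.Load(), c2.closed.Load(), len(s.clients)
}

func main() {
	first, second, remaining := simulateFixed()
	fmt.Println(first, second, remaining)
}
```

An alternative design would be to disconnect any existing client inside registerClient before overwriting it; which option fits the kubelet better depends on details not covered in this issue.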

How can we reproduce it (as minimally and precisely as possible)?

1. Have the plug-in register every 5 seconds, sending two registration requests at the same time.
2. Observe that kubelet memory usage keeps increasing.
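
The symptom in step 2 can be illustrated without a kubelet. The sketch below assumes a connection type that owns a background goroutine which only exits on Close; `conn`, `dial`, and `leakedGoroutines` are hypothetical names, not gRPC or kubelet APIs. Each round opens two connections and closes only one, mimicking the lost client, and the goroutine count from `runtime.NumGoroutine` grows accordingly.

```go
package main

import (
	"fmt"
	"runtime"
)

// conn is a hypothetical stand-in for a client connection that owns a
// background goroutine until Close is called.
type conn struct {
	stop    chan struct{}
	stopped chan struct{}
}

func dial() *conn {
	c := &conn{stop: make(chan struct{}), stopped: make(chan struct{})}
	go func() {
		defer close(c.stopped)
		<-c.stop // blocks forever if Close is never called
	}()
	return c
}

// Close stops the background goroutine and waits for it to signal completion.
func (c *conn) Close() {
	close(c.stop)
	<-c.stopped
}

// leakedGoroutines runs the given number of double registrations, closing
// only one of the two connections each time, and returns the growth in the
// number of live goroutines.
func leakedGoroutines(rounds int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < rounds; i++ {
		lost := dial() // the overwritten client: never closed
		kept := dial() // the client still reachable by name
		_ = lost
		kept.Close()
	}
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutine growth after 100 rounds:", leakedGoroutines(100))
}
```

The growth is at least one goroutine per round, and it never shrinks, which matches the shape of the leak seen in the goroutine profile.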

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# 1.28

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Black-max12138 Black-max12138 added the kind/bug Categorizes issue or PR as related to a bug. label May 7, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 7, 2024
@Black-max12138
Author

/area kubelet

@ffromani
Contributor

ffromani commented May 7, 2024

/sig-node
/triage accepted
/priority backlog

I agree this seems to be a bug; however, I wonder how often plugins fight each other for registration and how severe the memory leak is. With precise numbers we can re-evaluate the prioritization.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 7, 2024
@carlory
Member

carlory commented May 7, 2024

/assign

@Black-max12138
Author

> /sig-node /triage accepted /priority backlog
>
> I agree this seems to be a bug; however, I wonder how often plugins fight each other for registration and how severe the memory leak is. With precise numbers we can re-evaluate the prioritization.

The plug-in re-registers every 5 seconds and sends two registration requests each time. The kubelet process grows by about 1 GB in roughly four days. The kubelet plug-in registration log is as follows:

E0506 02:43:49.962434 11715 client.go:88] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="fuse"
I0506 02:43:49.962523 11715 handler.go:102] "Deregistered client" name="fuse"
I0506 02:43:49.962565 11715 manager.go:367] "Mark all resources Unhealthy for resource" resourceName="fuse"
I0506 02:43:49.962707 11715 manager.go:241] "Endpoint became unhealthy" resourceName="fuse" endpoint={}
E0506 02:43:49.962466 11715 client.go:88] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="fuse"
I0506 02:43:49.963640 11715 server.go:147] "Got registration request from device plugin with resource" resourceName="fuse"
I0506 02:43:49.963718 11715 handler.go:94] "Registered client" name="fuse"
I0506 02:43:49.963914 11715 server.go:147] "Got registration request from device plugin with resource" resourceName="fuse"
I0506 02:43:49.963988 11715 handler.go:94] "Registered client" name="fuse"
I0506 02:43:49.964423 11715 manager.go:229] "Device plugin connected" resourceName="fuse"
I0506 02:43:49.964689 11715 manager.go:229] "Device plugin connected" resourceName="fuse"
I0506 02:43:49.965105 11715 client.go:91] "State pushed for device plugin" resource="fuse" resourceCapacity=512
I0506 02:43:49.966985 11715 client.go:91] "State pushed for device plugin" resource="fuse" resourceCapacity=512
I0506 02:43:49.975355 11715 manager.go:278] "Processed device updates for resource" resourceName="fuse" totalCount=512 healthyCount=512
I0506 02:43:49.982987 11715 manager.go:278] "Processed device updates for resource" resourceName="fuse" totalCount=512 healthyCount=512
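
Taking the figures reported above at face value (one registration cycle every 5 seconds, about 1 GB of growth over four days), a back-of-the-envelope estimate of the cost per cycle can be computed as below. This assumes exactly one client is lost per cycle and that 1 GB means 2^30 bytes; both are assumptions, and `estimate` is a hypothetical helper.

```go
package main

import "fmt"

// estimate returns the number of registration cycles in the given period and
// the average memory growth per cycle, assuming one leaked client per cycle.
func estimate(days, intervalSec int, growthBytes int64) (cycles int64, bytesPerCycle float64) {
	cycles = int64(days) * 24 * 60 * 60 / int64(intervalSec)
	bytesPerCycle = float64(growthBytes) / float64(cycles)
	return cycles, bytesPerCycle
}

func main() {
	cycles, perCycle := estimate(4, 5, 1<<30)
	fmt.Printf("%d cycles, about %.1f KiB per cycle\n", cycles, perCycle/1024)
}
```

Under these assumptions there are 69120 cycles in four days, so each lost client accounts for roughly 15 KiB.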

@carlory
Member

carlory commented May 8, 2024

@Black-max12138

How is your device plugin registered with the kubelet?

  • Model 1: plugin registers with Kubelet through grpc
  • Model 2: Kubelet watches new plugins under a canonical path through inotify

According to the messages "Got registration request from device plugin with resource" and "Deregistered client" name="fuse" that you pasted, it seems that both models are used in your case. Why? Is it expected by design? cc @ffromani

FYI: https://github.com/vikaschoudhary16/community/blob/1f85184ff266aec51482a2b1112a779e2201a564/contributors/design-proposals/node/plugin-watcher.md

@Black-max12138
Author

> @Black-max12138
>
> How is your device plugin registered with the kubelet?
>
>   • Model 1: plugin registers with Kubelet through grpc
>   • Model 2: Kubelet watches new plugins under a canonical path through inotify
>
> According to the messages "Got registration request from device plugin with resource" and "Deregistered client" name="fuse" that you pasted, it seems that both models are used in your case. Why? Is it expected by design? cc @ffromani
>
> FYI: https://github.com/vikaschoudhary16/community/blob/1f85184ff266aec51482a2b1112a779e2201a564/contributors/design-proposals/node/plugin-watcher.md

We are using model 1, but the plugin repeats the registration every 5 seconds.

@ffromani
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 10, 2024
@pacoxu pacoxu added this to Triage in SIG Node Bugs May 14, 2024
4 participants