Runtime Error with AMD GPU Helm Chart Installation in Kubernetes #48

Open
maarten-blokker opened this issue Jan 13, 2024 · 2 comments
maarten-blokker (Contributor) commented Jan 13, 2024

I am getting a runtime error when installing the AMD GPU Helm Chart in my Kubernetes cluster. The pod spawned by the DaemonSet fails to run, and its log shows a nil pointer dereference panic (SIGSEGV, segmentation violation). Could this be related to a permission issue?

I0113 15:54:58.192567       1 main.go:305] AMD GPU device plugin for Kubernetes
I0113 15:54:58.192606       1 main.go:305] ./k8s-device-plugin version v1.18.1-27-g5eb0a0f
I0113 15:54:58.192608       1 main.go:305] hwloc: _VERSION: 2.9.2, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I0113 15:54:58.192613       1 manager.go:42] Starting device plugin manager
I0113 15:54:58.192617       1 manager.go:46] Registering for system signal notifications
I0113 15:54:58.192738       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x5333f9]

goroutine 1 [running]:
github.com/fsnotify/fsnotify.(*Watcher).Close(0x0)
	/go/src/github.com/RadeonOpenCompute/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:75 +0x19
panic({0x8e5fc0, 0xd3af30})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/fsnotify/fsnotify.(*Watcher).isClosed(...)
	/go/src/github.com/RadeonOpenCompute/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:66
github.com/fsnotify/fsnotify.(*Watcher).Add(0x0, {0x986f88?, 0xc000116000?})
	/go/src/github.com/RadeonOpenCompute/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:94 +0x6b
github.com/kubevirt/device-plugin-manager/pkg/dpm.(*Manager).Run(0xc00006be70)
	/go/src/github.com/RadeonOpenCompute/k8s-device-plugin/vendor/github.com/kubevirt/device-plugin-manager/pkg/dpm/manager.go:55 +0x226
main.main()
	/go/src/github.com/RadeonOpenCompute/k8s-device-plugin/cmd/k8s-device-plugin/main.go:331 +0x4bb
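
In the trace above, `Watcher.Add` and `Watcher.Close` are both called with a receiver of `0x0`, which suggests the watcher was never created and the nil pointer was used anyway. The sketch below reproduces that pattern with the standard library only. The `watcher` type and `newWatcher` constructor are hypothetical stand-ins, not the fsnotify API, and the log does not say why the real watcher could not be created.

```go
package main

import (
	"errors"
	"fmt"
)

// watcher is a hypothetical stand-in for a file watcher. Its methods touch a
// field, so calling them on a nil pointer dereferences nil.
type watcher struct {
	closed bool
	paths  []string
}

// newWatcher is a hypothetical constructor that always fails, mimicking a
// constructor that returns (nil, err) when the OS refuses the request.
func newWatcher() (*watcher, error) {
	return nil, errors.New("simulated failure: could not create watcher")
}

// Add records a path; it panics if w is nil because it reads w.closed.
func (w *watcher) Add(path string) error {
	if w.closed {
		return errors.New("watcher closed")
	}
	w.paths = append(w.paths, path)
	return nil
}

// Close marks the watcher closed; it panics if w is nil.
func (w *watcher) Close() error {
	w.closed = true
	return nil
}

// runUnchecked discards the constructor error and uses the nil watcher.
// Add panics first, then the deferred Close panics again, which matches the
// doubled "panic:" line in the log. The recover exists only so that this
// demo can report the panic instead of crashing.
func runUnchecked() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	w, _ := newWatcher()
	defer w.Close()
	w.Add("/some/dir")
	return "no panic"
}

// runChecked handles the constructor error, so the nil watcher is never used
// and the underlying reason is reported to the caller.
func runChecked() error {
	w, err := newWatcher()
	if err != nil {
		return fmt.Errorf("creating watcher: %w", err)
	}
	defer w.Close()
	return w.Add("/some/dir")
}

func main() {
	fmt.Println("unchecked:", runUnchecked())
	fmt.Println("checked:  ", runChecked())
}
```

Running it reports the recovered panic for the unchecked path and a wrapped error for the checked path. Surfacing the constructor error in this way would show why the watcher could not be created, which the current log does not reveal.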

Steps to Reproduce:

  • Followed the installation instructions for the AMD GPU Helm Chart from Artifact Hub.
  • Deployed the chart to my Kubernetes cluster.
  • Observed the pod failing to start, logging the above error.

Expected Behavior:
The AMD GPU device plugin should install without errors and run successfully in the Kubernetes cluster.

Actual Behavior:
The pod crashes immediately after starting, with the log indicating a segmentation fault.

Extra Information:

  • Kubernetes server version: v1.27.6+k3s1
  • AMD CPU: AMD Ryzen 9 7940HS

mishak87 commented May 25, 2024

I get a similar error with the latest version:

I0525 16:11:36.754627       1 main.go:305] AMD GPU device plugin for Kubernetes
I0525 16:11:36.754694       1 main.go:305] ./k8s-device-plugin version v1.25.2.7-7-g813f150
I0525 16:11:36.754701       1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I0525 16:11:36.754709       1 manager.go:42] Starting device plugin manager
I0525 16:11:36.754719       1 manager.go:46] Registering for system signal notifications
I0525 16:11:36.754985       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x53bbf3]

goroutine 1 [running]:
github.com/fsnotify/fsnotify.(*Watcher).isClosed(...)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:66
github.com/fsnotify/fsnotify.(*Watcher).Close(0x0)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:75 +0x13
panic({0x8cf480?, 0xd53320?})
	/usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/fsnotify/fsnotify.(*Watcher).isClosed(...)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:66
github.com/fsnotify/fsnotify.(*Watcher).Add(0x0, {0x977505?, 0xc000204010?})
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:94 +0x57
github.com/kubevirt/device-plugin-manager/pkg/dpm.(*Manager).Run(0xc000033550)
	/go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/kubevirt/device-plugin-manager/pkg/dpm/manager.go:55 +0x21f
main.main()
	/go/src/github.com/ROCm/k8s-device-plugin/cmd/k8s-device-plugin/main.go:331 +0x4e9

EDIT: Installing the latest ROCm and doing a hard restart fixed the issue.
https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb

mishak87 commented

After a hard restart it happened again :-(
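
The failure coming back after a restart would be consistent with a resource limit on the node rather than a driver problem, though nothing in this thread confirms that. One possible cause, stated here purely as an assumption, is that the node has run out of inotify instances, in which case creating a file watcher would fail. A minimal Go sketch for reading the kernel's inotify limits on a Linux node (the helper names `parseLimit` and `readLimit` are illustrative):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseLimit converts the contents of a /proc/sys file, such as "128\n",
// into an integer.
func parseLimit(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// readLimit reads one inotify limit by name from /proc/sys/fs/inotify.
// It returns an error on systems where that file does not exist.
func readLimit(name string) (int, error) {
	data, err := os.ReadFile("/proc/sys/fs/inotify/" + name)
	if err != nil {
		return 0, err
	}
	return parseLimit(string(data))
}

func main() {
	for _, name := range []string{"max_user_instances", "max_user_watches"} {
		v, err := readLimit(name)
		if err != nil {
			fmt.Printf("%s: unavailable (%v)\n", name, err)
			continue
		}
		fmt.Printf("%s = %d\n", name, v)
	}
}
```

If `max_user_instances` turns out to be low on a node that runs many pods, that would be worth ruling out before reinstalling drivers again.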
