We're getting this error repeatedly in the logs, and no GPU metrics are being exported. This is on GKE:
failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:09:02.557651 1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I suspect it may be an issue with device names (nvidia0 vs nvidia0/gi0), but I'm not entirely sure.
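To illustrate what I mean (the map and variable names below are hypothetical, not the plugin's actual code): if the metrics path indexes devices by the parent name nvidia0 while ListAndWatch advertises the MIG partition name nvidia0/gi0, every lookup for nvidia0/gi0 would fail, which would match the error above.

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Metrics collection discovers the parent device (see "Found device nvidia0" in the logs).
	devices := map[string]string{
		"nvidia0": "GPU-4f8b874c-22da-69ad-2516-32c3a568d707",
	}

	// ListAndWatch advertises the MIG partition name (see "ID:nvidia0/gi0" in the logs).
	requested := "nvidia0/gi0"

	if _, ok := devices[requested]; !ok {
		// This is the shape of the "device nvidia0/gi0 not found" error we see.
		fmt.Printf("Failed to get device for %s: device %s not found\n", requested, requested)

		// A lookup against the parent device name would succeed.
		parent := strings.SplitN(requested, "/", 2)[0]
		if uuid, ok := devices[parent]; ok {
			fmt.Printf("parent device %s found, UUID %s\n", parent, uuid)
		}
	}
}
```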
Full logs:
➜ ~ k -n kube-system logs nvidia-gpu-device-plugin-4rcj2
I0518 15:07:21.825159 1 nvidia_gpu.go:75] device-plugin started
I0518 15:07:21.825230 1 nvidia_gpu.go:82] Reading GPU config file: /etc/nvidia/gpu_config.json
I0518 15:07:21.825362 1 nvidia_gpu.go:91] Using gpu config: {7g.40gb 0 { 0} []}
E0518 15:07:27.545855 1 nvidia_gpu.go:117] failed to start GPU device manager: failed to start mig device manager: Number of partitions (0) for GPU 0 does not match expected partition count (1)
I0518 15:07:32.547969 1 mig.go:175] Discovered GPU partition: nvidia0/gi0
I0518 15:07:32.549461 1 nvidia_gpu.go:122] Starting metrics server on port: 2112, endpoint path: /metrics, collection frequency: 30000
I0518 15:07:32.550354 1 metrics.go:134] Starting metrics server
I0518 15:07:32.550430 1 metrics.go:140] nvml initialized successfully. Driver version: 470.161.03
I0518 15:07:32.550446 1 devices.go:115] Found 1 GPU devices
I0518 15:07:32.556369 1 devices.go:126] Found device nvidia0 for metrics collection
I0518 15:07:32.556430 1 health_checker.go:65] Starting GPU Health Checker
I0518 15:07:32.556440 1 health_checker.go:68] Healthchecker receives device nvidia0/gi0, device {nvidia0/gi0 Healthy nil {} 0}+
I0518 15:07:32.556475 1 health_checker.go:77] Found 1 GPU devices
I0518 15:07:32.556667 1 health_checker.go:145] HealthChecker detects MIG is enabled on device nvidia0
I0518 15:07:32.560599 1 health_checker.go:164] Found mig device nvidia0/gi0 for health monitoring. UUID: MIG-7f562a1e-1c4c-5334-aff0-c679f8b6bc29
I0518 15:07:32.561030 1 health_checker.go:113] Registering device /dev/nvidia0. UUID: MIG-7f562a1e-1c4c-5334-aff0-c679f8b6bc29
I0518 15:07:32.561195 1 manager.go:385] will use alpha API
I0518 15:07:32.561206 1 manager.go:399] starting device-plugin server at: /device-plugin/nvidiaGPU-1684422452.sock
I0518 15:07:32.561393 1 manager.go:426] device-plugin server started serving
I0518 15:07:32.564986 1 beta_plugin.go:40] device-plugin: ListAndWatch start
I0518 15:07:32.565040 1 manager.go:434] device-plugin registered with the kubelet
I0518 15:07:32.565003 1 beta_plugin.go:138] ListAndWatch: send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:nvidia0/gi0,Health:Healthy,Topology:nil,},},}
E0518 15:08:02.557919 1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I0518 15:08:02.643081 1 metrics.go:217] Error calculating duty cycle for device: nvidia0: Failed to get dutyCycle: failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:08:32.557732 1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I0518 15:08:32.683032 1 metrics.go:217] Error calculating duty cycle for device: nvidia0: Failed to get dutyCycle: failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:09:02.557651 1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
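As far as I can tell, NVML return code 3 is NVML_ERROR_NOT_SUPPORTED, and whole-GPU utilization queries (nvmlDeviceGetUtilizationRates) are not supported once MIG is enabled on the device, which would explain the duty-cycle errors for nvidia0. A rough sketch of how to confirm this with the go-nvml bindings (just an illustration, not the plugin's own code):

```go
package main

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	// Look the parent GPU up by the UUID that appears in the error above.
	dev, ret := nvml.DeviceGetHandleByUUID("GPU-4f8b874c-22da-69ad-2516-32c3a568d707")
	if ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}

	// On a MIG-enabled GPU this is expected to fail with ERROR_NOT_SUPPORTED,
	// i.e. "nvml return code: 3" as seen in the logs.
	if _, ret := dev.GetUtilizationRates(); ret == nvml.ERROR_NOT_SUPPORTED {
		fmt.Println("GetUtilizationRates not supported (MIG enabled), nvml return code 3")
	}
}
```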
Downstream issue: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/76.