Metrics export broken due to device naming mismatch #296

Open
igorwwwwwwwwwwwwwwwwwwww opened this issue May 24, 2023 · 0 comments

We're getting this error repeatedly in the logs, and no GPU metrics are being exported. This is on GKE:

GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:09:02.557651       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found

I suspect it's a mismatch between device names (nvidia0 vs nvidia0/gi0), but I'm not entirely sure.

Full logs:

➜  ~ k -n kube-system logs nvidia-gpu-device-plugin-4rcj2
I0518 15:07:21.825159       1 nvidia_gpu.go:75] device-plugin started
I0518 15:07:21.825230       1 nvidia_gpu.go:82] Reading GPU config file: /etc/nvidia/gpu_config.json
I0518 15:07:21.825362       1 nvidia_gpu.go:91] Using gpu config: {7g.40gb 0 { 0} []}
E0518 15:07:27.545855       1 nvidia_gpu.go:117] failed to start GPU device manager: failed to start mig device manager: Number of partitions (0) for GPU 0 does not match expected partition count (1)
I0518 15:07:32.547969       1 mig.go:175] Discovered GPU partition: nvidia0/gi0
I0518 15:07:32.549461       1 nvidia_gpu.go:122] Starting metrics server on port: 2112, endpoint path: /metrics, collection frequency: 30000
I0518 15:07:32.550354       1 metrics.go:134] Starting metrics server
I0518 15:07:32.550430       1 metrics.go:140] nvml initialized successfully. Driver version: 470.161.03
I0518 15:07:32.550446       1 devices.go:115] Found 1 GPU devices
I0518 15:07:32.556369       1 devices.go:126] Found device nvidia0 for metrics collection
I0518 15:07:32.556430       1 health_checker.go:65] Starting GPU Health Checker
I0518 15:07:32.556440       1 health_checker.go:68] Healthchecker receives device nvidia0/gi0, device {nvidia0/gi0 Healthy nil {} 0}+
I0518 15:07:32.556475       1 health_checker.go:77] Found 1 GPU devices
I0518 15:07:32.556667       1 health_checker.go:145] HealthChecker detects MIG is enabled on device nvidia0
I0518 15:07:32.560599       1 health_checker.go:164] Found mig device nvidia0/gi0 for health monitoring. UUID: MIG-7f562a1e-1c4c-5334-aff0-c679f8b6bc29
I0518 15:07:32.561030       1 health_checker.go:113] Registering device /dev/nvidia0. UUID: MIG-7f562a1e-1c4c-5334-aff0-c679f8b6bc29
I0518 15:07:32.561195       1 manager.go:385] will use alpha API
I0518 15:07:32.561206       1 manager.go:399] starting device-plugin server at: /device-plugin/nvidiaGPU-1684422452.sock
I0518 15:07:32.561393       1 manager.go:426] device-plugin server started serving
I0518 15:07:32.564986       1 beta_plugin.go:40] device-plugin: ListAndWatch start
I0518 15:07:32.565040       1 manager.go:434] device-plugin registered with the kubelet
I0518 15:07:32.565003       1 beta_plugin.go:138] ListAndWatch: send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:nvidia0/gi0,Health:Healthy,Topology:nil,},},}
E0518 15:08:02.557919       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I0518 15:08:02.643081       1 metrics.go:217] Error calculating duty cycle for device: nvidia0: Failed to get dutyCycle: failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:08:32.557732       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I0518 15:08:32.683032       1 metrics.go:217] Error calculating duty cycle for device: nvidia0: Failed to get dutyCycle: failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:09:02.557651       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found

Downstream issue: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/76.
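For illustration, here's a minimal Go sketch of the suspected mismatch. It assumes (the names `device`, `collectMetrics`, and `parentGPU` are hypothetical, not from the plugin's source) that the metrics collector keys its device map by the parent GPU name (`nvidia0`), while the MIG discovery path advertises partition names (`nvidia0/gi0`), so the lookup fails exactly as in the logs:

```go
package main

import (
	"fmt"
	"strings"
)

// device stands in for the per-device state the metrics collector keeps.
type device struct {
	name string
}

// collectMetrics simulates the suspected bug: lookups use the advertised
// MIG partition name, but the map is keyed by the parent GPU name.
func collectMetrics(devices map[string]device, advertised []string) []string {
	var errs []string
	for _, name := range advertised {
		if _, ok := devices[name]; !ok {
			errs = append(errs, fmt.Sprintf(
				"Failed to get device for %s: device %s not found", name, name))
		}
	}
	return errs
}

// parentGPU is one plausible fix: strip the "/giN" suffix before the
// lookup so a MIG partition name resolves to its parent device.
func parentGPU(name string) string {
	if i := strings.IndexByte(name, '/'); i >= 0 {
		return name[:i]
	}
	return name
}

func main() {
	devices := map[string]device{"nvidia0": {name: "nvidia0"}}
	advertised := []string{"nvidia0/gi0"}

	// Reproduces the error shape seen in metrics.go:200.
	for _, e := range collectMetrics(devices, advertised) {
		fmt.Println(e)
	}

	// With the suffix stripped, the lookup succeeds.
	for _, name := range advertised {
		if _, ok := devices[parentGPU(name)]; ok {
			fmt.Printf("resolved %s -> %s\n", name, parentGPU(name))
		}
	}
}
```

Whether the right fix is to normalize the name at lookup time or to register devices under their partition names is a design question for the maintainers; this only demonstrates the failure mode.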
