I'm running TorchServe in GKE and I've installed the nvidia-driver-installer according to the TorchServe GPU installation instructions for GKE.
Unfortunately, after a recent reboot of a Kubernetes GPU node, my TorchServe models failed to start. At startup they check whether a GPU is available, which resulted in the following error:
For context, I'm running PyTorch 1.11.0+cu102.
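The actual startup check isn't shown here, but a minimal sketch of that kind of GPU probe (hypothetical, not TorchServe's actual code) would be:

```python
def gpu_available() -> bool:
    """Return True if PyTorch can see a usable CUDA device."""
    try:
        import torch
        # torch.cuda.is_available() returns False when the driver is
        # missing or broken, which is what happened after the reboot.
        return torch.cuda.is_available()
    except ImportError:
        # torch isn't installed in this environment
        return False

if __name__ == "__main__":
    print("CUDA available:", gpu_available())
```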
I forgot to copy-paste this part, but when I checked the CUDA version via `cat /usr/local/cuda/version.txt`, it was 10.2 (I don't recall the patch version).

I was able to resolve the issue by scaling my GPU node pool down to zero and then back up. Luckily, this issue hit a node in our staging cluster, but it could just as easily have been a production node, so it would be great to understand what went wrong here.
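For reference, the scale-down/scale-up workaround can be done with `gcloud` (the cluster, pool, and zone names below are placeholders, not my actual setup):

```shell
# Replace CLUSTER, POOL, and ZONE with your own values.

# Scale the GPU node pool down to zero, draining the bad node...
gcloud container clusters resize CLUSTER \
  --node-pool POOL --zone ZONE --num-nodes 0 --quiet

# ...then back up to its previous size (e.g. 1), so GKE provisions
# a fresh node and the driver installer DaemonSet runs again.
gcloud container clusters resize CLUSTER \
  --node-pool POOL --zone ZONE --num-nodes 1 --quiet
```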