I'm running TorchServe in GKE and I've installed the nvidia-driver-installer according to the TorchServe GPU installation instructions for GKE.
Unfortunately, after a recent reboot of a Kubernetes GPU node, my TorchServe models failed to start. At startup they check whether a GPU is available, which resulted in the following error:
For context, I'm running PyTorch 1.11.0+cu102.
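The actual startup check isn't shown here, but a minimal sketch of that kind of GPU probe (hypothetical, not TorchServe's actual code) would be:

```python
def gpu_available() -> bool:
    """Return True if PyTorch can see a usable CUDA device."""
    try:
        import torch
        # torch.cuda.is_available() returns False when the driver is
        # missing or broken, which is what happened after the reboot.
        return torch.cuda.is_available()
    except ImportError:
        # torch isn't installed in this environment
        return False

if __name__ == "__main__":
    print("CUDA available:", gpu_available())
```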
I forgot to copy-paste this part, but when I checked the CUDA version via `cat /usr/local/cuda/version.txt`, it was 10.2 (I don't recall the patch version).

I was able to resolve the issue by scaling my GPU node pool down to zero and then back up. Luckily, this issue hit a node in our staging cluster, but it could just as easily have been a production node, so it would be great to understand what went wrong here.
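For reference, the scale-down/scale-up workaround can be done with `gcloud` (the cluster, pool, and zone names below are placeholders, not my actual setup):

```shell
# Replace CLUSTER, POOL, and ZONE with your own values.

# Scale the GPU node pool down to zero, draining the bad node...
gcloud container clusters resize CLUSTER \
  --node-pool POOL --zone ZONE --num-nodes 0 --quiet

# ...then back up to its previous size (e.g. 1), so GKE provisions
# a fresh node and the driver installer DaemonSet runs again.
gcloud container clusters resize CLUSTER \
  --node-pool POOL --zone ZONE --num-nodes 1 --quiet
```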