CUDA unknown error when checking torch.cuda.is_available #248

Open
csaroff opened this issue Sep 13, 2022 · 0 comments

I'm running TorchServe on GKE, and I installed the nvidia-driver-installer following the TorchServe GPU installation instructions for GKE.

Unfortunately, after a recent reboot of a Kubernetes GPU node, my TorchServe models failed to start. At startup they check whether a GPU is available, and that check now produces the following error:

>>> torch.cuda.is_available()
/home/venv/lib/python3.8/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
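
To make a failure like the one above stop startup with a clear message instead of silently falling back to zero devices, the availability check can be wrapped so that the warning text is captured and raised. This is only a minimal sketch: `probe_cuda` and `require_cuda` are hypothetical helper names, not part of PyTorch or TorchServe, and the check is passed in as a callable so the helpers do not depend on PyTorch being importable. It also assumes the CUDA initialization warning reaches Python as a regular warning, which may only happen on the first call in a process.

```python
import warnings
from typing import Callable, List, Tuple


def probe_cuda(is_available: Callable[[], bool]) -> Tuple[bool, List[str]]:
    """Run an availability check and return (available, captured warning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning raised inside this block instead of printing it.
        warnings.simplefilter("always")
        available = bool(is_available())
    return available, [str(w.message) for w in caught]


def require_cuda(is_available: Callable[[], bool]) -> None:
    """Raise RuntimeError, including any captured warning text, if the check fails."""
    available, messages = probe_cuda(is_available)
    if not available:
        detail = "; ".join(messages) or "no warning was emitted"
        raise RuntimeError(f"CUDA is not available: {detail}")


# Usage (assumes PyTorch is installed in the serving environment):
#   import torch
#   require_cuda(torch.cuda.is_available)
```

Raising at startup would at least turn the symptom into an explicit pod failure that a liveness or readiness check can surface; it does not explain why the driver was in a bad state after the reboot.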

For context, I'm running PyTorch 1.11.0+cu102:

>>> torch.__version__
'1.11.0+cu102'

I forgot to copy-paste this part, but when I checked the CUDA version via cat /usr/local/cuda/version.txt, it was 10.2. I don't recall the patch version.

I was able to resolve the issue by scaling my GPU node pool down to zero and then scaling it back up. Luckily, this only affected a node in our staging cluster, but it could just as easily have been a production node, so it would be great to understand what went wrong here.
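
For anyone who wants to script the same workaround, here is a rough sketch of the scale-down and scale-up step using `gcloud container clusters resize`. I'm not claiming this is exactly how the resize was done above; the cluster, pool, and zone values are placeholders, `resize_commands` and `recycle_pool` are hypothetical helper names, and a node pool with autoscaling enabled may behave differently.

```python
import subprocess
from typing import List


def resize_commands(cluster: str, pool: str, zone: str, target_nodes: int) -> List[List[str]]:
    """Build two gcloud invocations: scale the pool to zero, then back to target_nodes."""
    def resize(num_nodes: int) -> List[str]:
        return [
            "gcloud", "container", "clusters", "resize", cluster,
            "--node-pool", pool,
            "--num-nodes", str(num_nodes),
            "--zone", zone,
            "--quiet",  # skip the interactive confirmation prompt
        ]
    return [resize(0), resize(target_nodes)]


def recycle_pool(cluster: str, pool: str, zone: str, target_nodes: int, dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them in order, stopping on the first failure."""
    for argv in resize_commands(cluster, pool, zone, target_nodes):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


# Placeholder values; replace with a real cluster, node pool, and zone.
recycle_pool("my-cluster", "gpu-pool", "us-central1-a", target_nodes=1)
```

The default is a dry run that only prints the commands, since scaling a pool to zero evicts every pod on it.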
