
GPU Partial Usage and Detection Issues on HPC #65

@PandoraMagis

Description


Hi,

I'm installing ReLERNN on an HPC and need to set it up within a Conda environment that includes a specific CUDA version.

However, every execution and environment variation I have tried results in only partial GPU utilization. The GPU cards are requested multiple times: sometimes they are detected but then followed by an error (see below), and other times they are not detected at all within the same run.

Even more frustrating, the process still shows up in nvidia-smi, consuming around 4 GB of VRAM.

I installed TensorFlow 2.15 and TensorRT via pip.
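
Inside that environment, a minimal check along the following lines (plain TensorFlow 2.x API, nothing ReLERNN-specific) should show whether TensorFlow sees the GPU at all, independently of ReLERNN:

import tensorflow as tf

# List the physical GPUs TensorFlow can see: an empty list corresponds to the
# "GPU not found" log below, a non-empty list to the "GPU found but with errors" case.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# If a GPU is reported, place a small op on it to check whether it is actually usable.
if gpus:
    with tf.device("/GPU:0"):
        x = tf.random.uniform((1024, 1024))
        print(tf.reduce_sum(tf.matmul(x, x)).numpy())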

If you need additional error logs, I can provide them.

Thanks for your help!

GPU found, but with errors I am still struggling to fix (mid-execution):

2025-03-26 15:05:07.401896: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-26 15:05:07.403892: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-26 15:05:07.440774: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-26 15:05:07.441209: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-26 15:05:07.934751: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2025-03-26 15:05:08.450044: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2025-03-26 15:05:08.487091: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
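
Since the block above ends with "Cannot dlopen some GPU libraries", a quick way to compare what the installed wheel expects against what the job actually provides is a sketch like this (tf.sysconfig.get_build_info is public TensorFlow API; the last line simply prints the library path the batch job sees):

import os
import tensorflow as tf

# CUDA/cuDNN versions the installed TensorFlow wheel was built against.
info = tf.sysconfig.get_build_info()
print("Built for CUDA:", info.get("cuda_version"))
print("Built for cuDNN:", info.get("cudnn_version"))

# The dlopen failures above usually mean the matching libcudart/libcudnn
# are not on the library search path inside the job.
print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<unset>"))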

GPU not found (start and end of execution):

2025-03-26 15:03:56.835916: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-26 15:03:56.837956: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-26 15:03:56.873827: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-26 15:03:56.874272: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-26 15:03:57.352307: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
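
Both runs also print "TF-TRT Warning: Could not find TensorRT". To check whether the pip-installed TensorRT is even importable inside the same environment, I can run something like the probe below (the import name tensorrt is the one used by the NVIDIA pip wheels; treat it as an assumption if the cluster provides TensorRT some other way):

# Probe whether TensorRT can be imported and where its package lives.
try:
    import tensorrt
    print("TensorRT", tensorrt.__version__, "from", tensorrt.__file__)
except ImportError as err:
    print("TensorRT not importable:", err)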
