
Trying to increase speed by using CUDAExecutionProvider from onnxruntime-gpu instead of CPU onnxruntime; met with a warning about a CPU-GPU transfer bottleneck #62

Open
andrewtvuong opened this issue Jun 20, 2024 · 7 comments


@andrewtvuong

andrewtvuong commented Jun 20, 2024

I want to use CUDA instead of the CPU to increase the speed of tag inference.

My machine: Ubuntu 22.04.3 LTS (GNU/Linux 6.5.0-35-generic x86_64), CUDA 12.2.

I learned from https://onnxruntime.ai/docs/install/ that, as of this writing, if you have CUDA 12 you must install from the CUDA 12 package feed instead of running a plain `pip install onnxruntime-gpu`, which targets CUDA 11.
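The CUDA 12 install command from those docs:

```
pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
```

This took me a while to figure out. I kept getting errors that didn't make sense: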

[E:onnxruntime:, provider_bridge_ort.cc:1744 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1426 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.11: cannot open shared object file: No such file or directory

[W:onnxruntime, onnxruntime_pybind_state.cc:870 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.

I had those shared objects, but after reading carefully and reinstalling per the CUDA 12 instructions above, it worked. Using CUDAExecutionProvider instead of CPUExecutionProvider, however, did produce a new warning:

[W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 12 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.

Basically it's bottlenecked by CPU/GPU data transfer. I've been trying to figure it out but haven't managed to yet.
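For anyone reproducing this, here is roughly how the session gets set up; a minimal sketch, where `wd14.onnx` is a placeholder, not the tagger's actual file name:

```python
import onnxruntime as ort

# Sanity check: the GPU build should list CUDAExecutionProvider here.
print(ort.get_available_providers())

# Verbose logging, as the warning suggests, shows which nodes get
# assigned to the CPU and force the inserted Memcpy transfer nodes.
so = ort.SessionOptions()
so.log_severity_level = 1

# "wd14.onnx" is a placeholder path for the tagger model.
sess = ort.InferenceSession(
    "wd14.onnx",
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # the provider priority order actually in effect
```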

@andrewtvuong
Author

Opened this pull request to show what I tried: #63

@stanleyftf1005

Any luck fixing this? I'm having the same issue; the WD14 tagger has been a pain to run.

@andrewtvuong
Author

I just started looking into this yesterday and will try to fix it when I have more time. Just starting the discussion here in case I'm missing something.

@KurtCocain

It was working fine and then started doing this between two generations, no idea what I did. Has anyone fixed it yet?

@sddiky

sddiky commented Aug 16, 2024

When I have to use CUDAExecutionProvider (because another custom node requires onnxruntime-gpu), WD14 tagging becomes very slow.
Since I can't uninstall onnxruntime-gpu, I just changed ortProviders in pysssss.json from ["CUDAExecutionProvider","CPUExecutionProvider"] to ["CPUExecutionProvider","CUDAExecutionProvider"], and that solved the problem.
Hope this helps.
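For context, onnxruntime treats the providers list as a priority order, so listing CPUExecutionProvider first assigns the whole graph to the CPU even with onnxruntime-gpu installed, which avoids the Memcpy transfer nodes entirely. A minimal sketch of the same idea in plain onnxruntime (the model path is a placeholder):

```python
import onnxruntime as ort

# Providers are tried in list order; putting CPU first sidesteps the
# CPU<->GPU copies, at the cost of running inference on the CPU.
# "wd14.onnx" is a placeholder path.
sess = ort.InferenceSession(
    "wd14.onnx",
    providers=["CPUExecutionProvider", "CUDAExecutionProvider"],
)
print(sess.get_providers())  # shows the priority order in effect
```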

@smallersoup

> When I have to use CUDAExecutionProvider (because another custom node requires onnxruntime-gpu), WD14 tagging becomes very slow. Since I can't uninstall onnxruntime-gpu, I just changed ortProviders in pysssss.json from ["CUDAExecutionProvider","CPUExecutionProvider"] to ["CPUExecutionProvider","CUDAExecutionProvider"], and that solved the problem. Hope this helps.

I'm extremely grateful, the solution you provided has helped me a lot.

@ctrlz526

May I ask if your problem has been solved? I've encountered this issue as well.
