Allreduce cpu example fails with CCL_WORKER_COUNT > 1 #109
Comments
Possible workaround:
FI_PROVIDER=verbs CCL_WORKER_COUNT=2 ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test
FI_PROVIDER=tcp CCL_WORKER_COUNT=2 ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test
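To compare the two providers from the workaround in one go, a small wrapper script can help. This is only a sketch: `build_cmd` and the `DRY_RUN` switch are hypothetical names introduced here, and the relative paths are copied from the commands above, so they assume the script is run from the same directory.

```shell
#!/bin/sh
# Hypothetical helper: prints the command line for one provider/worker-count pair.
# The relative paths are taken from the workaround above; adjust them to your tree.
build_cmd() {
    provider="$1"
    workers="$2"
    printf 'FI_PROVIDER=%s CCL_WORKER_COUNT=%s ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test\n' "$provider" "$workers"
}

# DRY_RUN (an assumption of this sketch) defaults to 1, which only prints the
# commands. Set DRY_RUN=0 to actually launch them.
for provider in verbs tcp; do
    cmd=$(build_cmd "$provider" 2)
    echo "$cmd"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        sh -c "$cmd" || echo "run with FI_PROVIDER=$provider failed" >&2
    fi
done
```

Running it with `DRY_RUN=0` executes both variants back to back, which makes it easier to see whether the failure depends on the selected provider.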
@piotrchmiel Hi. The output of fi_info should list psm3 as an available provider. Do you see it? Please run it and check: https://github.com/oneapi-src/oneCCL/tree/master/deps/ofi/bin
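One way to do that check from a script is sketched below. It assumes `fi_info` is on `PATH` and simply searches its output for the provider name; `has_provider` is a hypothetical helper, not part of libfabric or oneCCL.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the provider name given as $1
# appears as a whole word in the text read from stdin.
has_provider() {
    grep -qw -- "$1"
}

if command -v fi_info >/dev/null 2>&1; then
    # Plain fi_info (no options) prints the providers it can find.
    if fi_info 2>/dev/null | has_provider psm3; then
        echo "psm3 is listed by fi_info"
    else
        echo "psm3 is NOT listed by fi_info"
    fi
else
    echo "fi_info not found on PATH; try the copy under deps/ofi/bin"
fi
```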
@piotrchmiel, you can try this.
I started experimenting with the allreduce example from the main repository: https://github.com/oneapi-src/oneCCL/blob/master/examples/cpu/cpu_allreduce_test.cpp
I modified it slightly by increasing the buffer size 100 times:
When I run it with the CCL_WORKER_COUNT environment variable set to a value > 1, it fails with the following errors:
With CCL_WORKER_COUNT=1 it works perfectly.
What am I doing wrong? Why does it fail? Should I use specific flags when compiling, set a specific environment variable, or pass a specific option to mpirun? It is worth mentioning that with a smaller buffer size (for example 4096 * 10) everything works fine, even with CCL_WORKER_COUNT set to a value > 1.
Attached CCL_LOG_LEVEL=info logs.txt
Attached CCL_LOG_LEVEL=debug logs_debug.txt