[CI] Upgrade test frameworks #3979
aeb4751 to cccd59e
Unit Test Results
522 files (-202), 522 suites (-202), 5h 57m 35s ⏱️ (-2h 8m 32s)
For more details on these failures and errors, see this check.
Results for commit e11592b. ± Comparison against base commit 9f88e1d.
This pull request skips 76 tests.
♻️ This comment has been updated with latest results.
Unit Test Results (with flaky tests)
874 files (-14), 874 suites (-14), 11h 20m 57s ⏱️ (+2h 22m 16s)
For more details on these failures and errors, see this check.
Results for commit e11592b. ± Comparison against base commit 9f88e1d.
This pull request skips 76 tests.
♻️ This comment has been updated with latest results.
@liangz1 @irasit @WeichenXu123 @chongxiaoc There is an issue with test Here is a log of the failure:
6910092 to 8b1da81
hmm, this is specific logic from
No worries, thanks for the response!
1dde691 to 417ad99
a91a1c5 to d77ce48
e954cd8 to e31c6e7
5cd1d4d to 9f7722d
Seeing
Signed-off-by: Enrico Minack <[email protected]>
I can't reproduce the bad alloc locally when compiling against PyTorch 2.1.0. Could I have more information about your environment? Was the error triggered in the test pipeline?
9f7722d to e11592b
That can be seen in the CI: https://github.com/horovod/horovod/actions/runs/7364310753/job/20044572743#step:38:30
As well as in the test results: See
To reproduce:
git remote add upstream https://github.com/horovod/horovod.git
git fetch upstream
git checkout upstream/branch-ci-upgrade-test-frameworks-9
docker compose -f docker-compose.test.yml build test-cpu-gloo-py3_9-tf2_14_0-keras2_14_0-torch2_1_2-mxnet1_9_1-pyspark3_4_0
docker run --rm -it horovod-test-cpu-gloo-py3_9-tf2_14_0-keras2_14_0-torch2_1_2-mxnet1_9_1-pyspark3_4_0 /bin/bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)"
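To narrow down which test file triggers the failure, the `ls | xargs` loop from the reproduction steps could be driven one file at a time. The following is only a rough sketch, not part of the Horovod CI: the helper names `horovodrun_cmd` and `run_all` are hypothetical, and it assumes it is run from `/horovod/test/parallel` inside the test container, where `horovodrun` and `/pytest.sh` are available. The `horovodrun` arguments are copied from the command above.

```python
import glob
import shlex
import subprocess


def horovodrun_cmd(test_file, np=2):
    """Build the per-file command that the xargs loop above would run.

    Hypothetical helper; the arguments mirror the reproduction command.
    """
    return [
        "horovodrun", "-np", str(np), "-H", f"localhost:{np}", "--gloo",
        "/bin/bash", "/pytest.sh", "gloo", test_file,
    ]


def run_all(pattern="test_*.py", runner=subprocess.run):
    """Run every matching test file and return the ones that exited non-zero."""
    failed = []
    for test_file in sorted(glob.glob(pattern)):
        cmd = horovodrun_cmd(test_file)
        print("+", shlex.join(cmd))  # echo the command before running it
        result = runner(cmd)
        if result.returncode != 0:
            failed.append(test_file)
    return failed
```

Calling `run_all()` inside the container returns the list of failing test files, which may help separate the failing file from the rest of the suite.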
Have you compiled together with Gloo support? That might be related.
I could reproduce the
I can't identify the cause for now.
Earlier you said you were able to compile against PyTorch 2.1.0 and run the tests successfully: #3979 (comment) How was the setup different to the
I initially compiled Horovod myself using my own scripts. There are probably too many differences from the Docker image, so I propose relying on the Docker recipe instead. I spent some time debugging the bad alloc, and it does seem related to Gloo. Specifically, the device creation here, with the bad alloc being triggered here on the Gloo side. Still investigating to understand why that is!
Could this be related to me incorrectly trying to compile Gloo with C++17 (e11592b)?
@maxhgerlach any idea from a quick look at these spots what this might be related to?
@thomas-bouvier do you have any new findings regarding this
I couldn't make any progress since last time. Feel free to investigate on your side if you have some time. I will get back to it eventually.
Upgrades: