
[CI] Upgrade test frameworks #3979

Draft
wants to merge 10 commits into master from branch-ci-upgrade-test-frameworks-9

Conversation

@EnricoMi (Collaborator) commented Sep 6, 2023

Upgrades:

  • torch to 2.0.1 and 2.1.2
  • tensorflow to 2.13.0 and 2.14.0
  • spark to 3.5.0
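To sanity-check these bumps inside one of the upgraded CI images, a minimal sketch like the following can be run; the expected values are assumptions tied to whichever image variant is built:

```python
# Minimal version check, assuming the upgraded frameworks are installed in the image.
import tensorflow as tf
import torch
import pyspark

print("tensorflow:", tf.__version__)   # expected 2.13.0 or 2.14.0 per this PR
print("torch:     ", torch.__version__)   # expected 2.0.1 or 2.1.2 per this PR
print("pyspark:   ", pyspark.__version__) # expected 3.5.0 per this PR
```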

@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch 2 times, most recently from aeb4751 to cccd59e Compare September 6, 2023 08:55
@github-actions bot commented Sep 6, 2023

Unit Test Results

   522 files (−202)   522 suites (−202)   5h 57m 35s ⏱️ (−2h 8m 32s)
   887 tests (±0):   639 ✅ (−129)   195 💤 (+76)   47 ❌ (+47)   6 🔥 (+6)
11 590 runs (−4 619):   7 482 ✅ (−3 871)   3 966 💤 (−890)   130 ❌ (+130)   12 🔥 (+12)

For more details on these failures and errors, see this check.

Results for commit e11592b. ± Comparison against base commit 9f88e1d.

This pull request skips 76 tests.
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_tensorflow.TensorFlowTests ‑ test_gpu_required
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_fused_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_grad_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_variable_size_fused_gpu
…

♻️ This comment has been updated with latest results.

@github-actions bot commented Sep 6, 2023

Unit Test Results (with flaky tests)

   874 files (−14)   874 suites (−14)   11h 20m 57s ⏱️ (+2h 22m 16s)
   887 tests (±0):   639 ✅ (−129)   195 💤 (+76)   47 ❌ (+47)   6 🔥 (+6)
19 064 runs (−1 175):   12 072 ✅ (−1 717)   6 566 💤 (+116)   390 ❌ (+390)   36 🔥 (+36)

For more details on these failures and errors, see this check.

Results for commit e11592b. ± Comparison against base commit 9f88e1d.

This pull request skips 76 tests.
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_tensorflow.TensorFlowTests ‑ test_gpu_required
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_fused_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_grad_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_variable_size_fused_gpu
…

♻️ This comment has been updated with latest results.

@EnricoMi (Collaborator, Author) commented Sep 6, 2023

@liangz1 @irasit @WeichenXu123 @chongxiaoc There is an issue with the test test_dbfs_local_store in test/integration/test_spark.py when it is run against TensorFlow 2.13. Can one of you have a look, please?

Here is a log of the failure:
https://github.com/horovod/horovod/actions/runs/6097139901/job/16544192396#step:207:168-176
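For context, an illustrative sketch (not the actual test code) of the kind of Keras save/load round-trip that the DBFS local store ultimately relies on; the model and path below are made up for the example:

```python
# Illustration only: test_dbfs_local_store exercises Databricks-specific checkpoint
# handling, which boils down to round-tripping a tf.keras model through a local path.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.save("/tmp/checkpoint.h5")  # example HDF5 checkpoint path
restored = tf.keras.models.load_model("/tmp/checkpoint.h5")
restored.summary()
```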

@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch from 6910092 to 8b1da81 Compare September 6, 2023 20:22
@chongxiaoc (Collaborator) commented:
Hmm, this is Databricks-specific logic for model serialization and deserialization.
I'm not familiar with it.

@EnricoMi (Collaborator, Author) commented Sep 7, 2023

> Not familiar with this.

No worries, thanks for the response!

@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch from 1dde691 to 417ad99 Compare September 27, 2023 19:10
@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch 2 times, most recently from a91a1c5 to d77ce48 Compare October 27, 2023 17:58
@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch from e954cd8 to e31c6e7 Compare December 29, 2023 15:10
@EnricoMi EnricoMi changed the base branch from master to ci-pin-docker December 29, 2023 15:11
@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch 2 times, most recently from 5cd1d4d to 9f7722d Compare December 30, 2023 13:11
@EnricoMi (Collaborator, Author) commented Dec 30, 2023

Seeing std::bad_alloc when calling into horovod.torch.init():

[0]<stderr>:terminate called after throwing an instance of 'std::bad_alloc'
[0]<stderr>:  what():  std::bad_alloc
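For reference, a minimal sketch of the failing call path; whether init() alone reproduces the crash outside the full test suite is an assumption. Launch it under Gloo, e.g. `horovodrun -np 2 -H localhost:2 --gloo python repro.py`:

```python
# repro.py: nothing beyond horovod.torch.init(), which is where the std::bad_alloc is thrown.
import horovod.torch as hvd

hvd.init()
print(f"rank {hvd.rank()} of {hvd.size()} initialized")
```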

Base automatically changed from ci-pin-docker to master January 5, 2024 20:04
@thomas-bouvier (Contributor) commented:
I can't reproduce the bad_alloc locally when compiling against PyTorch 2.1.0. Could I have more information about your environment? Was the error triggered in the test pipeline?

@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch from 9f7722d to e11592b Compare January 15, 2024 08:14
@EnricoMi (Collaborator, Author) commented Jan 15, 2024

That can be seen in the CI: https://github.com/horovod/horovod/actions/runs/7364310753/job/20044572743#step:38:30

[0]<stderr>:terminate called after throwing an instance of 'std::bad_alloc'
[0]<stderr>:  what():  std::bad_alloc

It also shows up in the test results: see test.integration.test_interactiverun.InteractiveRunTests.

To reproduce:

git remote add upstream https://github.com/horovod/horovod.git
git fetch upstream
git checkout upstream/branch-ci-upgrade-test-frameworks-9
docker compose -f docker-compose.test.yml build test-cpu-gloo-py3_9-tf2_14_0-keras2_14_0-torch2_1_2-mxnet1_9_1-pyspark3_4_0
docker run --rm -it horovod-test-cpu-gloo-py3_9-tf2_14_0-keras2_14_0-torch2_1_2-mxnet1_9_1-pyspark3_4_0 /bin/bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)"

@EnricoMi (Collaborator, Author) commented:
> I can't reproduce the bad_alloc locally when compiling against PyTorch 2.1.0.

Have you compiled with Gloo support as well? That might be related.
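A quick way to check that without recompiling, using Horovod's build-introspection helpers (gloo_built() and mpi_built() are part of Horovod's public API):

```python
# Reports whether the installed Horovod build was compiled with Gloo / MPI support.
import horovod.torch as hvd

print("gloo built:", hvd.gloo_built())
print("mpi built: ", hvd.mpi_built())
```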

@thomas-bouvier (Contributor) commented:
I could reproduce the bad_alloc using the following:

docker run --rm -it horovod-test-cpu-gloo-py3_9-tf2_14_0-keras2_14_0-torch2_1_2-mxnet1_9_1-pyspark3_4_0 /bin/bash -c " cd /horovod/test/parallel && (ls -1 test_torch.py | xargs -n 1 horovodrun -np 1 -H localhost:1 --gloo /bin/bash /pytest.sh gloo)"

I haven't identified the cause yet.

@EnricoMi (Collaborator, Author) commented:
Earlier you said you were able to compile against PyTorch 2.1.0 and run the tests successfully: #3979 (comment)

How was that setup different from the bad_alloc setup?

@thomas-bouvier (Contributor) commented Jan 19, 2024

I initially compiled Horovod myself using my own scripts. There are probably too many differences from the Docker image, so I propose relying on the Docker recipe instead.

I spent some time debugging the bad_alloc and it does seem related to Gloo. Specifically, the device creation here, with the bad_alloc being triggered here on the Gloo side. Still investigating to understand why that is!
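A hedged isolation sketch, complementing the minimal repro earlier in the thread: pin Horovod's controller and CPU operations to Gloo via its documented environment variables, so that init() has to go through the Gloo device-creation path. Whether this minimal script alone triggers the bad_alloc is an assumption.

```python
# Force the Gloo controller and Gloo CPU operations before Horovod is imported/initialized.
import os

os.environ.setdefault("HOROVOD_CONTROLLER", "gloo")
os.environ.setdefault("HOROVOD_CPU_OPERATIONS", "gloo")

import horovod.torch as hvd  # imported after setting the env vars on purpose

hvd.init()  # Gloo device creation happens during init on this path
```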

@EnricoMi (Collaborator, Author) commented Jan 20, 2024

Might this be related to me incorrectly trying to compile Gloo with C++17? e11592b

@EnricoMi (Collaborator, Author) commented:
> I spent some time debugging the bad_alloc and it does seem related to Gloo. Specifically, the device creation here, with the bad_alloc being triggered here on the Gloo side. Still investigating to understand why that is!

@maxhgerlach any idea from a quick look at these spots what this might be related to?

@EnricoMi (Collaborator, Author) commented:
@thomas-bouvier do you have any new findings regarding this bad_alloc?

@thomas-bouvier (Contributor) commented:
I haven't been able to make any progress since last time. Feel free to investigate on your side if you have some time; I will get back to it eventually.
