
[CI] Upgrade test frameworks #3979

Draft
wants to merge 10 commits into master from branch-ci-upgrade-test-frameworks-9

Conversation

@EnricoMi (Collaborator) commented Sep 6, 2023

Upgrades:

  • torch to 2.0.1 and 2.1.2
  • tensorflow to 2.13.0 and 2.14.0
  • spark to 3.5.0
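To sanity-check these bumps inside one of the upgraded CI images, a minimal sketch like the following can be run; the expected values are assumptions tied to whichever image variant is built:

```python
# Minimal version check, assuming the upgraded frameworks are installed in the image.
import tensorflow as tf
import torch
import pyspark

print("tensorflow:", tf.__version__)   # expected 2.13.0 or 2.14.0 per this PR
print("torch:     ", torch.__version__)   # expected 2.0.1 or 2.1.2 per this PR
print("pyspark:   ", pyspark.__version__) # expected 3.5.0 per this PR
```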

@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch 2 times, most recently from aeb4751 to cccd59e Compare September 6, 2023 08:55
@github-actions bot commented Sep 6, 2023

Unit Test Results

   522 files (−202)   522 suites (−202)   5h 57m 35s ⏱️ (−2h 8m 32s)
   887 tests (±0):   639 ✅ (−129)   195 💤 (+76)   47 ❌ (+47)   6 🔥 (+6)
11 590 runs (−4 619):   7 482 ✅ (−3 871)   3 966 💤 (−890)   130 ❌ (+130)   12 🔥 (+12)

For more details on these failures and errors, see this check.

Results for commit e11592b. ± Comparison against base commit 9f88e1d.

This pull request skips 76 tests.
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_tensorflow.TensorFlowTests ‑ test_gpu_required
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_fused_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_grad_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_variable_size_fused_gpu
…

♻️ This comment has been updated with latest results.

@github-actions bot commented Sep 6, 2023

Unit Test Results (with flaky tests)

   874 files (−14)   874 suites (−14)   11h 20m 57s ⏱️ (+2h 22m 16s)
   887 tests (±0):   639 ✅ (−129)   195 💤 (+76)   47 ❌ (+47)   6 🔥 (+6)
19 064 runs (−1 175):   12 072 ✅ (−1 717)   6 566 💤 (+116)   390 ❌ (+390)   36 🔥 (+36)

For more details on these failures and errors, see this check.

Results for commit e11592b. ± Comparison against base commit 9f88e1d.

This pull request skips 76 tests.
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_tensorflow.TensorFlowTests ‑ test_gpu_required
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_fused_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_grad_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_variable_size_fused_gpu
…

♻️ This comment has been updated with latest results.

@EnricoMi (Collaborator, Author) commented Sep 6, 2023

@liangz1 @irasit @WeichenXu123 @chongxiaoc There is an issue with the test test_dbfs_local_store in test/integration/test_spark.py when it is run against TensorFlow 2.13. Can one of you have a look, please?

Here is a log of the failure:
https://github.com/horovod/horovod/actions/runs/6097139901/job/16544192396#step:207:168-176
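For context, an illustrative sketch (not the actual test code) of the kind of Keras save/load round-trip that the DBFS local store ultimately relies on; the model and path below are made up for the example:

```python
# Illustration only: test_dbfs_local_store exercises Databricks-specific checkpoint
# handling, which boils down to round-tripping a tf.keras model through a local path.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.save("/tmp/checkpoint.h5")  # example HDF5 checkpoint path
restored = tf.keras.models.load_model("/tmp/checkpoint.h5")
restored.summary()
```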

@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch from 6910092 to 8b1da81 Compare September 6, 2023 20:22
@chongxiaoc (Collaborator) commented:
Hmm, this is Databricks-specific logic for model serialization and deserialization.
I'm not familiar with it.

@EnricoMi (Collaborator, Author) commented Sep 7, 2023

> Not familiar with this.

No worries, thanks for the response!

@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch from 1dde691 to 417ad99 Compare September 27, 2023 19:10
@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch 2 times, most recently from a91a1c5 to d77ce48 Compare October 27, 2023 17:58
@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch from e954cd8 to e31c6e7 Compare December 29, 2023 15:10
@EnricoMi EnricoMi changed the base branch from master to ci-pin-docker December 29, 2023 15:11
@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch 2 times, most recently from 5cd1d4d to 9f7722d Compare December 30, 2023 13:11
@EnricoMi (Collaborator, Author) commented Dec 30, 2023

Seeing std::bad_alloc when calling into horovod.torch.init():

[0]<stderr>:terminate called after throwing an instance of 'std::bad_alloc'
[0]<stderr>:  what():  std::bad_alloc
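For reference, a minimal sketch of the failing call path; whether init() alone reproduces the crash outside the full test suite is an assumption. Launch it under Gloo, e.g. `horovodrun -np 2 -H localhost:2 --gloo python repro.py`:

```python
# repro.py: nothing beyond horovod.torch.init(), which is where the std::bad_alloc is thrown.
import horovod.torch as hvd

hvd.init()
print(f"rank {hvd.rank()} of {hvd.size()} initialized")
```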

Base automatically changed from ci-pin-docker to master January 5, 2024 20:04
@thomas-bouvier (Contributor) commented:
I can't reproduce the bad_alloc locally when compiling against PyTorch 2.1.0. Could I have more information about your environment? Was the error triggered in the test pipeline?

@EnricoMi EnricoMi force-pushed the branch-ci-upgrade-test-frameworks-9 branch from 9f7722d to e11592b Compare January 15, 2024 08:14
@EnricoMi (Collaborator, Author) commented Jan 15, 2024

That can be seen in the CI: https://github.com/horovod/horovod/actions/runs/7364310753/job/20044572743#step:38:30

[0]<stderr>:terminate called after throwing an instance of 'std::bad_alloc'
[0]<stderr>:  what():  std::bad_alloc

It also shows up in the test results: see test.integration.test_interactiverun.InteractiveRunTests.

To reproduce:

git remote add upstream https://github.com/horovod/horovod.git
git fetch upstream
git checkout upstream/branch-ci-upgrade-test-frameworks-9
docker compose -f docker-compose.test.yml build test-cpu-gloo-py3_9-tf2_14_0-keras2_14_0-torch2_1_2-mxnet1_9_1-pyspark3_4_0
docker run --rm -it horovod-test-cpu-gloo-py3_9-tf2_14_0-keras2_14_0-torch2_1_2-mxnet1_9_1-pyspark3_4_0 /bin/bash -c " cd /horovod/test/parallel && (ls -1 test_*.py | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo /bin/bash /pytest.sh gloo)"

@EnricoMi (Collaborator, Author) commented:
> I can't reproduce the bad_alloc locally when compiling against PyTorch 2.1.0.

Have you compiled with Gloo support as well? That might be related.
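A quick way to check that without recompiling, using Horovod's build-introspection helpers (gloo_built() and mpi_built() are part of Horovod's public API):

```python
# Reports whether the installed Horovod build was compiled with Gloo / MPI support.
import horovod.torch as hvd

print("gloo built:", hvd.gloo_built())
print("mpi built: ", hvd.mpi_built())
```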

@thomas-bouvier (Contributor) commented:
I could reproduce the bad_alloc using the following:

docker run --rm -it horovod-test-cpu-gloo-py3_9-tf2_14_0-keras2_14_0-torch2_1_2-mxnet1_9_1-pyspark3_4_0 /bin/bash -c " cd /horovod/test/parallel && (ls -1 test_torch.py | xargs -n 1 horovodrun -np 1 -H localhost:1 --gloo /bin/bash /pytest.sh gloo)"

I haven't identified the cause yet.

@EnricoMi (Collaborator, Author) commented:
Earlier you said you were able to compile against PyTorch 2.1.0 and run the tests successfully: #3979 (comment)

How was that setup different from the bad_alloc setup?

@thomas-bouvier (Contributor) commented Jan 19, 2024

I initially compiled Horovod myself using my own scripts. There are probably too many differences from the Docker image, so I propose relying on the Docker recipe instead.

I spent some time debugging the bad_alloc and it does seem related to Gloo. Specifically, the device creation here, with the bad_alloc being triggered here on the Gloo side. Still investigating to understand why that is!
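A hedged isolation sketch, complementing the minimal repro earlier in the thread: pin Horovod's controller and CPU operations to Gloo via its documented environment variables, so that init() has to go through the Gloo device-creation path. Whether this minimal script alone triggers the bad_alloc is an assumption.

```python
# Force the Gloo controller and Gloo CPU operations before Horovod is imported/initialized.
import os

os.environ.setdefault("HOROVOD_CONTROLLER", "gloo")
os.environ.setdefault("HOROVOD_CPU_OPERATIONS", "gloo")

import horovod.torch as hvd  # imported after setting the env vars on purpose

hvd.init()  # Gloo device creation happens during init on this path
```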

@EnricoMi (Collaborator, Author) commented Jan 20, 2024

Might this be related to me incorrectly trying to compile Gloo with C++17? e11592b

@EnricoMi (Collaborator, Author) commented:
> I spent some time debugging the bad_alloc and it does seem related to Gloo. Specifically, the device creation here, with the bad_alloc being triggered here on the Gloo side. Still investigating to understand why that is!

@maxhgerlach any idea from a quick look at these spots what this might be related to?

@EnricoMi (Collaborator, Author) commented:
@thomas-bouvier do you have any new findings regarding this bad_alloc?

@thomas-bouvier (Contributor) commented:
I haven't been able to make any progress since last time. Feel free to investigate on your side if you have some time; I will get back to it eventually.
