Skip to content

Conversation

mattip
Copy link
Contributor

@mattip mattip commented Mar 25, 2025

Why are these changes needed?

The file upb/message/internal/message.c used by upb, which is used by protobuf, is not compatible with latest windows SDKs, see the visual studio discussion. Unfortunately in the current version of protobuf, upb is a transient dependency which makes patching it very difficult. Over at conda-forge we updated to grpc1.67 and protobuf 28.2. The newer version vendors upb, which gives us a cleaner path to patching the file.

Basically this PR is a backport of the work done in conda-forge to adapt to later windows SDK versions, and also helps with the transition to Visual Studio 2022 (currently the CI build uses Visual Studio 2019).

Related issue number

This is related to internal work done to update the CI build to use a windows 11 image.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@mattip mattip requested a review from a team as a code owner March 25, 2025 10:07
@mattip
Copy link
Contributor Author

mattip commented Mar 25, 2025

Replaces #48724

@dentiny
Copy link
Contributor

dentiny commented Mar 25, 2025

Wow you're a hero!

@dentiny dentiny added the go add ONLY when ready to merge, run all tests label Mar 25, 2025
@dentiny dentiny self-assigned this Mar 25, 2025
@mattip
Copy link
Contributor Author

mattip commented Mar 25, 2025

I am seeing an error ERROR: An error occurred during the fetch of repository 'python3_9_x86_64-unknown-linux-gnu':, is that expected?

@dentiny
Copy link
Contributor

dentiny commented Mar 25, 2025

I am seeing an error ERROR: An error occurred during the fetch of repository 'python3_9_x86_64-unknown-linux-gnu':, is that expected?

No, we should be able to pass all checks

@mattip
Copy link
Contributor Author

mattip commented Mar 27, 2025

CI got confused with the rebase on master, closing and reopening to hopefully clear the error.

@mattip
Copy link
Contributor Author

mattip commented Mar 28, 2025

This now has one failing test with a GOAWAY error. Is this something someone has seen before?

=================================== FAILURES ===================================
______________________________ test_worker_stdout ______________________________

  assert err_str_sec_last.endswith("def")

E assert False
E + where False = <built-in method endswith of str object at 0x7f16962c78f0>('def')
E + where <built-in method endswith of str object at 0x7f16962c78f0>
= 'I0000 00:00:1743154602.725131 18324 chttp2_transport.cc:1182] ipv4:172.16.0.3:51349:
Got goaway [2] err=UNAVAILABLE:GOAWAY received;
Error code: 2; Debug Text: Cancelling all calls
{created_time:"2025-03-28T09:36:42.725123699+00:00", http2_error:2, grpc_status:14}'.endswith

@mattip
Copy link
Contributor Author

mattip commented Mar 29, 2025

There is a ubisan compilation failure around abseil and a constant pointer. This is due to abseil/abseil-cpp#1634, which has a patch in one of the comments. Apparently the abseil devs decided this a gcc bug and do not plan on fixing it. I am not sure what version of gcc is being used in the build.

@dayshah
Copy link
Contributor

dayshah commented Mar 29, 2025

There is a ubisan compilation failure around abseil and a constant pointer. This is due to abseil/abseil-cpp#1634, which has a patch in one of the comments. Apparently the abseil devs decided this a gcc bug and do not plan on fixing it. I am not sure what version of gcc is being used in the build.

afaik the core cpp ubsan ci tests run on an ubuntu 20.04 image that we install build-essential on. The version ends up being gcc 9.4.0.

curl python-is-python3 git build-essential \

If we upgrade the ubuntu to 22.04 there, it'd become gcc 11.2 with the build-essential install. Or maybe we can just install gcc differently. This would also be one step forward towards the C++20 upgrade we want to do. @aslonnie which way makes sense are there any blockers to either?

Note: To build wheels we're stuck to gcc 10.2 which manylinux2014 comes with by default but we don't run ubsan there so should be ok for this case

@mattip
Copy link
Contributor Author

mattip commented Mar 30, 2025

To build wheels we're stuck to gcc 10.2 which manylinux2014 comes with by default

The next step after manylinux2014 is manylinux_2_28, which NumPy will be adopting for its next (June 2025) release (manylinux2024==manylinux_2_17). This is based on redhat8/cetos8/almalinux8, where manylinux2024 was based on centos7. Note that the EOL date for amazonlinux2, which might be still used by some aws images, was changed from this year to June 2026, so ray may want to stay with manylinux2024. I am basing my analysis on the PEP 600 compliance chart

In any case I will try to patch abseil with the patch in the issue comments for ubsan.

Edit: here is the amazonlinux2 EOL announcement, with the note "Amazon Linux 2 has had its LTS EOL extended multiple times from the originally scheduled date of June 2023"

@mattip

This comment was marked as resolved.

mattip added 2 commits July 28, 2025 09:42
Signed-off-by: Matti Picus <[email protected]>
Signed-off-by: Matti Picus <[email protected]>
Signed-off-by: Matti Picus <[email protected]>
@mattip
Copy link
Contributor Author

mattip commented Jul 29, 2025

github was having some outages which might have caused CI download failures, could someone rerun the failed builds?

@edoakes
Copy link
Collaborator

edoakes commented Jul 29, 2025

github was having some outages which might have caused CI download failures, could someone rerun the failed builds?

merged master which should re-run CI

@mattip
Copy link
Contributor Author

mattip commented Aug 19, 2025

I can reproduce the data testing hang in a specific test via python -m pytest python/ray/data/tests/test_groupby_e2e.py locally on Unbuntu 24.04 with python3.9 python3.9 and python3.12. When I run only the hanging test (test_groupby_nans[hash_shuffle-134217728-pyarrow] (via -k ...) the test passes. I see various errors in the logs, which I uploaded.
logs.zip. Maybe due to killing the test with CTRL-C?

@mattip
Copy link
Contributor Author

mattip commented Aug 19, 2025

The dashboard test failure seems to be network-related.

@mattip
Copy link
Contributor Author

mattip commented Aug 24, 2025

Hmm. PR #53194 caused a conflict since it adds a new abseil patch. In this comment @israbbani suggested the patch will be short-lived. Should I merge this patch in or wait for the subsequent fix that will remove it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUILD: patch zlib for macos and protobuf for windows