Conversation

@mattip (Contributor) commented Sep 2, 2025

Unvendor boost, refactored from #224. Closes #229. Also clean up some leftover comments and debug cruft.

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

@conda-forge-admin

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/recipe.yaml) and found it was in an excellent condition.

@mattip (Contributor, Author) commented Sep 2, 2025

@conda-forge-admin, please rerender

@apmorton (Contributor) commented Sep 2, 2025

fyi, I'm working on unvendoring more stuff locally; was planning to land them all at once

@mattip (Contributor, Author) commented Sep 2, 2025

Ahh, cool. Should I close this, or do you want to see if the robocopy works on Windows?

mkdir thirdparty\systemlibs
robocopy /E %RECIPE_DIR%\systemlibs thirdparty\systemlibs

@apmorton (Contributor) commented Sep 2, 2025

Let's see if robocopy works; I'll open a PR once I have things building on Linux with the major painful libraries unvendored.


@mattip (Contributor, Author) commented on the Windows build script:

echo ==========================================================
echo calling pip to install
echo ==========================================================
cd python
echo startup --output_user_root=D:/tmp >> ..\.bazelrc
echo build --jobs=1 >> ..\.bazelrc

Bazel is crashing while building. Maybe this is needed?
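
If the crash is the worker running out of memory (an assumption on my part, the logs don't confirm it), bazel's resource limits can be appended to the same .bazelrc alongside --jobs=1, e.g.:

# cap bazel's scheduler below the machine's nominal resources
echo "build --local_ram_resources=HOST_RAM*.5" >> .bazelrc
echo "build --local_cpu_resources=2" >> .bazelrc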

@mattip (Contributor, Author) commented Sep 3, 2025

osx_64 is failing with a missing boost symbol:

OSError: dlopen($PREFIX/lib/python3.10/site-packages/ray/_raylet.so, 0x000A): \
symbol not found in flat namespace '__ZN5boost6chrono12steady_clock3nowEv'

@h-vetinari (Member)

__ZN5boost6chrono12steady_clock3nowEv

That sounds like a stdlib symbol. Looking upstream, boost expects that symbol unconditionally. You could try setting -D_LIBCPP_DISABLE_AVAILABILITY or bumping the c_stdlib_version.
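
For reference, a rough sketch of trying both (untested; the library path and where exactly to set things are assumptions):

# confirm whether conda-forge's boost actually exports the symbol (macOS nm)
nm -gU $PREFIX/lib/libboost_chrono.dylib | grep steady_clock

# option 1: disable libc++ availability guards in the build script
export CXXFLAGS="${CXXFLAGS} -D_LIBCPP_DISABLE_AVAILABILITY"

# option 2: bump c_stdlib_version for osx in conda_build_config.yaml instead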

@apmorton (Contributor) commented Sep 3, 2025

Unvendoring protobuf/grpc is shaping up to be a monumental feat that likely requires non-trivial upstream changes to land:

  • ray transitively pulls in many bazel rules it requires from grpc
  • ray is on bazel 6.5.0
  • newer grpc/etc. versions require a newer bazel
  • upgrading bazel requires upgrading many dependencies (including grpc 🙄)

This is quite the mess to untangle.

@mattip (Contributor, Author) commented Sep 3, 2025

At some point conda-forge's build will diverge so far from upstream that we may as well use meson to build what we need.

@apmorton (Contributor) commented Sep 3, 2025

Patches like this also effectively render it impossible to unvendor libraries.

That patch isn't fixing a compilation bug or backporting some feature; it's changing behavior.
There are a number of patches like that in ray, unfortunately; spdlog is another.

@apmorton (Contributor) commented Sep 3, 2025

Potentially hot take: should this feedstock switch to binary repackaging of upstream wheels?

Maintaining ray on conda-forge is getting more complicated as conda-forge moves forward with bazel, protobuf, grpc, etc. and ray does not.

There is fairly little upstream movement on modernization of the build stack, and we're already several grpc migrations behind with no clear path forward IMO.

I'm not actually sure if binary repackaging solves the grpc version problem, but it's maybe worth a try?
Historically this would deadlock if ray was compiled with a (significantly enough?) different version of grpc:

          - ray start --head
          - python -c "import ray; ray.init('ray://127.0.0.1:10001')"
          - ray stop
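
(For anyone reproducing this: the failure shows up as a hang/timeout, so it's worth confirming up front which grpcio the environment actually has, since a mismatched grpcio is the usual suspect:)

# grpcio version seen by the Python client in the active environment
python -c "import grpc; print(grpc.__version__)"
pip freeze | grep grpcio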

@h-vetinari (Member)

This to me is (unfortunately) the cost of bazel, for all its other benefits. Well, that and patching third-party dependencies, which should be a stop-gap measure, not a long-term strategy (but of course, who has time to draw down all the tech debt...).

Perhaps a minimal meson build is actually the right approach. If that works, perhaps you could even convince upstream to carry it in the repo as a not-officially-supported build orchestrator.

Of course, I think that bazel really should allow local overrides without jumping through these ridiculous hoops. But try to tell them that not everyone needs or wants hermetically isolated builds...

I'm not in favour of binary repackaging here, but if there's no other option and it doesn't cause worse problems overall... 🤷

@timkpaine (Member)

Potentially hot take: should this feedstock switch to binary repackaging of upstream wheels?

Maintaining ray on conda-forge is getting more complicated as conda-forge moves forward with bazel, protobuf, grpc, etc. and ray does not.

+1, repackage and be done with the mess. Hopefully all third-party symbols are hidden upstream; if not, that is probably easy to accomplish.

For a brief few months, it was working great 😅

@mattip (Contributor, Author) commented Sep 4, 2025

If we repackage, we would have to run many more tests. A PR to backport the protobuf/abseil/grpc version update in this feedstock consistently fails a single test file, each time in a different test, suggesting some kind of race condition. Edit: we have gotten reports of workload failures due to mixing gRPC versions.

@apmorton (Contributor) commented Sep 4, 2025

Perhaps a minimal meson build is actually the right approach

The catch with this is all the behavior-changing patches ray applies.

Right now they have:

thirdparty/patches/grpc-configurable-thread-count.patch
thirdparty/patches/opencensus-cpp-harvest-interval.patch
thirdparty/patches/opencensus-cpp-shutdown-api.patch
thirdparty/patches/spdlog-rotation-file-format.patch

The grpc one is pretty recent; the spdlog one has been around forever.

I think all the current ones can be worked around/fixed upstream, but more philosophically, if we maintain a parallel build (even upstream) and ray developers continue to patch their deps with careless abandon, we'll have bugs that don't exist in the official binaries.

@h-vetinari (Member)

Edit: we have gotten reports of workload failures due to mixing gRPC versions.

That's kind of the main argument in favour of unvendoring IMO. That way you can consistently build against one version. The mixing between versions can still happen if you build a local copy that's linked statically, unless you take very thorough care that grpc gets completely absorbed.
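
A quick way to check how completely grpc has been absorbed (a sketch, Linux/GNU nm; the path lookup is just for illustration) is to list the grpc_core symbols that _raylet.so re-exports dynamically, since those are the ones that can interpose on conda-forge's libgrpc at runtime:

nm -D -C --defined-only "$(python -c 'import os, ray; print(os.path.join(os.path.dirname(ray.__file__), "_raylet.so"))')" | grep 'grpc_core::' | head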

@apmorton (Contributor) commented Sep 4, 2025

Running a binary-repackaged ray with conda-forge grpcio results in the following segfault, which is the root cause of the aforementioned test failure:

(gdb) bt
#0  0x00007bc715083f99 in grpc_core::Server::ValidateServerRequestAndCq(unsigned long*, grpc_completion_queue*, void*, grpc_byte_buffer**, grpc_core::Server::RegisteredMethod*) () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/ray/_raylet.so
#1  0x00007bc715085c36 in grpc_core::Server::RequestCall(grpc_call**, grpc_call_details*, grpc_metadata_array*, grpc_completion_queue*, grpc_completion_queue*, void*) () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/ray/_raylet.so
#2  0x00007bc6e71521eb in grpc_server_request_call () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/grpc/_cython/../../../../libgrpc.so.48
#3  0x00007bc711344c17 in __pyx_gb_4grpc_7_cython_6cygrpc_9AioServer_10generator38(__pyx_CoroutineObject*, _ts*, _object*) () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/grpc/_cython/cygrpc.cpython-312-x86_64-linux-gnu.so
#4  0x00007bc7112d352b in __Pyx_Coroutine_SendEx(__pyx_CoroutineObject*, _object*, _object**, int) () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/grpc/_cython/cygrpc.cpython-312-x86_64-linux-gnu.so
#5  0x00007bc71131c4ea in __Pyx_Coroutine_AmSend(_object*, _object*, _object**) () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/grpc/_cython/cygrpc.cpython-312-x86_64-linux-gnu.so
#6  0x00007bc71131c5b2 in __Pyx_Coroutine_Yield_From_Coroutine(__pyx_CoroutineObject*, _object*, _object**) () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/grpc/_cython/cygrpc.cpython-312-x86_64-linux-gnu.so
#7  0x00007bc71133ba40 in __pyx_gb_4grpc_7_cython_6cygrpc_9AioServer_13generator39(__pyx_CoroutineObject*, _ts*, _object*) () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/grpc/_cython/cygrpc.cpython-312-x86_64-linux-gnu.so
#8  0x00007bc7112d352b in __Pyx_Coroutine_SendEx(__pyx_CoroutineObject*, _object*, _object**, int) () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/grpc/_cython/cygrpc.cpython-312-x86_64-linux-gnu.so
#9  0x00007bc71131c4ea in __Pyx_Coroutine_AmSend(_object*, _object*, _object**) () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/grpc/_cython/cygrpc.cpython-312-x86_64-linux-gnu.so
#10 0x00007bc7167ac274 in task_step_impl (state=state@entry=0x7bc716a96df0, task=task@entry=0x7bc71032b940, exc=exc@entry=0x0) at /usr/local/src/conda/python-3.12.11/Modules/_asynciomodule.c:2869
#11 0x00007bc7167aca63 in task_step (state=0x7bc716a96df0, task=0x7bc71032b940, exc=0x0) at /usr/local/src/conda/python-3.12.11/Modules/_asynciomodule.c:3188
#12 0x0000612acf89117c in _PyObject_MakeTpCall (tstate=0x612acfd1a910 <_PyRuntime+458992>, callable=0x7bc712a6dc90, args=0x7bc711d62cd0, nargs=<optimized out>, keywords=0x0) at /usr/local/src/conda/python-3.12.11/Objects/call.c:240
#13 0x0000612acf852105 in context_run (self=0x7bc7108fbc80, args=0x7bc711d62cc8, nargs=1, kwnames=0x0) at /usr/local/src/conda/python-3.12.11/Python/context.c:668
#14 0x0000612acf8a3f0b in cfunction_vectorcall_FASTCALL_KEYWORDS (func=<optimized out>, args=0x7bc711d62cc8, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.11/Objects/methodobject.c:438
#15 0x0000612acf7a56f0 in PyCFunction_Call (kwargs=0x0, args=0x7bc711d62cb0, callable=0x7bc7101645e0) at /usr/local/src/conda/python-3.12.11/Objects/call.c:387
#16 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7bc7174172e0, throwflag=<optimized out>) at Python/bytecodes.c:3263
#17 0x0000612acf9421a1 in PyEval_EvalCode (co=co@entry=0x612ad1cafbd0, globals=globals@entry=0x7bc71730d380, locals=locals@entry=0x7bc71730d380) at /usr/local/src/conda/python-3.12.11/Python/ceval.c:580
#18 0x0000612acf97c8ca in run_eval_code_obj (tstate=tstate@entry=0x612acfd1a910 <_PyRuntime+458992>, co=co@entry=0x612ad1cafbd0, globals=globals@entry=0x7bc71730d380, locals=locals@entry=0x7bc71730d380) at /usr/local/src/conda/python-3.12.11/Python/pythonrun.c:1757
#19 0x0000612acf977585 in run_mod (mod=mod@entry=0x612ad1d38700, filename=filename@entry=0x7bc71726ced0, globals=globals@entry=0x7bc71730d380, locals=locals@entry=0x7bc71730d380, flags=flags@entry=0x7ffc6cac87e0, arena=arena@entry=0x7bc71722fcb0)
    at /usr/local/src/conda/python-3.12.11/Python/pythonrun.c:1778
#20 0x0000612acf974620 in pyrun_file (fp=fp@entry=0x612ad1c414d0, filename=filename@entry=0x7bc71726ced0, start=start@entry=257, globals=globals@entry=0x7bc71730d380, locals=locals@entry=0x7bc71730d380, closeit=closeit@entry=1, flags=0x7ffc6cac87e0)
    at /usr/local/src/conda/python-3.12.11/Python/pythonrun.c:1674
#21 0x0000612acf9742be in _PyRun_SimpleFileObject (fp=0x612ad1c414d0, filename=0x7bc71726ced0, closeit=1, flags=0x7ffc6cac87e0) at /usr/local/src/conda/python-3.12.11/Python/pythonrun.c:459
#22 0x0000612acf973fe4 in _PyRun_AnyFileObject (fp=0x612ad1c414d0, filename=filename@entry=0x7bc71726ced0, closeit=closeit@entry=1, flags=flags@entry=0x7ffc6cac87e0) at /usr/local/src/conda/python-3.12.11/Python/pythonrun.c:78
#23 0x0000612acf970eb2 in pymain_run_file_obj (skip_source_first_line=0, filename=0x7bc71726ced0, program_name=0x7bc71725d990) at /usr/local/src/conda/python-3.12.11/Modules/main.c:361
#24 pymain_run_file (config=0x612acfcbd4f0 <_PyRuntime+77008>) at /usr/local/src/conda/python-3.12.11/Modules/main.c:380
#25 pymain_run_python (exitcode=0x7ffc6cac87b4) at /usr/local/src/conda/python-3.12.11/Modules/main.c:634
#26 Py_RunMain () at /usr/local/src/conda/python-3.12.11/Modules/main.c:714
#27 0x0000612acf92c247 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.12.11/Modules/main.c:768
#28 0x00007bc71702a1ca in __libc_start_call_main (main=main@entry=0x612acf92c190 <main>, argc=argc@entry=20, argv=argv@entry=0x7ffc6cac8a48) at ../sysdeps/nptl/libc_start_call_main.h:58
#29 0x00007bc71702a28b in __libc_start_main_impl (main=0x612acf92c190 <main>, argc=20, argv=0x7ffc6cac8a48, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc6cac8a38) at ../csu/libc-start.c:360
#30 0x0000612acf92c0ed in _start ()

The notable section is:

#1  0x00007bc715085c36 in grpc_core::Server::RequestCall(grpc_call**, grpc_call_details*, grpc_metadata_array*, grpc_completion_queue*, grpc_completion_queue*, void*) () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/ray/_raylet.so
#2  0x00007bc6e71521eb in grpc_server_request_call () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/grpc/_cython/../../../../libgrpc.so.48
#3  0x00007bc711344c17 in __pyx_gb_4grpc_7_cython_6cygrpc_9AioServer_10generator38(__pyx_CoroutineObject*, _ts*, _object*) () from /home/amorton/gh/ray-packages-feedstock/.venv/lib/python3.12/site-packages/grpc/_cython/cygrpc.cpython-312-x86_64-linux-gnu.so

cygrpc.cpython-312-x86_64-linux-gnu.so calls into the libgrpc.so C API, which then attempts to use the grpc C++ API; those calls resolve to the conflicting copies of the symbols statically linked into _raylet.so.

@timkpaine, it turns out this isn't true:

Hopefully all third-party symbols are hidden upstream; if not, that is probably easy to accomplish.

Any ideas how to resolve this upstream?

As it stands, this means any Python process that loads _raylet.so will preferentially use the statically linked libgrpc 1.67 that we vendor in.

I fear this may also be true for other libraries vendored into _raylet.so.
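
(For completeness: the interposition can also be seen directly from the dynamic linker without waiting for a crash. A sketch, glibc only, assuming import ray is what pulls in _raylet.so:)

# force eager binding and have the loader log every symbol resolution;
# RequestCall resolving into _raylet.so instead of libgrpc.so.48 confirms it
LD_BIND_NOW=1 LD_DEBUG=bindings python -c "import ray, grpc" 2>&1 | grep RequestCall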

@mattip (Contributor, Author) commented Sep 4, 2025

Ahh, cool debugging. Do you mean the test failure in ray-project/ray#51673? Curious how you got a debug stack out of the tests?

grpc_core::Server::RequestCall seems to be an internal grpc symbol that somehow the bazel build of grpc is not hiding; at least I couldn't find a direct use of RequestCall in ray. I wonder if we could add compiler directives to hide the symbols in the upstream build of grpc. I don't see, for instance, -fvisibility=hidden in any bazel-related file in the grpc/grpc codebase at v1.67.1.
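
Another knob that might be worth trying (an assumption on my part, not something ray or grpc does today) is hiding the static archives at the final link of _raylet.so rather than patching grpc's build:

# GNU ld/gold: don't re-export symbols that come from statically linked .a
# archives; this would need to reach the _raylet link line (e.g. via a bazel linkopt)
export LDFLAGS="${LDFLAGS} -Wl,--exclude-libs,ALL"

If that works, the vendored grpc_core symbols should stop showing up in _raylet.so's dynamic symbol table.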

@apmorton (Contributor) commented Sep 4, 2025

Ahh, cool debugging. Do you mean the test failure in ray-project/ray#51673? Curious how you got a debug stack out of the tests?

I mean the following fails:

ray start --head
python -c "import ray; ray.init('ray://127.0.0.1:10001')"

ray start --head works, but the ray.init call eventually times out.

Investigating the ray logs reveals a message about the raylet falling over:

[2025-09-04 07:11:12,503 I 382375 382417] (raylet) agent_manager.cc:82: Agent process with name dashboard_agent exited, exit code 0.
[2025-09-04 07:11:12,503 E 382375 382417] (raylet) agent_manager.cc:86: The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent.
The raylet fate shares with the agent. This can happen because
- The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
- The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure.
- The agent is killed by the OS (e.g., out of memory).
[2025-09-04 07:11:12,503 I 382375 382375] (raylet) main.cc:339: Raylet graceful shutdown triggered, reason = UNEXPECTED_TERMINATION, reason message = dashboard_agent failed and raylet fate-shares with it.
[2025-09-04 07:11:12,503 I 382375 382375] (raylet) main.cc:342: Shutting down...

sudo dmesg shows a segfault:

[142140.073762] python3.12[382416]: segfault at 0 ip 00007bc715083f99 sp 00007ffc6cac7dd0 error 4 in _raylet.so[7bc713e00000+1a31000] likely on CPU 0 (core 0, socket 0)

apport (on my Linux system) produced a core file, and loading it in gdb gave that backtrace:

gdb /home/amorton/gh/ray-packages-feedstock/.venv/bin/python3.12 /var/lib/apport/coredump/core._home_amorton_gh_ray-packages-feedstock__venv_bin_python3_12.5061.f05043c6-9fc8-4f90-9051-7ea3e3cc3652.382416.14213954
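
(The same backtrace should be obtainable without apport; a rough sketch, assuming core files end up where kernel.core_pattern or systemd-coredump puts them:)

ulimit -c unlimited        # allow core dumps for processes started from this shell
ray start --head
python -c "import ray; ray.init('ray://127.0.0.1:10001')"
coredumpctl gdb            # on systemd systems; then `bt` at the gdb prompt
ray stop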

Successfully merging this pull request may close these issues: Redo boost unvendoring.