[Sync] Sync master branch. #457

wumuzi520 · 2025-01-10T01:47:02Z

Why are these changes needed?

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

…9695) I don't know how the [PR](ray-project#49595) passed tests without this... Signed-off-by: Edward Oakes <[email protected]>

Signed-off-by: dentiny <[email protected]>

…nalKVPut (ray-project#49583) Signed-off-by: dayshah <[email protected]>

…oject#49588) Signed-off-by: dayshah <[email protected]>

## Why are these changes needed? Create request metadata for a new request through new function `get_request_metadata`. --------- Signed-off-by: Cindy Zhang <[email protected]>

…49612)   ## Why are these changes needed? Refactor the code so it can be overwritten in the ImageURIPlugin ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :(

…on (ray-project#49591) Signed-off-by: Rui Qiao <[email protected]>

…project#49697) ## Why are these changes needed? State was being leaked across calls to `serve.run` due to in-place mutation within `get_deploy_args`. I've moved the logic into `build_app` and added associated unit tests. Also added an integration test matching the original bug report. ## Related issue number Closes ray-project#49074 --------- Signed-off-by: Edward Oakes <[email protected]>

Signed-off-by: dentiny <[email protected]>

…9573) Current operator fusion rule doesn't consider concurrency. This PR fixes this issue by only allow fusing 2 operators when they have the same concurrency. For task->actor, we allow fusion when task's concurrency = actor's upper bound. Also fixed some type hinting issues regarding compute strategy. --------- Signed-off-by: Hao Chen <[email protected]>

Signed-off-by: dentiny <[email protected]>

…ay-project#49687)

…et() (ray-project#49461) Signed-off-by: Rui Qiao <[email protected]>

ray-project#49504)

Before we prepare cgroup V2 environment at ray-project#48828 and setup cgroup for process at ray-project#48788, we should first check whether cgroupv2 is properly mounted. --------- Signed-off-by: dentiny <[email protected]>

Signed-off-by: dentiny <[email protected]>

…s if no resource limits provided (ray-project#49613) Signed-off-by: Rueian <[email protected]>

Today we create `DashboardHeadModule`s with a `DashboardHead`. In preparation to support a multi process Dashboard, we need to use something serializable to create Heads. Introducing a DashboardHeadModuleConfig class to keep all strings and ints a Head needs to start up. Lazily creates these non-serializable objects in the DashboardHeadModule base class: - `gcs_client`, `gcs_aio_client` - `aiogrpc_gcs_channel` - `http_session` One exception is made for `metrics`, which is not serializable. Now we pass the metrics from DashboardHead to the Config and it's used only in MetricsHead. We can: 1. keep MetricsHead a non-Actor Head forever, to emit metrics in the dashboard metrics export port, OR 2. redefine `DashboardMetricsAddress` to be multiple values, and let each Head to export their own ports, OR 3. redefine Dashboard metrics as [Ray Application Level Metrics](https://docs.ray.io/en/latest/ray-observability/user-guides/add-app-metrics.html#application-level-metrics). This may need some tweaks in ray.util.metrics to make sure the Component is not `core_worker` but `dashboard.` --------- Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>

ray-project#48838) I spent some time reading through the code, this PR passes "physical mode" ray cluster parameter from command line to raylet, which is required to check cgroup availability and setup preparation. This PR only adds parameter to existing code, but not enabled; so it's a no-op to production. --------- Signed-off-by: dentiny <[email protected]> Signed-off-by: hjiang <[email protected]>

Signed-off-by: Jiajun Yao <[email protected]>

…ay-project#49130) Signed-off-by: Hollow Man <[email protected]>

…oject#49693)

Signed-off-by: dentiny <[email protected]>

…roject#49552) ## Why are these changes needed? Return `DeploymentUnavailableError` if deployment replicas failed to start multiple times and transitioned to `DEPLOY_FAILED`. To accomplish this, we add a flag (indicating whether deployment is available or not) that's broadcasted along with running replica infos to routers. ``` @DataClass(frozen=True) class DeploymentTargetInfo: is_available: bool running_replicas: List[RunningReplicaInfo] ``` ## Related issue number Closes ray-project#48654 --------- Signed-off-by: Cindy Zhang <[email protected]>

Small isolated PR to (hopefully) fix flakiness issues with `python/ray/serve/tests/test_deploy.py::test_reconfigure_with_queries`, noticed while working on ray-project#48807 and other PRs. --------- Signed-off-by: Josh Karpel <[email protected]> Co-authored-by: Edward Oakes <[email protected]>

so that the dependency are naturally cross platform, and conforms with python standards. --------- Signed-off-by: Ben Mares <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>

…oject#49729) otherwise the wheel does not compile. Signed-off-by: dayshah <[email protected]>

…49723) Actor name is useful information to show in visualization. Before: ![compiled_graph0](https://github.com/user-attachments/assets/0c7c6bfd-7c4c-45ee-a7bf-b6073c3a1bad) After: ![compiled_graph](https://github.com/user-attachments/assets/56e3c4e5-ab1d-4dc1-86b0-fcba0073793d) Signed-off-by: Rui Qiao <[email protected]>

…9596) Signed-off-by: dentiny <[email protected]>

…Clusters on Kubernetes (ray-project#49498) Signed-off-by: win5923 <[email protected]>

Signed-off-by: dentiny <[email protected]>

… some code cleanups. (ray-project#49714)

I mistakenly approved ray-project#49513 before making copy edits. This PR is a mulligan. ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: angelinalg <[email protected]> Signed-off-by: Sven Mika <[email protected]> Signed-off-by: angelinalg <[email protected]> Co-authored-by: Sven Mika <[email protected]>

…49719) Probably not perfect, but a small patch fix... Without this: ``` > serve run main:app FOO='abc=def' Error: Invalid key-value string 'FOO=abc=def'. Must be of the form 'key=value' ``` --------- Signed-off-by: Edward Oakes <[email protected]>

…roject#49503) Signed-off-by: dentiny <[email protected]>

Signed-off-by: dentiny <[email protected]>

A rewrite of ray-project#48788 --------- Signed-off-by: dentiny <[email protected]>

The read_parquet implementation might infer the schema from the first file. This PR adds a test to ensure that implementation handles the case where the first file has no data and the inferred type is null. --------- Signed-off-by: Balaji Veeramani <[email protected]>

Signed-off-by: dentiny <[email protected]>

## Why are these changes needed? Tunes the Python garbage collector to reduce its frequency running in the proxy by default. A feature flag is introduced to disable this behavior and/or tune the threshold. ## Benchmarks ```python from ray import serve @serve.deployment( max_ongoing_requests=100, num_replicas=16, ray_actor_options={"num_cpus": 0}, ) class A: async def __call__(self): return b"hi" app = A.bind() ``` ``` ab -n 10000 -c 100 http://127.0.0.1:8000/ ``` No optimization: ``` Concurrency Level: 100 Time taken for tests: 11.985 seconds Complete requests: 10000 Failed requests: 0 Total transferred: 1910000 bytes HTML transferred: 120000 bytes Requests per second: 834.34 [#/sec] (mean) Time per request: 119.855 [ms] (mean) Time per request: 1.199 [ms] (mean, across all concurrent requests) Transfer rate: 155.62 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 0 0 0.6 0 28 Processing: 5 119 30.3 121 227 Waiting: 3 118 30.2 120 225 Total: 5 120 30.2 121 227 Percentage of the requests served within a certain time (ms) 50% 121 66% 128 75% 135 80% 141 90% 158 95% 173 98% 189 99% 196 100% 227 (longest request) ``` GC freeze only: ``` Concurrency Level: 100 Time taken for tests: 11.838 seconds Complete requests: 10000 Failed requests: 0 Total transferred: 1910000 bytes HTML transferred: 120000 bytes Requests per second: 844.72 [#/sec] (mean) Time per request: 118.383 [ms] (mean) Time per request: 1.184 [ms] (mean, across all concurrent requests) Transfer rate: 157.56 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 0 0 0.7 0 28 Processing: 5 117 31.5 119 302 Waiting: 3 116 31.5 118 300 Total: 5 118 31.5 119 303 Percentage of the requests served within a certain time (ms) 50% 119 66% 127 75% 134 80% 138 90% 151 95% 165 98% 184 99% 230 100% 303 (longest request) ``` GC threshold set to `10_000` (*default after this PR*): ``` Concurrency Level: 100 Time taken for tests: 11.223 seconds Complete requests: 10000 Failed requests: 0 Total transferred: 1910000 bytes HTML transferred: 120000 bytes Requests per second: 891.00 [#/sec] (mean) Time per request: 112.233 [ms] (mean) Time per request: 1.122 [ms] (mean, across all concurrent requests) Transfer rate: 166.19 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 0 0 0.5 0 23 Processing: 5 111 26.9 116 202 Waiting: 2 110 27.0 115 199 Total: 5 112 26.8 116 202 Percentage of the requests served within a certain time (ms) 50% 116 66% 124 75% 128 80% 132 90% 146 95% 154 98% 164 99% 169 100% 202 (longest request) ``` GC threshold set to `100_000`: ``` Concurrency Level: 100 Time taken for tests: 11.481 seconds Complete requests: 10000 Failed requests: 0 Total transferred: 1910000 bytes HTML transferred: 120000 bytes Requests per second: 870.98 [#/sec] (mean) Time per request: 114.813 [ms] (mean) Time per request: 1.148 [ms] (mean, across all concurrent requests) Transfer rate: 162.46 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 0 0 0.3 0 3 Processing: 5 114 25.0 112 256 Waiting: 2 113 25.0 111 254 Total: 5 114 24.9 112 256 Percentage of the requests served within a certain time (ms) 50% 112 66% 116 75% 119 80% 123 90% 150 95% 159 98% 185 99% 201 100% 256 (longest request) ``` --------- Signed-off-by: Edward Oakes <[email protected]>

The iter_batches release test doesn't actually execute the dataset. --------- Signed-off-by: Balaji Veeramani <[email protected]>

I just realized I made a mistake: - In the prev cleanup PR (ray-project#49739), I unified `cc_*` to `ray_cc_*` with `sed`, but I forgot to remove the copts; - Default copts have already been applied, no need to re-apply again at BUILD rule, https://github.com/ray-project/ray/blob/33a61aaa9f50be4e1572e732dfe3e92640a056cf/bazel/ray.bzl#L169-L176 Signed-off-by: dentiny <[email protected]>

I forgot to check the result for writing PID into cgroup files. --------- Signed-off-by: dentiny <[email protected]>

edoakes and others added 30 commits January 7, 2025 15:59

[serve] Fix type hint in CreatePlacementGroupRequest (ray-project#4…

cf8d6cf

…9695) I don't know how the [PR](ray-project#49595) passed tests without this... Signed-off-by: Edward Oakes <[email protected]>

[core] Move example BUILD (ray-project#49611)

6d624d2

Signed-off-by: dentiny <[email protected]>

[core] Remove extra copies on InternalKVGets and return bool on Inter…

3066bbf

…nalKVPut (ray-project#49583) Signed-off-by: dayshah <[email protected]>

[core] Remove unused GetProfilingStats and job_agent protbufs (ray-pr…

8165b2d

…oject#49588) Signed-off-by: dayshah <[email protected]>

[serve] add get request metadata func (ray-project#49571)

60caa93

## Why are these changes needed? Create request metadata for a new request through new function `get_request_metadata`. --------- Signed-off-by: Cindy Zhang <[email protected]>

[core][compiled graphs] Add "experimental" to overlap_gpu_communicati…

249fb76

…on (ray-project#49591) Signed-off-by: Rui Qiao <[email protected]>

[core] [revert] Manually revert cpp example change (ray-project#49701)

d5c5082

Signed-off-by: dentiny <[email protected]>

Fix lint (ray-project#49702)

8f4c7c3

Signed-off-by: dentiny <[email protected]>

[RLlib] Add time between sample metric to MultiAgentEnvRunner. (r…

1420cbc

…ay-project#49687)

[core][compiled graphs] Throw unrecoverable actor exceptions at ray.g…

8b3bed9

…et() (ray-project#49461) Signed-off-by: Rui Qiao <[email protected]>

[RLlib] Docs do-over (new API stack): Re-write checkpointing rst page. (

0f60dad

ray-project#49504)

[core] [3/N] Check cgroupv2 mount status (ray-project#48833)

e7e097f

Before we prepare cgroup V2 environment at ray-project#48828 and setup cgroup for process at ray-project#48788, we should first check whether cgroupv2 is properly mounted. --------- Signed-off-by: dentiny <[email protected]>

[core] Add statusor macros (ray-project#49566)

e3440d3

Signed-off-by: dentiny <[email protected]>

[KubeRay] Extrace autoscaling config from Kubernetes resource request…

3491b89

…s if no resource limits provided (ray-project#49613) Signed-off-by: Rueian <[email protected]>

[Core] Report cluster config from autoscaler (ray-project#49568)

7ffad78

Signed-off-by: Jiajun Yao <[email protected]>

[Docs] Fix some misinformation in Ray Collective Communication Lib (r…

ec330bd

…ay-project#49130) Signed-off-by: Hollow Man <[email protected]>

[RLlib] Add MetricsLogger to connector- and pipeline-calls. (ray-pr…

d644135

…oject#49693)

Fix lint (ray-project#49710)

f2e1d4a

Signed-off-by: dentiny <[email protected]>

Avoid platform-specific metadata for grpcio (ray-project#49590)

68f8602

so that the dependency are naturally cross platform, and conforms with python standards. --------- Signed-off-by: Ben Mares <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>

[core] Fix PlasmaObjectHeader timeout_point types for windows (ray-pr…

8112b59

…oject#49729) otherwise the wheel does not compile. Signed-off-by: dayshah <[email protected]>

[core] Avoid re-parse serialized runtime env info json (ray-project#4…

879aa65

…9596) Signed-off-by: dentiny <[email protected]>

[Grafana] add Cluster variable to enable filtering of different Ray…

57d09ba

…Clusters on Kubernetes (ray-project#49498) Signed-off-by: win5923 <[email protected]>

dentiny and others added 15 commits January 8, 2025 21:22

[core] Fix default logging rotation size (ray-project#49726)

ea4e315

Signed-off-by: dentiny <[email protected]>

[RLlib] Fix train_batch_size_per_learner problems. (ray-project#49715)

afd024c

[RLlib] Add experimental option to suppress algo config validate();…

2ac04cc

… some code cleanups. (ray-project#49714)

[core] [3/N] Combine client registration and port announcement (ray-p…

33d025f

…roject#49503) Signed-off-by: dentiny <[email protected]>

[core] [no-op] Unify ray_cc* starlark function (ray-project#49739)

3b09d97

Signed-off-by: dentiny <[email protected]>

[core] Setup cgroup v2 in C++ (ray-project#49416)

e944d85

A rewrite of ray-project#48788 --------- Signed-off-by: dentiny <[email protected]>

Fix lint (ray-project#49746)

01060a5

Signed-off-by: dentiny <[email protected]>

[Data] Fix iter_batches release test (ray-project#49745)

10727d4

The iter_batches release test doesn't actually execute the dataset. --------- Signed-off-by: Balaji Veeramani <[email protected]>

[core] [easy] Check cgroup file write result (ray-project#49749)

5cfd6b5

I forgot to check the result for writing PID into cgroup files. --------- Signed-off-by: dentiny <[email protected]>

Merge branch 'master' into merge_master

46ee1e3

wumuzi520 requested review from zhenxuanpan, Chong-Li and xsuler January 10, 2025 01:47

wumuzi520 self-assigned this Jan 10, 2025

wumuzi520 requested a review from NKcqx January 10, 2025 01:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Sync] Sync master branch. #457

[Sync] Sync master branch. #457

wumuzi520 commented Jan 10, 2025

[Sync] Sync master branch. #457

Are you sure you want to change the base?

[Sync] Sync master branch. #457

Conversation

wumuzi520 commented Jan 10, 2025

Why are these changes needed?

Related issue number

Checks