Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Sync] Sync master branch. #457

Open
wants to merge 45 commits into
base: main
Choose a base branch
from
Open

[Sync] Sync master branch. #457

wants to merge 45 commits into from

Conversation

wumuzi520
Copy link
Collaborator

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

edoakes and others added 30 commits January 7, 2025 15:59
…9695)

I don't know how the [PR](ray-project#49595)
passed tests without this...

Signed-off-by: Edward Oakes <[email protected]>
## Why are these changes needed?

Create request metadata for a new request through new function
`get_request_metadata`.


---------

Signed-off-by: Cindy Zhang <[email protected]>
…49612)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Refactor the code so it can be overwritten in the ImageURIPlugin

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
…project#49697)

## Why are these changes needed?

State was being leaked across calls to `serve.run` due to in-place
mutation within `get_deploy_args`.

I've moved the logic into `build_app` and added associated unit tests.
Also added an integration test matching the original bug report.

## Related issue number

Closes ray-project#49074

---------

Signed-off-by: Edward Oakes <[email protected]>
…9573)

Current operator fusion rule doesn't consider concurrency. 
This PR fixes this issue by only allow fusing 2 operators when they have
the same concurrency.
For task->actor, we allow fusion when task's concurrency = actor's upper
bound.
Also fixed some type hinting issues regarding compute strategy. 

---------

Signed-off-by: Hao Chen <[email protected]>
Before we prepare cgroup V2 environment at
ray-project#48828 and setup cgroup for
process at ray-project#48788, we should
first check whether cgroupv2 is properly mounted.

---------

Signed-off-by: dentiny <[email protected]>
Today we create `DashboardHeadModule`s with a `DashboardHead`. In
preparation to support a multi process Dashboard, we need to use
something serializable to create Heads.

Introducing a DashboardHeadModuleConfig class to keep all strings and
ints a Head needs to start up. Lazily creates these non-serializable
objects in the DashboardHeadModule base class:

- `gcs_client`, `gcs_aio_client`
- `aiogrpc_gcs_channel`
- `http_session`

One exception is made for `metrics`, which is not serializable. Now we
pass the metrics from DashboardHead to the Config and it's used only in
MetricsHead. We can:

1. keep MetricsHead a non-Actor Head forever, to emit metrics in the
dashboard metrics export port, OR
2. redefine `DashboardMetricsAddress` to be multiple values, and let
each Head to export their own ports, OR
3. redefine Dashboard metrics as [Ray Application Level
Metrics](https://docs.ray.io/en/latest/ray-observability/user-guides/add-app-metrics.html#application-level-metrics).
This may need some tweaks in ray.util.metrics to make sure the Component
is not `core_worker` but `dashboard.`

---------

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
ray-project#48838)

I spent some time reading through the code, this PR passes "physical
mode" ray cluster parameter from command line to raylet, which is
required to check cgroup availability and setup preparation.

This PR only adds parameter to existing code, but not enabled; so it's a
no-op to production.

---------

Signed-off-by: dentiny <[email protected]>
Signed-off-by: hjiang <[email protected]>
…roject#49552)

## Why are these changes needed?

Return `DeploymentUnavailableError` if deployment replicas failed to
start multiple times and transitioned to `DEPLOY_FAILED`.

To accomplish this, we add a flag (indicating whether deployment is
available or not) that's broadcasted along with running replica infos to
routers.
```
@DataClass(frozen=True)
class DeploymentTargetInfo:
    is_available: bool
    running_replicas: List[RunningReplicaInfo]
```

## Related issue number

Closes ray-project#48654


---------

Signed-off-by: Cindy Zhang <[email protected]>
Small isolated PR to (hopefully) fix flakiness issues with
`python/ray/serve/tests/test_deploy.py::test_reconfigure_with_queries`,
noticed while working on ray-project#48807 and other PRs.

---------

Signed-off-by: Josh Karpel <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
so that the dependency are naturally cross platform, and conforms with python standards.

---------

Signed-off-by: Ben Mares <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
dentiny and others added 15 commits January 8, 2025 21:22
I mistakenly approved ray-project#49513
before making copy edits. This PR is a mulligan.

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: angelinalg <[email protected]>
Signed-off-by: Sven Mika <[email protected]>
Signed-off-by: angelinalg <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
…49719)


Probably not perfect, but a small patch fix...

Without this:
```
> serve run main:app FOO='abc=def'
Error: Invalid key-value string 'FOO=abc=def'. Must be of the form 'key=value'
```

---------

Signed-off-by: Edward Oakes <[email protected]>
The read_parquet implementation might infer the schema from the first file. This PR adds a test to ensure that implementation handles the case where the first file has no data and the inferred type is null.


---------

Signed-off-by: Balaji Veeramani <[email protected]>
## Why are these changes needed?

Tunes the Python garbage collector to reduce its frequency running in
the proxy by default. A feature flag is introduced to disable this
behavior and/or tune the threshold.

## Benchmarks

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=100,
    num_replicas=16,
    ray_actor_options={"num_cpus": 0},
)
class A:
    async def __call__(self):
        return b"hi"

app = A.bind()
```

```
ab -n 10000 -c 100 http://127.0.0.1:8000/
```

No optimization:
```
Concurrency Level:      100
Time taken for tests:   11.985 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      1910000 bytes
HTML transferred:       120000 bytes
Requests per second:    834.34 [#/sec] (mean)
Time per request:       119.855 [ms] (mean)
Time per request:       1.199 [ms] (mean, across all concurrent requests)
Transfer rate:          155.62 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0      28
Processing:     5  119  30.3    121     227
Waiting:        3  118  30.2    120     225
Total:          5  120  30.2    121     227

Percentage of the requests served within a certain time (ms)
  50%    121
  66%    128
  75%    135
  80%    141
  90%    158
  95%    173
  98%    189
  99%    196
 100%    227 (longest request)
```

GC freeze only:
```
Concurrency Level:      100
Time taken for tests:   11.838 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      1910000 bytes
HTML transferred:       120000 bytes
Requests per second:    844.72 [#/sec] (mean)
Time per request:       118.383 [ms] (mean)
Time per request:       1.184 [ms] (mean, across all concurrent requests)
Transfer rate:          157.56 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.7      0      28
Processing:     5  117  31.5    119     302
Waiting:        3  116  31.5    118     300
Total:          5  118  31.5    119     303

Percentage of the requests served within a certain time (ms)
  50%    119
  66%    127
  75%    134
  80%    138
  90%    151
  95%    165
  98%    184
  99%    230
 100%    303 (longest request)
```

GC threshold set to `10_000` (*default after this PR*):
```
Concurrency Level:      100
Time taken for tests:   11.223 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      1910000 bytes
HTML transferred:       120000 bytes
Requests per second:    891.00 [#/sec] (mean)
Time per request:       112.233 [ms] (mean)
Time per request:       1.122 [ms] (mean, across all concurrent requests)
Transfer rate:          166.19 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.5      0      23
Processing:     5  111  26.9    116     202
Waiting:        2  110  27.0    115     199
Total:          5  112  26.8    116     202

Percentage of the requests served within a certain time (ms)
  50%    116
  66%    124
  75%    128
  80%    132
  90%    146
  95%    154
  98%    164
  99%    169
 100%    202 (longest request)
```

GC threshold set to `100_000`:
```
Concurrency Level:      100
Time taken for tests:   11.481 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      1910000 bytes
HTML transferred:       120000 bytes
Requests per second:    870.98 [#/sec] (mean)
Time per request:       114.813 [ms] (mean)
Time per request:       1.148 [ms] (mean, across all concurrent requests)
Transfer rate:          162.46 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       3
Processing:     5  114  25.0    112     256
Waiting:        2  113  25.0    111     254
Total:          5  114  24.9    112     256

Percentage of the requests served within a certain time (ms)
  50%    112
  66%    116
  75%    119
  80%    123
  90%    150
  95%    159
  98%    185
  99%    201
 100%    256 (longest request)
```

---------

Signed-off-by: Edward Oakes <[email protected]>
The iter_batches release test doesn't actually execute the dataset.

---------

Signed-off-by: Balaji Veeramani <[email protected]>
I just realized I made a mistake:
- In the prev cleanup PR
(ray-project#49739), I unified `cc_*` to
`ray_cc_*` with `sed`, but I forgot to remove the copts;
- Default copts have already been applied, no need to re-apply again at
BUILD rule,
https://github.com/ray-project/ray/blob/33a61aaa9f50be4e1572e732dfe3e92640a056cf/bazel/ray.bzl#L169-L176

Signed-off-by: dentiny <[email protected]>
I forgot to check the result for writing PID into cgroup files.

---------

Signed-off-by: dentiny <[email protected]>
@wumuzi520 wumuzi520 self-assigned this Jan 10, 2025
@wumuzi520 wumuzi520 requested a review from NKcqx January 10, 2025 01:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.