Conversation

vaishdho1

Why are these changes needed?

Currently, when Serve file logs are buffered via a MemoryHandler, ServeContextFilter fetches the Serve request context at flush time instead of when the log record is emitted. As a result, many log records flushed together can share the same request context, breaking per-request tracing.
This PR captures the request context at emit time when buffering is enabled and makes the filter idempotent so it won't overwrite pre-populated fields. This preserves the correct per-record context for buffered file logs without changing non-buffered behavior.
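To illustrate the idempotency part, here is a minimal sketch (not the actual Ray Serve implementation; the context lookup below is a hypothetical placeholder) of a filter that only fills in fields that were not already captured at emit time:

import logging


def _current_request_id() -> str:
    """Placeholder for Serve's request-context lookup (hypothetical)."""
    return "dummy-request-id"


class IdempotentContextFilter(logging.Filter):
    """Attach request context unless the record was already stamped at emit time."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Records buffered by a MemoryHandler may already carry a request_id that
        # was captured when the record was emitted; only fill it in if missing.
        if getattr(record, "request_id", None) is None:
            record.request_id = _current_request_id()
        return True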

Related issue number

Closes #55851

Performance Testing

Manual verification: benchmarked both the buffered and non-buffered cases, with and without the fix.
Load generation: Locust with 100 users for a duration of 3-4 minutes.

Without buffering:
With fix: Avg 396.69 ms, P99 580 ms, RPS 228.4
Without fix: Avg 391.29 ms, P99 560 ms, RPS 239

With buffering (RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE = 1000):
With fix: Avg 400.83 ms, P99 620 ms, RPS 230.5
Without fix: Avg 373.25 ms, P99 610 ms, RPS 249.4

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@vaishdho1 vaishdho1 requested a review from a team as a code owner August 29, 2025 20:31
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the issue of buffered logs reusing request contexts by capturing the context at emit time. The approach of wrapping logger methods for buffered logs is sound, and making ServeContextFilter idempotent is a necessary change. My feedback includes a couple of suggestions to improve the robustness of the new wrap_logger_for_buffering function to prevent potential side effects and make its signature more explicit.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Vaishnavi Panchavati <[email protected]>
@ray-gardener ray-gardener bot added serve Ray Serve Related Issue community-contribution Contributed by the community labels Aug 30, 2025
@abrarsheikh
Contributor

Thanks for the analysis on this.

When I compare:

  1. Buffering with fix: Avg 400.83 ms, P99 620 ms, RPS 230.5
  2. No buffering without fix: Avg 391.29 ms, P99 560 ms, RPS 239

I conclude that it's better to remove the MemoryHandler than to apply this fix. The reason the MemoryHandler was added was to improve performance. Can you think of a more optimal solution? If not, I suggest we drop the MemoryHandler. @akyang-anyscale, any other ideas?

@vaishdho1
Author

I ran these tests 4-5 more times and found the following stats.

Case                 Requests   Avg (ms)   P99 (ms)   RPS
Buffer-original      33372      345.32     540        291
Buffer-fix           31857      349.41     550        281.9
Nonbuffer-original   38024      381        560        253.6
Nonbuffer-fix        31662      376.73     550        249.7

Latency differs slightly between runs, so these were the best figures after 4-5 runs.

  • There is another method I can think of. We can add the ServeContextFilter() to the MemoryHandler for the buffered case instead of adding it to the file handler directly. With the filter on the MemoryHandler, the context is attached in the buffered case as well. But I need to check the latency here.

  • Another method is to use the LogRecord factory and add the context to the log record. Similar logic is already implemented inside https://github.com/ray-project/ray/blob/master/python/ray/_private/log.py#L71 for adding a custom time to all logs. Something similar could be used inside Serve for the context. But I am not sure how this will affect the flow; I need to look at this. A rough sketch of the idea follows below.
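A rough sketch of that record-factory idea, assuming a generic context source (the ContextVar below is a stand-in for Serve's real request context, not the actual implementation):

import contextvars
import logging

# Stand-in for Serve's per-request context (hypothetical).
_request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "serve_request_id", default=""
)

_old_factory = logging.getLogRecordFactory()


def _record_factory(*args, **kwargs) -> logging.LogRecord:
    record = _old_factory(*args, **kwargs)
    # The context is captured when the record is created (emit time), so a
    # MemoryHandler flushing later cannot smear one context across many records.
    record.request_id = _request_id_var.get()
    return record


logging.setLogRecordFactory(_record_factory)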

I also have a couple of questions here:

  1. The example I checked is a very small application with just 2 deployments. For a bigger application with more deployments and more logging, the latency without buffering might increase, right? So the memory handler could help in that case.
  2. What are the thresholds for the different metrics beyond which we decide the latency regression is too large?

@vaishdho1
Author

I have analyzed adding ServeContextFilter directly to the memory handler.
In this case, the filter is attached to the MemoryHandler instead of the file handler, since records always pass through the MemoryHandler before reaching the file handler in both the buffered and non-buffered cases.

I removed the code that adds the filter to the file handler and instead attach it to the MemoryHandler:

if (
    logging_config.enable_access_log
    or RAY_SERVE_ENABLE_JSON_LOGGING
    or logging_config.encoding == EncodingType.JSON
):
    memory_handler.addFilter(ServeContextFilter())

I feel this method is more robust: the MemoryHandler's filters run when the record is first handled (i.e., at emit time, before it is buffered), and we are not explicitly adding wrappers around specific logging levels.

The latency comparison is shown below:

Method                    Avg (ms)   P99 (ms)   RPS
Original + buffering      324.68     550        280.9
Original + no buffering   373.72     550        250.9
Fix + buffering           330        530        307.6
Fix + no buffering        359.83     540        272.4

@vaishdho1
Author

Fixed the code with the changes and benchmarked the results here.

Method                    Requests   Avg (ms)   P99 (ms)   RPS
Buffering + fix           25643      269.8      440        350.1
No buffering + fix        24844      308        460        305.4
Buffering + original      25582      271.6      460        335
No buffering + original   24818      303.94     470        303.94

@abrarsheikh
Contributor

Any explanation for why Buffering+fix performs better?

Let's add a test in test_logging_utils.py to make sure that when buffering is used, request_id is not duplicated.

@vaishdho1
Author

The performance of buffer+original and buffer+fix is almost identical. Sometimes one performs better than the other, but they are very close. I don't see a concrete reason for the difference, because they are effectively doing the same thing, just in a different order.

For the test case, I am thinking of implementing a small deployment with logging, sending more requests than the buffer size with buffering enabled, and then counting the occurrences of replica IDs in the generated logs (system and application). A rough sketch of what I have in mind is below.
Would this be sufficient for the use case?
I can add this directly under ray/serve/tests/test_logging.py.
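Roughly the counting step I have in mind; a sketch only, where the log directory, file pattern, and the field name (request_id used here as an example) are assumptions rather than the verified Serve log layout:

import json
from collections import Counter
from pathlib import Path


def count_context_ids(log_dir: str, field: str = "request_id") -> Counter:
    """Count how often each value of `field` appears across JSON-encoded log lines."""
    counts: Counter = Counter()
    for log_file in Path(log_dir).glob("*.log"):
        for line in log_file.read_text().splitlines():
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines (e.g. plain-text system logs)
            value = entry.get(field)
            if value:
                counts[value] += 1
    return counts

# With N requests and a fixed number of log lines per request, every id should
# appear that fixed number of times; much larger counts would indicate reuse.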

@vaishdho1
Author

Added a test that checks for reuse of request IDs in the buffering case.

request_id = log_entry.get("request_id", None)
if request_id:
    log_request_ids.append(request_id)
# Verify no excessive duplication
Contributor


Why tolerate any duplication?

Author

@vaishdho1 vaishdho1 Sep 4, 2025


The logs also print each log message here. Since I have 2 log messages from the application and one access log message for each request ID, we get about 3 occurrences every time. This buffering is only seen with the application logs.

=== Request ID: d5587f91-88db-4c5b-95aa-a7139c081ee1 (appears 3 times) ===
   1 Additional log entry
   1 GET / 200 0.5ms
   1 Processing request

=== Request ID: d9c323cb-67f0-4de1-8658-8c43091dca18 (appears 3 times) ===
   1 Additional log entry
   1 GET / 200 0.7ms
   1 Processing request

=== Request ID: da5a5dac-c07e-4522-9b22-bbc7a91fe95b (appears 3 times) ===
   1 Additional log entry
   1 GET / 200 0.6ms
   1 Processing request

=== Request ID: dd74bfc1-9589-4835-ab23-cb5f4d5d8f2c (appears 3 times) ===
   1 Additional log entry
   1 GET / 200 0.6ms
   1 Processing request

Contributor


Gotcha. Then it's better to assert:

  1. the exact count instead of assert count <= 5, and
  2. that each of the 3 log messages has unique contents.

This will make the test more robust and easier to read.
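A sketch of what those assertions could look like, using example data shaped like the output above (the messages_by_request_id mapping is hypothetical and would be built while parsing the logs):

# Example shape: request_id -> list of log messages parsed for that request.
messages_by_request_id = {
    "d5587f91-88db-4c5b-95aa-a7139c081ee1": [
        "Processing request",
        "Additional log entry",
        "GET / 200 0.5ms",
    ],
}

for request_id, messages in messages_by_request_id.items():
    # Exactly 3 records per request: 2 application logs + 1 access log.
    assert len(messages) == 3, f"{request_id} appeared {len(messages)} times"
    # All 3 messages should be distinct, i.e. no record was duplicated.
    assert len(set(messages)) == 3, f"duplicate messages for {request_id}"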


serve.run(TestApp.bind())

for _ in range(150 if buffer_size > 1 else 20):
Contributor


Use the same number of requests regardless; I see no harm.

Author


I will fix this to a single number.

def test_request_id_uniqueness_with_buffering(buffer_size, monkeypatch):
    """Test request IDs are unique when buffering is enabled."""

    monkeypatch.setenv("RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE", str(buffer_size))
Contributor


I don't think setting the env variable here works; you should be able to verify this by logging the buffer size inside Serve. Essentially, setting the env variable after Serve has already started has no effect.
See test_configure_default_serve_logger_with_stderr_redirect for how to set the env variable correctly. Something like:

@pytest.mark.parametrize(
    "ray_instance",
    [
        {"RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE": "100"},
    ],
    indirect=True,
)

Development

Successfully merging this pull request may close these issues.

[Serve] Using RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE reuses replica context across logs