🚀 Enhance GRPO VLLM server from sync to async and accelerate training #3182

binary-husky · 2025-03-30T13:22:40Z

Change the vLLM server from sync to async
- If is_async=True:
  the client first calls generate (non-blocking), then later calls get_future (with identical arguments) to get the result.
- If is_async=False:
  the client automatically calls get_future inside generate, blocking further execution until generation is complete.
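
To illustrate the two modes, here is a minimal sketch and not the code from this PR. `ToyAsyncClient` and `_fake_generation` are hypothetical names, and a thread pool stands in for the real request to the vLLM server. The only parts taken from the description are the `generate` / `get_future` pair, the `is_async` flag, and the rule that `get_future` is called with identical arguments.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyAsyncClient:
    """Hypothetical stand-in for the vLLM client; not the PR's actual class."""

    def __init__(self, is_async):
        self.is_async = is_async
        self._pool = ThreadPoolExecutor(max_workers=2)
        # Pending results, keyed by the arguments passed to generate().
        self._pending = {}

    def _fake_generation(self, prompts, n):
        # Placeholder for the real round trip to the server.
        return [[f"{p}#{i}" for i in range(n)] for p in prompts]

    def generate(self, prompts, n=1):
        key = (tuple(prompts), n)
        self._pending[key] = self._pool.submit(self._fake_generation, prompts, n)
        if self.is_async:
            # Non-blocking: the caller fetches the result later via get_future().
            return None
        # Blocking: resolve the future before returning.
        return self.get_future(prompts, n)

    def get_future(self, prompts, n=1):
        # Identical arguments rebuild the same key and locate the pending result.
        key = (tuple(prompts), n)
        return self._pending.pop(key).result()
```

With `is_async=True` the caller is free to do other work between `generate` and `get_future`; with `is_async=False` the call behaves like the previous synchronous client.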

Speed up grpo_trainer by about 1.5x by submitting N=gradient_accumulation_steps batches up front, so that training and vLLM generation can run in parallel!
However, I have to admit that this piece of code is not elegant enough; feel free to remove it if it does not meet the bar.
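
The overlap can be pictured roughly as follows. This is only a sketch under stated assumptions: `fake_rollout`, `fake_train_step`, and `accumulate` are hypothetical helpers, and a single worker thread plays the role of the vLLM server. All N batches are submitted before any training starts, so the rollout for batch k+1 can run while the trainer is busy with batch k.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_rollout(batch):
    # Placeholder for generation on the vLLM server.
    return [x * 2 for x in batch]


def fake_train_step(completions):
    # Placeholder for one forward/backward pass; returns a dummy "loss".
    return sum(completions)


def accumulate(batches, gradient_accumulation_steps):
    # One batch per accumulation step is assumed here.
    assert len(batches) == gradient_accumulation_steps
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Submit every rollout first; none of these calls block.
        futures = [pool.submit(fake_rollout, b) for b in batches]
        losses = []
        for fut in futures:
            # Blocks only if this particular rollout has not finished yet.
            losses.append(fake_train_step(fut.result()))
    return losses
```

With N = 1 there is nothing left to overlap, because the single rollout must finish before the only training step can start.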

I left some room for more sophisticated vLLM inference functionality by adding a RolloutEngine in trl.scripts.vllm_serve. The goal is to support chains like lm_generate > MCP tool_call > lm_generate > another MCP tool_call > ..., but this is not complete yet.

Add vllm_server_nccl_port to the config (previously the default could not be changed).

binary-husky · 2025-03-30T13:33:58Z

Oh, there is another detail worth mentioning:

I added a version param: self.version += 1 whenever update_model_params is called.

On the server side, I added some lines with async sleep logic to ensure there is no ongoing generation.
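
A rough sketch of how these two pieces could fit together, assuming the wait happens before a weight update. `ToyServerState`, `in_flight`, and `demo` are hypothetical; only `version` and `update_model_params` are names from this PR. For brevity the sketch keeps the version counter and the wait loop in one class.

```python
import asyncio


class ToyServerState:
    """Hypothetical model of the server; not the PR's actual code."""

    def __init__(self):
        self.version = 0    # bumped on every weight update
        self.in_flight = 0  # number of generations currently running

    async def generate(self, prompt):
        self.in_flight += 1
        try:
            await asyncio.sleep(0.01)  # placeholder for real generation
            return f"{prompt}@v{self.version}"
        finally:
            self.in_flight -= 1

    async def update_model_params(self):
        # Async sleep loop: yield to the event loop until no generation is running.
        while self.in_flight > 0:
            await asyncio.sleep(0.001)
        self.version += 1


async def demo():
    state = ToyServerState()
    gen = asyncio.create_task(state.generate("hi"))
    await asyncio.sleep(0)  # let the generation start
    await state.update_model_params()
    return await gen, state.version
```

In this sketch the in-flight generation finishes under the old version before the counter is bumped, so a completion is never produced by a mix of old and new weights.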

fabianlim · 2025-03-31T03:01:48Z

@binary-husky the speedups you posted look great, though I have a question about how you parallelize the computation. The picture shows a data dependency between rollouts and model training (and the vLLM update).

Are you saying that within the gradient accumulation steps the rollouts do not change?
The completion_ids are futures; are you saying they will return enough rollouts for you to complete the grad accum step?

In other words, this achieves parallelization within grad accum steps, and works only if grad accum > 1?

binary-husky · 2025-03-31T06:23:11Z

@fabianlim Yes, it works only if gradient_accumulation_steps > 1.

vllm sync -> async

5f951c0

binary-husky changed the title ~~(AsyncLLMEngine) Change VLLM Server from Sync to Async~~ 🚀 (AsyncLLMEngine) Improve GRPO VLLM Server from Sync to Async Mar 30, 2025

binary-husky mentioned this pull request Mar 30, 2025

Co-Locating vLLM w/ training to achieve higher throughput and GPU utilization #3162

Open

5 tasks

shirinyamani self-requested a review March 30, 2025 17:40

binary-husky changed the title ~~🚀 (AsyncLLMEngine) Improve GRPO VLLM Server from Sync to Async~~ 🚀 Enhance GRPO VLLM server from sync to async and accelerate training Mar 31, 2025

binary-husky and others added 2 commits April 1, 2025 19:17

update task display

9bd7c17

Merge branch 'main' into binary_husky_main

5d75d5f
