
🚀 Enhance GRPO VLLM server from sync to async and accelerate training #3182

Open · wants to merge 3 commits into base: main
Conversation

binary-husky
Contributor

  1. Change the vLLM server from sync to async:
    • if is_async=True:
      the client first calls generate (non-blocking), then later calls get_future (with identical arguments) to fetch the result
    • if is_async=False:
      the client automatically calls get_future inside generate, blocking further execution until generation is complete
  2. Speed up grpo_trainer by about 1.5x by submitting N=gradient_accumulation_steps batches at once, so that training and vLLM generation can run in parallel!
    However, I have to admit this piece of code is not elegant; feel free to remove it if it does not qualify.
  3. Leave some room for more sophisticated vLLM inference functionality by adding a RolloutEngine in trl.scripts.vllm_serve, aiming to support lm_generate > MCP tool_call > lm_generate > another MCP tool_call > ..., but this is not complete yet.
  4. Add vllm_server_nccl_port to the config (previously the default could not be changed).
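The sync/async client behavior in point 1 can be sketched as follows. This is an illustrative stand-in, not the actual TRL client: the class name `AsyncRolloutClient` and its internals are assumptions; only the `generate`/`get_future`/`is_async` interface follows the PR description.

```python
import threading

class AsyncRolloutClient:
    """Minimal sketch of the sync/async client contract described above.
    The real client talks to a vLLM server; here a background thread
    stands in for the remote generation call."""

    def __init__(self, is_async: bool):
        self.is_async = is_async
        self._pending = {}  # request key -> holder dict with thread/result

    def _run(self, holder, prompts):
        # Stand-in for the actual vLLM server round-trip.
        holder["result"] = [p.upper() for p in prompts]

    def generate(self, prompts):
        key = tuple(prompts)
        holder = {"result": None}
        holder["thread"] = threading.Thread(target=self._run, args=(holder, prompts))
        self._pending[key] = holder
        holder["thread"].start()
        if self.is_async:
            return None  # non-blocking: caller retrieves later via get_future
        return self.get_future(prompts)  # blocking: wait inline

    def get_future(self, prompts):
        # Must be called with arguments identical to the generate() call.
        holder = self._pending.pop(tuple(prompts))
        holder["thread"].join()
        return holder["result"]
```

With `is_async=False` the trainer sees the old blocking behavior unchanged; with `is_async=True` it can do useful work between `generate` and `get_future`.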

@binary-husky binary-husky changed the title (AsyncLLMEngine) Change VLLM Server from Sync to Async 🚀 (AsyncLLMEngine) Improve GRPO VLLM Server from Sync to Async Mar 30, 2025
@binary-husky
Contributor Author

Oh, there is another detail worth mentioning:

I added a version param: self.version += 1 whenever update_model_params is called.

On the server side, I added some lines (an async sleep loop) to ensure there is no ongoing generation before the weights are updated.
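The versioning and quiescence idea can be sketched with asyncio. The names `WeightUpdateGuard` and `ongoing` are assumptions for illustration; only `version` and `update_model_params` come from the comment above.

```python
import asyncio

class WeightUpdateGuard:
    """Sketch of the versioned weight update described above: each update
    bumps `version` and waits (async sleep loop) until all in-flight
    generations have drained before swapping weights."""

    def __init__(self):
        self.version = 0
        self.ongoing = 0  # count of generations currently in flight

    async def generate(self, coro):
        self.ongoing += 1
        try:
            return await coro
        finally:
            self.ongoing -= 1

    async def update_model_params(self):
        self.version += 1
        # Async sleep loop: yield until no generation is ongoing.
        while self.ongoing > 0:
            await asyncio.sleep(0.05)
        # ... actual weight loading would happen here ...
```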

@shirinyamani shirinyamani self-requested a review March 30, 2025 17:40
@binary-husky binary-husky changed the title 🚀 (AsyncLLMEngine) Improve GRPO VLLM Server from Sync to Async 🚀 Enhance GRPO VLLM server from sync to async and accelerate training Mar 31, 2025
@fabianlim
Contributor

@binary-husky the speedups you posted look great, though I have a question about how you parallelize the computation. The picture shows a data dependency between rollouts and model training (and the vLLM update).

  • are you saying that within gradient accumulation steps the rollouts do not change?
  • the completion_ids are futures; are you saying they will return enough rollouts for you to complete the grad accum step?

In other words, this achieves parallelization within grad accum steps, and works only if grad accum > 1?


@binary-husky
Contributor Author

binary-husky commented Mar 31, 2025

@fabianlim Yes, it works only if the grad accum steps > 1.

