
Conversation


@wwwjn wwwjn commented Jan 16, 2026

@meta-cla meta-cla bot added the CLA Signed label Jan 16, 2026
wwwjn added a commit that referenced this pull request Jan 16, 2026
ghstack-source-id: ba69db2
Pull Request resolved: #2244
@wwwjn wwwjn changed the title refactor scorer and trainer generator actor [rl] refactor scorer and trainer generator actor Jan 16, 2026


@dataclass
class TrajectoryData:
Contributor

I thought we deprecated the name trajectory, which is intrinsically ambiguous, but I don't remember what we replaced it with. Episode?

rewards: torch.Tensor
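For readers skimming the thread, a dependency-free sketch of what such an episode/trajectory container looks like. Only `rewards` appears in the diff above; the other field names are assumptions for illustration, and in the PR the fields are torch.Tensors:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Hypothetical rollout container; field names other than rewards are assumed."""

    prompt_ids: Sequence[int]    # assumed field: token ids of the prompt
    response_ids: Sequence[int]  # assumed field: token ids of the generation
    rewards: Sequence[float]     # the field shown in the diff, as scored values
```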


class Scorer(Actor):
Contributor

I thought you chose to use Grader. Not sure what the difference is, but let's stay aligned.

def _load_initial_weights(self, model: torch.nn.Module, model_path: str) -> None:
"""Load initial weights from HuggingFace checkpoint."""
from torchtitan.experiments.rl.vllm_compat.weights.converter import (
    vllm_to_torchtitan,
)
Contributor

Why use this function instead of our utils like from_hf?
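For context on what such a checkpoint converter does: it typically remaps parameter names (and sometimes layouts) between two model definitions. A generic, dependency-free sketch of the key-remapping part; the mapping entry below is made up for illustration, the real vllm_to_torchtitan knows the actual names:

```python
def remap_state_dict(src: dict, key_map: dict) -> dict:
    """Rename checkpoint keys according to an explicit mapping; pass the rest through."""
    out = {}
    for name, tensor in src.items():
        out[key_map.get(name, name)] = tensor
    return out


# Illustrative mapping only, not the real vLLM <-> TorchTitan key names.
KEY_MAP = {"model.embed_tokens.weight": "tok_embeddings.weight"}
```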

q = q.transpose(1, 2)
k = k.transpose(1, 2)
v = v.transpose(1, 2)
# vLLM attention expects bfloat16 inputs
Contributor

I think this can't just happen for attention.
In torchtitan the default dtype is fp32, and mixed precision is handled by FSDP, so under pure TP the forward dtype is fp32.
If vLLM uses bf16 throughout by default, we should match; otherwise this is another place where the torchtitan-native vLLM forward would be slow.

Contributor Author

Yes, the dtype difference could be a reason we are 40% slower when TP is not enabled.
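To illustrate the mismatch being discussed: if vLLM runs in bf16 throughout while the torchtitan model holds fp32 parameters, one option is a one-time cast of the whole module instead of per-call conversions inside attention. A minimal PyTorch sketch of that idea, an assumption for illustration rather than what the PR implements:

```python
import torch


def match_vllm_dtype(model: torch.nn.Module,
                     dtype: torch.dtype = torch.bfloat16) -> torch.nn.Module:
    """Cast parameters and buffers once up front so the forward pass stays in bf16."""
    return model.to(dtype=dtype)
```

Under FSDP this cast is normally unnecessary because mixed precision handles it, which is exactly the reviewer's point about the pure-TP path.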

This demonstrates:
-1. Distributed actor architecture with Generator (vLLM) and Trainer (TorchTitan) components
+1. Distributed actor architecture with Generator (vLLM), Scorer, and Trainer (TorchTitan) components
 2. File based weight synchronization between trainer and generator
Contributor

is this still true?
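For context on point 2, file-based weight synchronization in its simplest form is an atomic write-then-poll handshake between trainer and generator. A generic sketch using pickle and a temp file; the PR's actual checkpoint format and protocol may differ:

```python
import os
import pickle
import tempfile


def publish_weights(state_dict: dict, path: str) -> None:
    """Trainer side: write to a temp file, then rename atomically so the
    generator never observes a partially written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state_dict, f)
    os.replace(tmp, path)


def poll_weights(path: str):
    """Generator side: load the latest published weights, or None if absent."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


# demo: round-trip through a fresh temp directory
_path = os.path.join(tempfile.mkdtemp(), "weights.pkl")
publish_weights({"layer.weight": [0.1, 0.2]}, _path)
restored = poll_weights(_path)
```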

job_config, # Pass full job_config
)

# Spawn scorer on trainer mesh (can share resources with trainer)
Contributor

Would like to learn more about how Scorer/Grader works with the trainer/generator.
Naively I would think it should be put on the generator mesh, not trainer_mesh, although the two meshes may be the same and you are only using gpus=0 right now.

Contributor Author

@wwwjn wwwjn Jan 20, 2026

https://github.com/meta-pytorch/monarch/blob/main/docs/source/examples/grpo_actor.py#L505

I followed the practice here: the scorer is spawned on the trainer mesh. My intuition is that the main bottleneck is the generator (generation takes the longest), so we want to put extra work (e.g., calculating rewards and advantages) on the trainer side instead.

If we only think about the algorithm, the scorer could live on either the trainer or the generator. If we put it on the generator, the generated episodes come out already scored; if we put it on the trainer, the generator can just pass "unscored" episodes to the trainer.
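The two placements described above can be sketched as two pipeline shapes. All names here are hypothetical stand-ins, not the PR's actual actor API:

```python
def generate(prompt: str) -> dict:
    """Stand-in for vLLM generation: echo the reversed prompt as the response."""
    return {"prompt": prompt, "response": prompt[::-1]}


def score(episode: dict) -> dict:
    """Stand-in scorer: reward is just the response length."""
    return dict(episode, reward=float(len(episode["response"])))


# Option A: scorer co-located with the generator; trainer receives scored episodes.
def generator_side(prompt: str) -> dict:
    return score(generate(prompt))


# Option B (the PR's choice): scorer lives on the trainer mesh; the generator
# ships unscored episodes and stays free to keep generating.
def trainer_side(unscored: dict) -> dict:
    return score(unscored)
```

Option B keeps the generator, the assumed bottleneck, doing nothing but generation.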

@wwwjn wwwjn changed the title [rl] refactor scorer and trainer generator actor [rl] refactor grader and trainer generator actor Jan 20, 2026
wwwjn added a commit that referenced this pull request Jan 20, 2026
ghstack-source-id: f2a8d93
Pull Request resolved: #2244

Labels

ciflow/8gpu, CLA Signed
