feat(moe): add topk_before_score routing and use_router_bias support #2212
base: main
Conversation
Add support for GPT-OSS style MoE routing with a TopK-then-Score strategy.

Changes to MoEArgs:
- Add topk_before_score: bool = False (preserves backward compatibility)
- Add use_router_bias: bool = False (for the GPT-OSS learned router bias)
- Add use_expert_bias: bool = False (for GPT-OSS expert biases)

Changes to TokenChoiceTopKRouter:
- Support both routing strategies:
  - TopK-then-Score (GPT-OSS): topk on raw logits, then score_func
  - Score-then-TopK (DeepSeek-V3): score_func on all logits, then topk
- Add validation: node-limited routing is not supported with topk_before_score
- Fix expert_bias handling: gather unbiased logits for score computation
- Add _debug_force_load_balance support for the TopK-then-Score path
- Add explicit bias initialization in init_weights
- Change histc to bincount for token counting (more reliable)

GPT-OSS config updates:
- 20B/120B: Add topk_before_score=True, use_router_bias=True, use_expert_bias=True
- 120B: Fix top_k from 4 to 8 (the correct value for the 120B model)
- Add GptOssStateDictAdapter to __all__
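For context, a minimal standalone sketch of the two routing strategies described above; the function names are illustrative and this is not the torchtitan TokenChoiceTopKRouter itself:

```python
import torch
import torch.nn.functional as F


def route_topk_then_score(router_logits: torch.Tensor, top_k: int):
    # GPT-OSS style: select top-k experts on raw logits, then apply the score
    # function only to the selected logits.
    top_logits, selected_experts = torch.topk(router_logits, k=top_k, dim=-1)
    scores = F.softmax(top_logits.to(torch.float32), dim=-1)
    return scores, selected_experts


def route_score_then_topk(router_logits: torch.Tensor, top_k: int):
    # DeepSeek-V3 / Llama4 style: apply the score function over all experts,
    # then select the top-k scores.
    all_scores = F.softmax(router_logits.to(torch.float32), dim=-1)
    scores, selected_experts = torch.topk(all_scores, k=top_k, dim=-1)
    return scores, selected_experts


logits = torch.randn(4, 8)  # (num_tokens, num_experts)
w1, e1 = route_topk_then_score(logits, top_k=2)
w2, e2 = route_score_then_topk(logits, top_k=2)
# With a monotonic score function the chosen experts coincide, but the weights
# differ: TopK-then-Score renormalizes over only the selected experts (sums to 1),
# while Score-then-TopK keeps the weights from the full softmax (sums to < 1).
print(w1.sum(dim=-1), w2.sum(dim=-1))
```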
Pull request overview
This PR adds support for GPT-OSS style MoE routing with a TopK-then-Score strategy, distinct from the existing Score-then-TopK strategy used by DeepSeek-V3 and Llama4 models. The changes introduce new routing options while maintaining backward compatibility.
Key changes:
- Added three new fields to MoEArgs: topk_before_score, use_router_bias, and use_expert_bias
- Refactored TokenChoiceTopKRouter to support both routing strategies with proper validation
- Updated GPT-OSS configurations (20B and 120B) with correct routing parameters and top_k values
- Improved token counting by replacing histc with bincount for better reliability
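A hedged sketch of what the config-side change amounts to; the three new field names and the corrected top_k=8 come from this PR, while the surrounding dataclass and any other values are illustrative stand-ins rather than the actual torchtitan MoEArgs:

```python
from dataclasses import dataclass


@dataclass
class MoEArgsSketch:
    # Hypothetical stand-in for torchtitan's MoEArgs; only the three new fields
    # and top_k are taken from the PR description.
    top_k: int = 1
    topk_before_score: bool = False  # True enables GPT-OSS TopK-then-Score routing
    use_router_bias: bool = False    # learned bias on the router logits
    use_expert_bias: bool = False    # learned biases inside the experts


# GPT-OSS 120B settings per the PR: top_k corrected from 4 to 8, and all three
# new flags enabled. The 20B config gets the same flags with its own top_k.
gpt_oss_120b_moe = MoEArgsSketch(
    top_k=8,
    topk_before_score=True,
    use_router_bias=True,
    use_expert_bias=True,
)
```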
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| torchtitan/models/moe/moe.py | Added new MoEArgs fields for GPT-OSS routing, refactored TokenChoiceTopKRouter to support both TopK-then-Score and Score-then-TopK strategies, added router bias initialization, and changed token counting from histc to bincount |
| torchtitan/models/gpt_oss/__init__.py | Updated 20B and 120B configurations with new routing parameters (topk_before_score, use_router_bias, use_expert_bias), fixed 120B top_k from 4 to 8, and added GptOssStateDictAdapter to the __all__ exports |
torchtitan/models/moe/moe.py
Outdated
    else:
        all_scores = F.softmax(router_logits.to(torch.float32), dim=1)
Copilot AI (Jan 8, 2026)
Inconsistent error handling for score_func validation. In the _debug_force_load_balance path (lines 331-334), when score_func is not "sigmoid", it assumes "softmax" without validation (line 334). However, in the normal topk_before_score path (lines 350-355), there's proper validation with an else clause that raises NotImplementedError for unknown score functions. This inconsistency means invalid score_func values could be silently accepted in debug mode but rejected in normal mode.
Add an explicit elif check for "softmax" and an else clause that raises NotImplementedError, similar to lines 350-355.
Suggested change:
-    else:
-        all_scores = F.softmax(router_logits.to(torch.float32), dim=1)
+    elif self.score_func == "softmax":
+        all_scores = F.softmax(router_logits.to(torch.float32), dim=1)
+    else:
+        raise NotImplementedError(f"Unknown score function {self.score_func}")
torchtitan/models/moe/moe.py
Outdated
    ) = self._debug_force_load_balance_routing(scores)
    _, selected_experts_indices = torch.topk(
        scores_for_choice, k=self.top_k, dim=-1, sorted=False
    )
Copilot AI (Jan 8, 2026)
The _debug_force_load_balance handling is missing in the Score-then-TopK path (the else branch starting at line 356). In the original code, _debug_force_load_balance_routing was called after the topk selection to override the routing. However, in the refactored code, this check only exists in the topk_before_score path (lines 330-335) but is completely missing in the Score-then-TopK path. This means _debug_force_load_balance won't work when topk_before_score=False, which is the default setting and the setting used by DeepSeek-V3 and Llama4 models.
Add the _debug_force_load_balance check in the Score-then-TopK path after line 376, similar to how it was in the original code.
Suggested change:
         )
+        # Optionally override routing for debugging (force load balance), similar to topk_before_score path
+        if getattr(self, "_debug_force_load_balance", False):
+            selected_experts_indices = self._debug_force_load_balance_routing(
+                selected_experts_indices
+            )
torchtitan/models/moe/moe.py
Outdated
    route_norm: bool,
    route_scale: float,
    use_router_bias: bool = False,
    topk_before_score: bool = True,
Copilot AI (Jan 8, 2026)
The default value for topk_before_score parameter is inconsistent between the constructor and MoEArgs. In the constructor (line 224), it defaults to True, but in MoEArgs (line 34), it defaults to False. This inconsistency could lead to unexpected behavior if the constructor is called directly without going through MoEArgs. The constructor default should be False to match MoEArgs and maintain backward compatibility.
Suggested change:
-    topk_before_score: bool = True,
+    topk_before_score: bool = False,
- Fix default topk_before_score=False in constructor to match MoEArgs
- Add _debug_force_load_balance support to Score-then-TopK path
- Add explicit score_func validation in debug path
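Taken together, the first two review points amount to something like the following sketch for the Score-then-TopK path; this is a hypothetical standalone rendering, not the actual router code, and the round-robin assignment here only stands in for _debug_force_load_balance_routing:

```python
import torch
import torch.nn.functional as F


def score_then_topk(
    router_logits: torch.Tensor,
    top_k: int,
    score_func: str = "softmax",
    debug_force_load_balance: bool = False,
):
    # Apply the score function over all experts, with explicit validation
    # instead of silently treating unknown values as softmax.
    if score_func == "sigmoid":
        all_scores = torch.sigmoid(router_logits.to(torch.float32))
    elif score_func == "softmax":
        all_scores = F.softmax(router_logits.to(torch.float32), dim=-1)
    else:
        raise NotImplementedError(f"Unknown score function {score_func}")

    scores, selected_experts = torch.topk(all_scores, k=top_k, dim=-1)

    if debug_force_load_balance:
        # Keep the debug override on this path too: spread tokens across
        # experts round-robin and re-gather the corresponding scores.
        num_tokens, num_experts = router_logits.shape
        selected_experts = (
            torch.arange(num_tokens * top_k) % num_experts
        ).reshape(num_tokens, top_k)
        scores = torch.gather(all_scores, dim=-1, index=selected_experts)

    return scores, selected_experts
```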
gpt-oss has its own MoE implementation; please don't modify the vanilla one.
https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/gpt_oss/model/moe.py
Ok, I'll take a look. Thank you for the pointer.