
Conversation


@eous eous commented Jan 8, 2026

Add support for GPT-OSS style MoE routing with TopK-then-Score strategy.

Changes to MoEArgs:
- Add topk_before_score: bool = False (preserve backward compat)
- Add use_router_bias: bool = False (for GPT-OSS learned router bias)
- Add use_expert_bias: bool = False (for GPT-OSS expert biases)

Changes to TokenChoiceTopKRouter:
- Support both routing strategies (see the sketch after this description):
  - TopK-then-Score (GPT-OSS): topk on raw logits, then score_func
  - Score-then-TopK (DeepSeek-V3): score_func to all, then topk
- Add validation: node-limited routing not supported with topk_before_score
- Fix expert_bias handling: gather unbiased logits for score computation
- Add _debug_force_load_balance support for TopK-then-Score path
- Add explicit bias initialization in init_weights
- Change histc to bincount for token counting (more reliable)

GPT-OSS config updates:
- 20B/120B: Add topk_before_score=True, use_router_bias=True, use_expert_bias=True
- 120B: Fix top_k from 4 to 8 (correct value for 120B model)
- Add GptOssStateDictAdapter to __all__
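
For readers comparing the two strategies, here is a minimal sketch of the difference (hypothetical function and tensor names, not the actual TokenChoiceTopKRouter code):

```python
import torch
import torch.nn.functional as F

def route(router_logits: torch.Tensor, top_k: int, score_func: str,
          topk_before_score: bool):
    # router_logits: (num_tokens, num_experts)
    if topk_before_score:
        # TopK-then-Score (GPT-OSS): select experts on raw logits,
        # then apply the score function to the selected logits only.
        top_logits, indices = torch.topk(router_logits, k=top_k, dim=-1)
        if score_func == "softmax":
            scores = F.softmax(top_logits.to(torch.float32), dim=-1)
        elif score_func == "sigmoid":
            scores = torch.sigmoid(top_logits.to(torch.float32))
        else:
            raise NotImplementedError(f"Unknown score function {score_func}")
    else:
        # Score-then-TopK (DeepSeek-V3, Llama4): score all experts,
        # then select the top-k scores.
        if score_func == "softmax":
            all_scores = F.softmax(router_logits.to(torch.float32), dim=-1)
        elif score_func == "sigmoid":
            all_scores = torch.sigmoid(router_logits.to(torch.float32))
        else:
            raise NotImplementedError(f"Unknown score function {score_func}")
        scores, indices = torch.topk(all_scores, k=top_k, dim=-1)
    return scores, indices
```

Note that with softmax the two orders are not equivalent: TopK-then-Score normalizes over only the selected experts (the k weights sum to 1), while Score-then-TopK normalizes over all experts before selecting.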
Copilot AI review requested due to automatic review settings January 8, 2026 23:49
@meta-cla bot added the CLA Signed label Jan 8, 2026

Copilot AI left a comment


Pull request overview

This PR adds support for GPT-OSS style MoE routing with a TopK-then-Score strategy, distinct from the existing Score-then-TopK strategy used by DeepSeek-V3 and Llama4 models. The changes introduce new routing options while maintaining backward compatibility.

Key changes:

  • Added three new fields to MoEArgs: topk_before_score, use_router_bias, and use_expert_bias
  • Refactored TokenChoiceTopKRouter to support both routing strategies with proper validation
  • Updated GPT-OSS configurations (20B and 120B) with correct routing parameters and top_k values
  • Improved token counting by replacing histc with bincount for better reliability (see the sketch below)
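
A sketch of why bincount is the more reliable choice for this counting (illustrative only): histc bins floating-point values by range, so integer expert ids that fall exactly on a bin edge are subject to floating-point boundary behavior, while bincount counts integer ids exactly.

```python
import torch

num_experts = 8
selected_experts_indices = torch.randint(0, num_experts, (1024, 2))

# histc: needs a float cast and value-range binning; the integer ids sit
# exactly on the bin boundaries, which is where miscounting can occur.
counts_histc = torch.histc(
    selected_experts_indices.float(), bins=num_experts, min=0, max=num_experts
)

# bincount: exact integer counting, one bin per expert id.
counts_bincount = torch.bincount(
    selected_experts_indices.view(-1), minlength=num_experts
)
```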

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File-by-file summary:

  • torchtitan/models/moe/moe.py: Added new MoEArgs fields for GPT-OSS routing, refactored TokenChoiceTopKRouter to support both TopK-then-Score and Score-then-TopK strategies, added router bias initialization, and changed token counting from histc to bincount
  • torchtitan/models/gpt_oss/__init__.py: Updated 20B and 120B configurations with new routing parameters (topk_before_score, use_router_bias, use_expert_bias), fixed 120B top_k from 4 to 8, and added GptOssStateDictAdapter to the __all__ exports


Comment on lines 333 to 334
else:
    all_scores = F.softmax(router_logits.to(torch.float32), dim=1)

Copilot AI Jan 8, 2026


Inconsistent error handling for score_func validation. In the _debug_force_load_balance path (lines 331-334), when score_func is not "sigmoid", it assumes "softmax" without validation (line 334). However, in the normal topk_before_score path (lines 350-355), there's proper validation with an else clause that raises NotImplementedError for unknown score functions. This inconsistency means invalid score_func values could be silently accepted in debug mode but rejected in normal mode.

Add an explicit elif check for "softmax" and an else clause that raises NotImplementedError, similar to lines 350-355.

Suggested change
-else:
-    all_scores = F.softmax(router_logits.to(torch.float32), dim=1)
+elif self.score_func == "softmax":
+    all_scores = F.softmax(router_logits.to(torch.float32), dim=1)
+else:
+    raise NotImplementedError(f"Unknown score function {self.score_func}")

) = self._debug_force_load_balance_routing(scores)
_, selected_experts_indices = torch.topk(
    scores_for_choice, k=self.top_k, dim=-1, sorted=False
)

Copilot AI Jan 8, 2026


The _debug_force_load_balance handling is missing in the Score-then-TopK path (the else branch starting at line 356). In the original code, _debug_force_load_balance_routing was called after the topk selection to override the routing. However, in the refactored code, this check only exists in the topk_before_score path (lines 330-335) but is completely missing in the Score-then-TopK path. This means _debug_force_load_balance won't work when topk_before_score=False, which is the default setting and the setting used by DeepSeek-V3 and Llama4 models.

Add the _debug_force_load_balance check in the Score-then-TopK path after line 376, similar to how it was in the original code.

Suggested change
)
)
# Optionally override routing for debugging (force load balance), similar to topk_before_score path
if getattr(self, "_debug_force_load_balance", False):
selected_experts_indices = self._debug_force_load_balance_routing(
selected_experts_indices
)
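
For context, a forced load-balance override of this kind typically discards the learned routing and assigns experts round-robin so every expert receives an equal share of tokens. A hypothetical sketch of the idea (not the actual _debug_force_load_balance_routing implementation):

```python
import torch

def debug_force_load_balance_routing(
    selected_experts_indices: torch.Tensor, num_experts: int
) -> torch.Tensor:
    # Ignore the learned routing and assign expert ids round-robin so
    # each expert sees the same number of tokens (useful when debugging
    # throughput independently of router quality).
    num_tokens, top_k = selected_experts_indices.shape
    flat = torch.arange(
        num_tokens * top_k, device=selected_experts_indices.device
    ) % num_experts
    return flat.view(num_tokens, top_k)
```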

    route_norm: bool,
    route_scale: float,
    use_router_bias: bool = False,
    topk_before_score: bool = True,

Copilot AI Jan 8, 2026


The default value for topk_before_score parameter is inconsistent between the constructor and MoEArgs. In the constructor (line 224), it defaults to True, but in MoEArgs (line 34), it defaults to False. This inconsistency could lead to unexpected behavior if the constructor is called directly without going through MoEArgs. The constructor default should be False to match MoEArgs and maintain backward compatibility.

Suggested change
-    topk_before_score: bool = True,
+    topk_before_score: bool = False,
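
A minimal sketch of the aligned defaults after this fix (class and field names from the PR, with surrounding arguments elided):

```python
from dataclasses import dataclass

from torch import nn

@dataclass
class MoEArgs:
    # Default is Score-then-TopK for backward compatibility.
    topk_before_score: bool = False

class TokenChoiceTopKRouter(nn.Module):
    # The constructor default matches MoEArgs, so constructing the
    # router directly behaves the same as going through MoEArgs.
    def __init__(self, topk_before_score: bool = False):
        super().__init__()
        self.topk_before_score = topk_before_score
```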

@shuhuayu shuhuayu self-assigned this Jan 8, 2026
- Fix default topk_before_score=False in constructor to match MoEArgs
- Add _debug_force_load_balance support to Score-then-TopK path
- Add explicit score_func validation in debug path
Contributor


gpt-oss has its own MoE implementation; please don't modify the vanilla one.
https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/gpt_oss/model/moe.py

Author


Ok, I'll take a look. Thank you for the pointer.
