feat(moe): add topk_before_score routing and use_router_bias support #2212
base: main
Conversation
Add support for GPT-OSS style MoE routing with a TopK-then-Score strategy.

Changes to MoEArgs:
- Add topk_before_score: bool = False (preserves backward compatibility)
- Add use_router_bias: bool = False (for the GPT-OSS learned router bias)
- Add use_expert_bias: bool = False (for GPT-OSS expert biases)

Changes to TokenChoiceTopKRouter:
- Support both routing strategies:
  - TopK-then-Score (GPT-OSS): topk on raw logits, then score_func
  - Score-then-TopK (DeepSeek-V3): score_func on all logits, then topk
- Add validation: node-limited routing is not supported with topk_before_score
- Fix expert_bias handling: gather unbiased logits for score computation
- Add _debug_force_load_balance support for the TopK-then-Score path
- Add explicit bias initialization in init_weights
- Change histc to bincount for token counting (more reliable)

GPT-OSS config updates:
- 20B/120B: Add topk_before_score=True, use_router_bias=True, use_expert_bias=True
- 120B: Fix top_k from 4 to 8 (the correct value for the 120B model)
- Add GptOssStateDictAdapter to __all__
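For context, a minimal standalone sketch of the two routing strategies described above; the function names are illustrative and this is not the torchtitan TokenChoiceTopKRouter itself:

```python
import torch
import torch.nn.functional as F


def route_topk_then_score(router_logits: torch.Tensor, top_k: int):
    # GPT-OSS style: select top-k experts on raw logits, then apply the score
    # function only to the selected logits.
    top_logits, selected_experts = torch.topk(router_logits, k=top_k, dim=-1)
    scores = F.softmax(top_logits.to(torch.float32), dim=-1)
    return scores, selected_experts


def route_score_then_topk(router_logits: torch.Tensor, top_k: int):
    # DeepSeek-V3 / Llama4 style: apply the score function over all experts,
    # then select the top-k scores.
    all_scores = F.softmax(router_logits.to(torch.float32), dim=-1)
    scores, selected_experts = torch.topk(all_scores, k=top_k, dim=-1)
    return scores, selected_experts


logits = torch.randn(4, 8)  # (num_tokens, num_experts)
w1, e1 = route_topk_then_score(logits, top_k=2)
w2, e2 = route_score_then_topk(logits, top_k=2)
# With a monotonic score function the chosen experts coincide, but the weights
# differ: TopK-then-Score renormalizes over only the selected experts (sums to 1),
# while Score-then-TopK keeps the weights from the full softmax (sums to < 1).
print(w1.sum(dim=-1), w2.sum(dim=-1))
```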
Pull request overview
This PR adds support for GPT-OSS style MoE routing with a TopK-then-Score strategy, distinct from the existing Score-then-TopK strategy used by DeepSeek-V3 and Llama4 models. The changes introduce new routing options while maintaining backward compatibility.
Key changes:
- Added three new fields to MoEArgs: topk_before_score, use_router_bias, and use_expert_bias
- Refactored TokenChoiceTopKRouter to support both routing strategies with proper validation
- Updated GPT-OSS configurations (20B and 120B) with correct routing parameters and top_k values
- Improved token counting by replacing histc with bincount for better reliability
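A hedged sketch of what the config-side change amounts to; the three new field names and the corrected top_k=8 come from this PR, while the surrounding dataclass and any other values are illustrative stand-ins rather than the actual torchtitan MoEArgs:

```python
from dataclasses import dataclass


@dataclass
class MoEArgsSketch:
    # Hypothetical stand-in for torchtitan's MoEArgs; only the three new fields
    # and top_k are taken from the PR description.
    top_k: int = 1
    topk_before_score: bool = False  # True enables GPT-OSS TopK-then-Score routing
    use_router_bias: bool = False    # learned bias on the router logits
    use_expert_bias: bool = False    # learned biases inside the experts


# GPT-OSS 120B settings per the PR: top_k corrected from 4 to 8, and all three
# new flags enabled. The 20B config gets the same flags with its own top_k.
gpt_oss_120b_moe = MoEArgsSketch(
    top_k=8,
    topk_before_score=True,
    use_router_bias=True,
    use_expert_bias=True,
)
```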
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| torchtitan/models/moe/moe.py | Added new MoEArgs fields for GPT-OSS routing, refactored TokenChoiceTopKRouter to support both TopK-then-Score and Score-then-TopK strategies, added router bias initialization, and changed token counting from histc to bincount |
| torchtitan/models/gpt_oss/__init__.py | Updated 20B and 120B configurations with new routing parameters (topk_before_score, use_router_bias, use_expert_bias), fixed 120B top_k from 4 to 8, and added GptOssStateDictAdapter to the __all__ exports |
torchtitan/models/moe/moe.py
Outdated
    else:
        all_scores = F.softmax(router_logits.to(torch.float32), dim=1)
Copilot AI (Jan 8, 2026)
Inconsistent error handling for score_func validation. In the _debug_force_load_balance path (lines 331-334), when score_func is not "sigmoid", it assumes "softmax" without validation (line 334). However, in the normal topk_before_score path (lines 350-355), there's proper validation with an else clause that raises NotImplementedError for unknown score functions. This inconsistency means invalid score_func values could be silently accepted in debug mode but rejected in normal mode.
Add an explicit elif check for "softmax" and an else clause that raises NotImplementedError, similar to lines 350-355.
Suggested change:
-    else:
-        all_scores = F.softmax(router_logits.to(torch.float32), dim=1)
+    elif self.score_func == "softmax":
+        all_scores = F.softmax(router_logits.to(torch.float32), dim=1)
+    else:
+        raise NotImplementedError(f"Unknown score function {self.score_func}")
torchtitan/models/moe/moe.py
Outdated
    ) = self._debug_force_load_balance_routing(scores)
    _, selected_experts_indices = torch.topk(
        scores_for_choice, k=self.top_k, dim=-1, sorted=False
    )
Copilot AI (Jan 8, 2026)
The _debug_force_load_balance handling is missing in the Score-then-TopK path (the else branch starting at line 356). In the original code, _debug_force_load_balance_routing was called after the topk selection to override the routing. However, in the refactored code, this check only exists in the topk_before_score path (lines 330-335) but is completely missing in the Score-then-TopK path. This means _debug_force_load_balance won't work when topk_before_score=False, which is the default setting and the setting used by DeepSeek-V3 and Llama4 models.
Add the _debug_force_load_balance check in the Score-then-TopK path after line 376, similar to how it was in the original code.
Suggested change:
         )
+        # Optionally override routing for debugging (force load balance), similar to topk_before_score path
+        if getattr(self, "_debug_force_load_balance", False):
+            selected_experts_indices = self._debug_force_load_balance_routing(
+                selected_experts_indices
+            )
torchtitan/models/moe/moe.py
Outdated
    route_norm: bool,
    route_scale: float,
    use_router_bias: bool = False,
    topk_before_score: bool = True,
Copilot AI (Jan 8, 2026)
The default value for topk_before_score parameter is inconsistent between the constructor and MoEArgs. In the constructor (line 224), it defaults to True, but in MoEArgs (line 34), it defaults to False. This inconsistency could lead to unexpected behavior if the constructor is called directly without going through MoEArgs. The constructor default should be False to match MoEArgs and maintain backward compatibility.
Suggested change:
-    topk_before_score: bool = True,
+    topk_before_score: bool = False,
- Fix default topk_before_score=False in constructor to match MoEArgs
- Add _debug_force_load_balance support to Score-then-TopK path
- Add explicit score_func validation in debug path
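Taken together, the first two review points amount to something like the following sketch for the Score-then-TopK path; this is a hypothetical standalone rendering, not the actual router code, and the round-robin assignment here only stands in for _debug_force_load_balance_routing:

```python
import torch
import torch.nn.functional as F


def score_then_topk(
    router_logits: torch.Tensor,
    top_k: int,
    score_func: str = "softmax",
    debug_force_load_balance: bool = False,
):
    # Apply the score function over all experts, with explicit validation
    # instead of silently treating unknown values as softmax.
    if score_func == "sigmoid":
        all_scores = torch.sigmoid(router_logits.to(torch.float32))
    elif score_func == "softmax":
        all_scores = F.softmax(router_logits.to(torch.float32), dim=-1)
    else:
        raise NotImplementedError(f"Unknown score function {score_func}")

    scores, selected_experts = torch.topk(all_scores, k=top_k, dim=-1)

    if debug_force_load_balance:
        # Keep the debug override on this path too: spread tokens across
        # experts round-robin and re-gather the corresponding scores.
        num_tokens, num_experts = router_logits.shape
        selected_experts = (
            torch.arange(num_tokens * top_k) % num_experts
        ).reshape(num_tokens, top_k)
        scores = torch.gather(all_scores, dim=-1, index=selected_experts)

    return scores, selected_experts
```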
gpt-oss has its own MoE implementation; please don't modify the vanilla one.
https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/gpt_oss/model/moe.py
Ok, I'll take a look. Thank you for the pointer.