fix(gpt-oss): correct attention sink from sigmoid to LSE renormalization #2211
base: main
Conversation
The previous sigmoid-based attention sink implementation was mathematically incorrect. This fix uses proper LSE (Log-Sum-Exp) renormalization that is equivalent to HuggingFace's concat+softmax approach.

Mathematical equivalence (see the sketch below):
- HF approach: concat the sink logit onto the scores, softmax over K+1 positions, drop the sink position
- Our approach: compute combined_lse = logsumexp([lse, sink]), then renormalize the output by exp(old_lse - new_lse)

Changes:
- Replace sigmoid(lse - sink) with proper LSE renormalization
- Add clamping to [-20, 0] for numerical stability
- Add a comprehensive test suite validating equivalence to the HF reference

Reference: HuggingFace transformers/integrations/flex_attention.py, lines 309-322
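A minimal, self-contained sketch of that equivalence (the toy sizes, tensor names, and shapes below are illustrative, not the PR's actual code):

```python
import torch

torch.manual_seed(0)
B, H, Q, K, D = 2, 4, 3, 5, 8
scores = torch.randn(B, H, Q, K)       # attention logits
values = torch.randn(B, H, K, D)
sink = torch.randn(H)                  # one sink logit per head

# HF-style: append the sink logit, softmax over K+1, drop the sink column
sink_col = sink.view(1, H, 1, 1).expand(B, H, Q, 1)
probs_hf = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)[..., :-1]
out_hf = probs_hf @ values

# LSE renormalization: rescale the standard attention output
out = torch.softmax(scores, dim=-1) @ values
lse = torch.logsumexp(scores, dim=-1)                    # old_lse, shape [B, H, Q]
combined_lse = torch.logaddexp(lse, sink.view(1, H, 1))  # logsumexp([lse, sink])
out_lse = out * torch.exp(lse - combined_lse).unsqueeze(-1)

assert torch.allclose(out_hf, out_lse, atol=1e-5)
```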
Pull request overview
This PR fixes a mathematical error in the attention sink implementation by replacing an incorrect sigmoid-based approach with proper LSE (Log-Sum-Exp) renormalization. The new implementation is mathematically equivalent to HuggingFace's concat+softmax approach and includes comprehensive test coverage.
Key Changes:
- Replace sigmoid-based attention sink rescaling with LSE renormalization in the attention mechanism
- Add numerical stability clamping to the renormalization factor computation
- Implement comprehensive test suite validating mathematical equivalence with HuggingFace reference implementation
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| torchtitan/models/gpt_oss/model/model.py | Replaced incorrect sigmoid(lse - sink) with proper LSE renormalization using logsumexp and exp(lse - combined_lse), with clamping for numerical stability |
| torchtitan/models/gpt_oss/tests/test_attention_sink.py | Added comprehensive test suite with equivalence tests, probability mass preservation checks, and edge case validation against HuggingFace reference |
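For orientation, a hedged sketch of what the rescaling step described in the first row could look like; the function and argument names here are assumptions for illustration, not torchtitan's actual code:

```python
import torch

def rescale_with_sink(attn_out: torch.Tensor,  # [B, H, Q, D] standard attention output
                      lse: torch.Tensor,       # [B, H, Q]    logsumexp of the attention logits
                      sinks: torch.Tensor):    # [H]          per-head sink logit
    # combined_lse = logsumexp([lse, sink]) per head and query position
    combined_lse = torch.logaddexp(lse, sinks.view(1, -1, 1))
    # lse - combined_lse is always <= 0; clamp to [-20, 0] for numerical stability
    log_scale = torch.clamp(lse - combined_lse, min=-20.0, max=0.0)
    return attn_out * torch.exp(log_scale).unsqueeze(-1)
```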
values = torch.randn(batch_size, num_heads, seq_len, head_dim)

# Per-head sink weights (typically small negative to positive)
sinks = torch.randn(num_heads) * 2  # Range roughly [-4, 4]
Copilot AI · Jan 8, 2026
The comment states the range is "roughly [-4, 4]" but torch.randn() * 2 produces values with mean=0 and std=2, so approximately 99.7% of values fall within [-6, 6] (3 standard deviations). Consider updating the comment to reflect the actual approximate range as "roughly [-6, 6]".
Suggested change:
- sinks = torch.randn(num_heads) * 2  # Range roughly [-4, 4]
+ sinks = torch.randn(num_heads) * 2  # Range roughly [-6, 6]
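A quick empirical check of the three-sigma claim above (illustrative only):

```python
import torch

x = torch.randn(1_000_000) * 2                    # mean 0, std 2
frac = ((x >= -6) & (x <= 6)).float().mean().item()
print(f"fraction within [-6, 6]: {frac:.4f}")     # ~0.997
```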
scores, values, sinks = setup_tensors

# Standard attention
probs = torch.softmax(scores, dim=-1)
Copilot AI · Jan 8, 2026
Variable probs is not used.
Suggested change:
- probs = torch.softmax(scores, dim=-1)
- Fix comment: randn() * 2 range is roughly [-6, 6], not [-4, 4]
- Remove unused probs variable in test_probability_mass_preserved
@wwwjn could you take a look?