
Conversation


@eous eous commented Jan 8, 2026

Remove an overly conservative restriction that disabled mixed precision for TP-only configurations. torch.autocast operates at the operator level and is orthogonal to tensor parallelism.

Before: TP-only training would show a warning and disable mixed precision
After: TP-only training uses torch.autocast for mixed precision

Note: PP-only training uses schedule-based execution and doesn't use maybe_enable_amp (unchanged by this PR).

Affected configurations:

  • TP-only (now enabled)
  • DDP-only (was already enabled)
  • Single-device (was already enabled)
  • FSDP/HSDP (unchanged - handled internally by fully_shard)
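
For illustration, a minimal sketch of what a TP-only step under torch.autocast looks like. The toy model, mesh setup, and loss below are illustrative only (not torchtitan's model or config) and assume a torchrun launch with 2+ GPUs:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class MLP(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.w1 = nn.Linear(dim, 4 * dim, bias=False)
        self.w2 = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# One TP mesh over all ranks in the job -- TP only, no FSDP wrapping.
tp_mesh = init_device_mesh("cuda", (dist.get_world_size(),))
model = MLP().cuda()  # params stay fp32
parallelize_module(model, tp_mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")

# autocast decides per operator whether to run in bf16; it does not care
# whether the weights it sees are plain tensors or TP-sharded DTensors.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)                    # matmuls execute in bf16
    loss = out.float().pow(2).mean()  # loss computed in fp32
loss.backward()                       # param grads land in fp32
optimizer.step()
```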

Copilot AI review requested due to automatic review settings January 8, 2026 23:49
@meta-cla bot added the CLA Signed label Jan 8, 2026

Copilot AI left a comment


Pull request overview

This PR removes an overly conservative restriction that disabled mixed precision training for Tensor Parallelism (TP) configurations without FSDP. The change enables torch.autocast for TP-only training, recognizing that autocast operates at the operator level and is orthogonal to the parallelism strategy.

Key changes:

  • Simplified maybe_enable_amp function logic to enable autocast for all non-FSDP configurations
  • Improved code comments to clarify when mixed precision is handled by FSDP vs AMP
  • Added explanation that PP uses schedule-based execution and doesn't utilize this context
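
As a rough illustration of that gating, here is a hypothetical sketch of the decision logic, not torchtitan's actual maybe_enable_amp; the function name suffix and parameter names are assumptions for illustration:

```python
import contextlib
import torch

def maybe_enable_amp_sketch(mixed_precision: bool, fsdp_enabled: bool,
                            device_type: str = "cuda"):
    """Return the context manager to wrap the training-step forward in."""
    if not mixed_precision:
        return contextlib.nullcontext()
    if fsdp_enabled:
        # FSDP/HSDP: fully_shard already handles param/reduce dtypes via its
        # mixed precision policy, so autocast is not layered on top.
        return contextlib.nullcontext()
    # Everything else (single device, DDP, and now TP-only): operator-level
    # autocast is safe because it is orthogonal to how params are sharded.
    return torch.autocast(device_type=device_type, dtype=torch.bfloat16)
```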



@tianyu-l tianyu-l left a comment


Have you verified it works properly? Could you show evidence, in terms of param / activation / grad dtype, and throughput comparison with mixed precision off?

I vaguely remember that I've tried it before and it didn't work as expected.


eous commented Jan 20, 2026

> Have you verified it works properly? Could you show evidence, in terms of param / activation / grad dtype, and throughput comparison with mixed precision off?
>
> I vaguely remember that I've tried it before and it didn't work as expected.

[image: weight_diff_analysis_s3]

https://huggingface.co/eousphoros/persona_eta_20b_131k This model was trained with TP=4 and no FSDP. The output with autocast was in line with what I expected, though I lack the depth of knowledge to formally confirm this.
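
For reference, a minimal sketch of how that param / activation / grad dtype evidence could be collected. This is an illustrative single-device check, not the exact run behind the model above; TP sharding does not change which dtypes autocast produces:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512).cuda()  # params stay fp32
dtypes = {}
# Forward hook records the activation dtype (returns None, so the output
# of the module is left untouched).
model.register_forward_hook(lambda mod, inp, out: dtypes.update(activation=out.dtype))

x = torch.randn(4, 512, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()
loss.backward()

print("param     :", next(model.parameters()).dtype)        # torch.float32
print("activation:", dtypes["activation"])                  # torch.bfloat16
print("grad      :", next(model.parameters()).grad.dtype)   # torch.float32
```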

