Conversation

tianshijing (Contributor):

What does this PR do?

This PR adds carefully designed distributed training support for the Muon optimizer:

  1. Distributed Training Support: gradient synchronization via reduce_scatter_tensor and parameter updates via all_gather_into_tensor, so Muon works correctly under distributed training (see the sketch after this list).
  2. Performance Optimization: communication-computation overlap using asynchronous collectives when overlap_comm=True is enabled.
  3. Memory Efficiency: communication buffers are allocated only in distributed mode, and gradients are sharded to minimize memory usage.
  4. Robustness: stronger error handling, with assertions that Muon parameters are 2D and better handling of None gradients.
  5. Backward Compatibility: the original behavior is preserved in the non-distributed case while the distributed path is added alongside it.
    Fixes # (issue)
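
The sketch below is not the PR's code; it is a minimal illustration of the reduce_scatter_tensor / all_gather_into_tensor update pattern and the optional asynchronous overlap described in points 1 and 2, assuming a single process group and a 2D parameter whose first dimension divides evenly by the world size. The helper orthogonalized_update is a hypothetical stand-in for Muon's Newton-Schulz step.

    import torch
    import torch.distributed as dist


    def orthogonalized_update(grad_shard: torch.Tensor) -> torch.Tensor:
        # Hypothetical placeholder for Muon's Newton-Schulz orthogonalization.
        return grad_shard / (grad_shard.norm() + 1e-7)


    @torch.no_grad()
    def distributed_muon_step(param: torch.Tensor, lr: float, overlap_comm: bool = False) -> None:
        # Muon updates are defined for matrix-shaped parameters only.
        assert param.ndim == 2, "Muon expects 2D parameters."
        if param.grad is None:  # skip parameters that received no gradient
            return

        world_size = dist.get_world_size()
        rank = dist.get_rank()
        rows_per_rank = param.shape[0] // world_size  # assumes even divisibility

        # 1) Gradient synchronization: reduce-scatter leaves each rank with the
        #    summed gradient for its own row shard only.
        grad_shard = torch.empty(
            rows_per_rank, param.shape[1], device=param.device, dtype=param.dtype
        )
        work = dist.reduce_scatter_tensor(
            grad_shard, param.grad.contiguous(), op=dist.ReduceOp.SUM, async_op=overlap_comm
        )
        if overlap_comm and work is not None:
            # With async_op=True, independent computation could be scheduled here
            # before waiting on the communication handle.
            work.wait()
        grad_shard.div_(world_size)  # turn the sum into an average

        # 2) Local update on this rank's shard (the real Muon math would go here).
        updated_shard = param[rank * rows_per_rank:(rank + 1) * rows_per_rank].clone()
        updated_shard.add_(orthogonalized_update(grad_shard), alpha=-lr)

        # 3) Parameter update: all-gather reassembles the full parameter on every rank.
        dist.all_gather_into_tensor(param.data, updated_shard)

In the non-distributed case the same step would simply skip the collectives and apply the update to the full gradient, which is how backward compatibility (point 5) can be preserved.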

Review comment from a contributor on the added logging call:

    logger.info_rank0(
        f"Using Muon optimizer with {len(muon_params)} Muon params and {len(adamw_params)} AdamW params."
        f"Using Muon optimizer with {len(muon_params)} Muon params and {len(adamw_params)} AdamW params. "
This looks duplicated.
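
If the second f-string fragment was indeed added by mistake, the resolution would presumably be to keep a single message (a sketch, not the PR's final change):

    logger.info_rank0(
        f"Using Muon optimizer with {len(muon_params)} Muon params and {len(adamw_params)} AdamW params."
    )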

@hiyouga added the "pending (This problem is yet to be addressed)" label on Jun 16, 2025.