Tensor-parallelize the DeepSeek V3 transformer layer #4062


Merged: 13 commits merged into main from wjy/parallel on Apr 19, 2025

Conversation

@wujingyue (Collaborator) commented on Mar 12, 2025


github-actions bot commented Mar 12, 2025

Review updated until commit 98cadce

Description

  • Added multidevice test for DeepSeek V3 transformer layer

  • Parallelized the transformer layer using Rowwise and Colwise parallelism (see the sketch after this list)

  • Moved test from test_deepseek_v3.py to multidevice/test_deepseek_v3.py
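
For reviewers unfamiliar with the scheme, row-wise/column-wise sharding follows the standard tensor-parallel layout for a transformer MLP: the input projections are split column-wise and the output projection row-wise, so the whole block needs only one all-reduce. The sketch below illustrates the idea with PyTorch's tensor-parallel API; the module names, sizes, and parallelization plan are illustrative assumptions, not necessarily the PR's actual code.

import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class ToyMlp(nn.Module):
    # Hypothetical stand-in for the MLP block being sharded; the real test
    # operates on the Hugging Face DeepSeek V3 transformer layer.
    def __init__(self, hidden=1024, intermediate=4096):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))


# Assumes the default process group is already initialized, one rank per GPU.
mesh = dist.device_mesh.init_device_mesh("cuda", [dist.get_world_size()])

# Shard the input projections column-wise and the output projection row-wise,
# so the partial results need only a single all-reduce at the MLP output.
mlp = parallelize_module(
    ToyMlp().cuda(),
    mesh,
    {
        "gate_proj": ColwiseParallel(),
        "up_proj": ColwiseParallel(),
        "down_proj": RowwiseParallel(),
    },
)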


Changes walkthrough 📝

Relevant files

Enhancement: tests/python/multidevice/test_deepseek_v3.py (+143/-0)
Add multidevice test for DeepSeek V3 transformer layer

  • Added new test file for multidevice testing of the DeepSeek V3 transformer layer
  • Implemented setup_process_group fixture for initializing the process group
  • Added default_tensor_type context manager for setting the default tensor type and device
  • Implemented test_transformer_layer to test the parallelized transformer layer

Other: tests/python/test_deepseek_v3.py (+0/-60)
Remove old test_transformer_layer

  • Removed old test_transformer_layer function
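
For readers who have not opened the new file, the two helpers named above could look roughly like the sketch below. This is a hedged illustration built only from standard pytest and PyTorch APIs; the actual fixture derives rank and world size from a communicator object (quoted later in the review guide), and other details may differ.

import contextlib
import os

import pytest
import torch
import torch.distributed as dist


@pytest.fixture(scope="module")
def setup_process_group():
    # Hypothetical plumbing: the real test obtains rank and world size from a
    # communicator object rather than from environment variables.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://localhost:29500",
        rank=rank,
        world_size=world_size,
    )
    yield
    dist.destroy_process_group()


@contextlib.contextmanager
def default_tensor_type(dtype=torch.float32, device="cuda"):
    # Temporarily switch the default dtype and device, then restore them.
    previous_dtype = torch.get_default_dtype()
    torch.set_default_dtype(dtype)
    torch.set_default_device(device)
    try:
        yield
    finally:
        torch.set_default_dtype(previous_dtype)
        torch.set_default_device("cpu")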

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Timeout Risk

The test timed out once when downloading the model configuration. This could be a transient issue, but it's worth investigating to ensure it doesn't happen consistently.

# This test timed out once when downloading
# "/deepseek-ai/DeepSeek-V3/resolve/main/configuration_deepseek.py" (cf.
# http://nv/eCm). I consider this a one-off, but please let me know if this
# error becomes consistent.
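
If the flakiness recurs, one common mitigation is to warm the Hugging Face cache once in CI and then load the configuration offline so the test never re-downloads it. A sketch assuming the test loads the config via transformers' AutoConfig, which may not match the actual loading path:

import os

from transformers import AutoConfig

# Hypothetical mitigation: once configuration_deepseek.py is in the local
# cache, force offline loading so a slow or flaky network cannot time out.
os.environ.setdefault("HF_HUB_OFFLINE", "1")
config = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-V3",
    trust_remote_code=True,   # the config class lives in remote code
    local_files_only=True,    # fail fast instead of re-downloading
)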
Hardcoded Port

The default port for the process group initialization is hardcoded. This could lead to conflicts if multiple tests are run simultaneously. Consider using a dynamic port assignment.

backend="nccl",
init_method="tcp://localhost:29500",
world_size=communicator.size(),
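
To make the suggestion concrete, the port could be taken from the launcher instead of being fixed at 29500; all ranks must agree on it, so reading MASTER_PORT (which torchrun and similar launchers set) is the usual approach. A sketch assuming a plain torch.distributed setup; the communicator wiring in the actual test is not modeled here:

import os

import torch.distributed as dist

# Hypothetical: every rank must use the same port, so take it from the
# launcher's MASTER_PORT rather than hardcoding 29500.
port = os.environ.get("MASTER_PORT", "29500")
dist.init_process_group(
    backend="nccl",
    init_method=f"tcp://localhost:{port}",
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
)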
Device Mesh Initialization

The device mesh is initialized with a hardcoded device type ("cuda"). This could be problematic if the test is run in an environment without CUDA support. Consider making the device type configurable.

mesh = dist.device_mesh.init_device_mesh("cuda", [d])
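
One way to act on this, sketched below under the assumption that the test only needs torch.distributed, is to pick the device type at runtime and reuse it for both the mesh and the backend:

import torch
import torch.distributed as dist

# Hypothetical: choose the device type based on availability instead of
# hardcoding "cuda", and pick a matching backend.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
backend = "nccl" if device_type == "cuda" else "gloo"

d = dist.get_world_size()  # assumes the process group is already initialized
mesh = dist.device_mesh.init_device_mesh(device_type, [d])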

@wujingyue changed the base branch from main to wjy/v3 on March 12, 2025 06:21

@wujingyue (Collaborator, Author) commented:

!test

@wujingyue requested a review from syed-ahmed on March 13, 2025 21:29

Base automatically changed from wjy/v3 to main on March 14, 2025 16:04

@wujingyue (Collaborator, Author) commented:

!test

@wujingyue requested a review from kevinstephano on April 11, 2025 20:36

@kevinstephano (Collaborator) left a comment:

LGTM.

@wujingyue (Collaborator, Author) commented:

!test

@wujingyue (Collaborator, Author) commented:

!test

@wujingyue (Collaborator, Author) commented:

!test

@wujingyue merged commit c969903 into main on Apr 19, 2025
28 of 29 checks passed
@wujingyue deleted the wjy/parallel branch on April 19, 2025 04:10