fixing max_vect_factor initialization #4076

Open · jjsjann123 wants to merge 4 commits into main

Conversation

jjsjann123 (Collaborator) commented Mar 14, 2025

This fixes the performance issue observed in embedding fwd, where an int64_t index input that didn't participate in vectorization ended up capping our max_vect_factor at 2.

This restores the performance mentioned in #4048.

Embedding forward nvfuser design doc.
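
As a back-of-the-envelope illustration of the description above (standalone C++, not nvfuser code; the byte sizes are simply those of the tensors in this scenario), the dtype-size bound on max_vect_factor changes like this once only vectorizable inputs/outputs are considered:

    #include <cstdint>
    #include <iostream>

    int main() {
      constexpr int64_t kSixteen = 16; // bytes per vectorized access

      // Before the fix: the bound used the largest dtype among all inputs, so
      // the 8-byte int64_t index tensor (never vectorized) capped it at 2.
      int64_t max_input_dtype_size = 8;
      std::cout << kSixteen / max_input_dtype_size << "\n"; // prints 2

      // After the fix: only vectorizable inputs/outputs are considered, so the
      // 4-byte float tensors dominate and the bound becomes 4.
      int64_t max_dtype_size_for_vectorization = 4;
      std::cout << kSixteen / max_dtype_size_for_vectorization << "\n"; // prints 4
      return 0;
    }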

jjsjann123 (Collaborator, Author) commented:

!test --pybench

jjsjann123 requested a review from liqiangxl (March 14, 2025 01:59)

github-actions bot commented Mar 14, 2025

Review updated until commit 3896fc7

Description

  • Fixes max_vect_factor initialization for better performance

  • Updates test cases to reflect new vectorization factor


Changes walkthrough 📝

Relevant files

Enhancement
  csrc/scheduler/pointwise.cpp — Improve max_vect_factor calculation (+8/-1)
    • Added calculation of max_dtype_size_for_vectorization
    • Updated max_vect_factor calculation to use max_dtype_size_for_vectorization

Tests
  tests/cpp/test_index_select.cpp — Update test vectorization checks (+3/-3)
    • Updated vectorization checks in test cases to expect a factor of 4

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Performance Impact

Ensure that the new calculation for max_vect_factor does not introduce performance regressions for other data types or scenarios.

    int64_t max_dtype_size_for_vectorization = 1;
    for (auto inp : vectorizable_inputs_outputs_entry.get()) {
      max_dtype_size_for_vectorization = std::max(
          max_dtype_size_for_vectorization,
          (int64_t)dataTypeSize(inp->getDataType().value(), index_type));
    }
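
Regarding the "other data types" concern: the max above still runs over every vectorizable input and output, so a wide vectorizable dtype keeps lowering the factor. A minimal sketch (plain C++ with hypothetical tensor dtypes, not the scheduler code):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
      constexpr int64_t kSixteen = 16; // bytes per vectorized access

      // Hypothetical vectorizable tensors: a float input (4 B) and a double
      // output (8 B). A non-vectorizable int64_t index would be excluded.
      std::vector<int64_t> vectorizable_dtype_sizes = {4, 8};

      int64_t max_dtype_size_for_vectorization = 1;
      for (int64_t s : vectorizable_dtype_sizes) {
        max_dtype_size_for_vectorization =
            std::max(max_dtype_size_for_vectorization, s);
      }
      // The 8-byte vectorizable output still caps the bound at 16 / 8 = 2.
      std::cout << kSixteen / max_dtype_size_for_vectorization << "\n"; // prints 2
      return 0;
    }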
    
Test Coverage

Verify that the updated test cases cover a variety of scenarios and data types to ensure the fix is robust.

      checkIndexSelectVectorization(executor_cache, 4, true, false);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    TEST_F(NVFuserTest, IndexSelectVectorizationLookupTensorCase1) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeContigTensor(2);
      fusion.addInput(tv0);
      auto tv1 = makeContigTensor(1, DataType::Int);
      fusion.addInput(tv1);
    
      // slicing on fastest dimension
      // This will map to vectorized load on lookup tensor tv0
      // But it shouldn't have any vectorized load on lookup tensor
      auto tv2 = indexSelect(tv0, 1, tv1);
      fusion.addOutput(tv2);
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_i = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
    
      std::vector<int64_t> shape1({1028, 1024});
      std::vector<int64_t> shape2({512});
      auto t0 = at::randn(shape1, options);
      auto t1 = at::randint(0, shape1[1], shape2, options_i);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      // lookup tv [ 1028, 1024 ]
      // index  tv [ 512]
      // output tv [ 1028, 512 ]
      // output tv and index tv share the innermost dimension 512. We'll have
      // vectorized store and load on index tv
      checkIndexSelectVectorization(executor_cache, 2, false, true);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    TEST_F(NVFuserTest, IndexSelectVectorizationIndexTensor) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeContigTensor(2);
      fusion.addInput(tv0);
      auto tv1 = makeContigTensor(1, DataType::Int);
      fusion.addInput(tv1);
    
      auto tv2 = indexSelect(tv0, 0, tv1);
      fusion.addOutput(tv2);
    
      // Indices dimension as the fastest dimension
      // This will map to vectorized load on indices tensor tv1.
      tv2->setAllocationDomain({tv2->axis(1), tv2->axis(0)}, true);
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_i = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
      std::vector<int64_t> shape1({1024, 1024});
      std::vector<int64_t> shape2({768});
      auto t0 = at::randn(shape1, options);
      auto t1 = at::randint(0, shape1[0], shape2, options_i);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      // lookup tv [ 1024, 1024 ]
      // index  tv [ 768 ]
      // output tv [ 768,  1024 ] (stride [1, 768])
      // output tv and index tv share the innermost dimension 768. We'll have
      // vectorized store and load on index tv
      checkIndexSelectVectorization(executor_cache, 2, false, true);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    TEST_F(NVFuserTest, IndexSelectVectorizationIndexTensorNoBroadcast) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeContigTensor(2);
      fusion.addInput(tv0);
      auto tv1 = TensorViewBuilder()
                     .ndims(2)
                     .shape({-1, 1})
                     .dtype(DataType::Int)
                     .contiguity({true, std::nullopt})
                     .build();
      fusion.addInput(tv1);
    
      auto tv2 = indexSelect(tv0, 0, tv1);
      fusion.addOutput(tv2);
    
      // Indices dimension as the fastest dimension
      // This will map to vectorized load on indices tensor tv1.
      tv2->setAllocationDomain({tv2->axis(1), tv2->axis(0)}, true);
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_i = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
      std::vector<int64_t> shape1({1024, 1024});
      std::vector<int64_t> shape2({768, 1});
      auto t0 = at::randn(shape1, options);
      auto t1 = at::randint(0, shape1[0], shape2, options_i);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      // lookup tv [ 1024, 1024 ]
      // index  tv [ 768, 1 ]
      // output tv [ 768,  1024 ] (stride [1, 768])
      // output tv and index tv share the innermost dimension 768. We'll have
      // vectorized store and load on index tv
      checkIndexSelectVectorization(executor_cache, 2, false, true);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    TEST_F(NVFuserTest, IndexSelectVectorization3DCase0) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeContigTensor(3);
      fusion.addInput(tv0);
      auto tv1 = makeContigTensor(1, DataType::Int);
      fusion.addInput(tv1);
    
      auto tv2 = indexSelect(tv0, 0, tv1);
      fusion.addOutput(tv2);
    
      tv2->setAllocationDomain({tv2->axis(0), tv2->axis(2), tv2->axis(1)}, true);
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_i = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
    
      std::vector<int64_t> shape1({1024, 256, 4});
      std::vector<int64_t> shape2({768});
      auto t0 = at::randn(shape1, options);
      auto t1 = at::randint(0, shape1[0], shape2, options_i);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      // lookup tv [ 1024, 256, 4 ]
      // index  tv [ 768 ]
      // output tv [ 768,  256, 4 ] (stride [ 1024, 1, 256 ])
      // output tv doesn't share the innermost dimension with inputs. We'll have
      // vectorized store only
      checkIndexSelectVectorization(executor_cache, 4, false, false);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    TEST_F(NVFuserTest, IndexSelectVectorization3DCase1) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = TensorViewBuilder()
                     .ndims(3)
                     .contiguity({true, true, true})
                     .strideOrder({2, 0, 1})
                     .build();
      fusion.addInput(tv0);
      auto tv1 = makeContigTensor(1, DataType::Int);
      fusion.addInput(tv1);
    
      auto tv2 = indexSelect(tv0, 0, tv1);
      fusion.addOutput(tv2);
    
      tv2->setAllocationDomain({tv2->axis(0), tv2->axis(2), tv2->axis(1)}, true);
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_i = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
    
      std::vector<int64_t> shape1({1024, 256, 4});
      std::vector<int64_t> shape2({768});
      auto t0 = at::randn(shape1, options).as_strided(shape1, {256 * 4, 1, 256});
      auto t1 = at::randint(0, shape1[0], shape2, options_i);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      // lookup tv [ 1024, 256, 4 ] (stride [ 1024, 1, 256 ])
      // index  tv [ 768 ]
      // output tv [ 768,  256, 4 ] (stride [ 1024, 1, 256 ])
      // output tv and lookup tv share the innermost dimension 1024. We'll have
      // vectorized store and load on lookup tv
      checkIndexSelectVectorization(executor_cache, 4, true, false);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }

Review thread on csrc/scheduler/pointwise.cpp:

      constexpr int64_t kSixteen = 16; // clang tidy
      auto max_vect_factor = ceilDiv(
          // Available vectorization based on size of data type
    -     (int64_t)kSixteen / max_input_dtype_size,
    +     (int64_t)kSixteen / max_dtype_size_for_vectorization,
jjsjann123 (Collaborator, Author) commented:

I think I should use this only for max_vect_factor?

I see max_input_dtype_size is also used for unroll, so I should leave that as-is and not replace it with max_dtype_size_for_vectorization.
cc'ing @liqiangxl

Collaborator commented:

I looked at some parts of the code, such as getEmpiricalUnrollFactor. It seems it shouldn't use the max size, but rather the size of each individual vectorized input.
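
For illustration, the difference between the two choices can be sketched with a toy example (the byte budget and dtype sizes below are hypothetical; this is not nvfuser's actual getEmpiricalUnrollFactor logic):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
      constexpr int64_t kBudgetBytes = 32;       // hypothetical per-thread byte budget
      std::vector<int64_t> dtype_sizes = {2, 8}; // e.g. a half input and an int64_t input

      // Using the single largest dtype size yields one conservative factor.
      int64_t max_size = 1;
      for (int64_t s : dtype_sizes) {
        max_size = std::max(max_size, s);
      }
      std::cout << "max-based unroll: " << kBudgetBytes / max_size << "\n"; // 4

      // Using each input's own dtype size yields a per-input factor instead.
      for (int64_t s : dtype_sizes) {
        std::cout << "per-input unroll (" << s << " B): " << kBudgetBytes / s
                  << "\n"; // 16 and 4
      }
      return 0;
    }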

jjsjann123 (Collaborator, Author) commented:

!test

jjsjann123 (Collaborator, Author) commented:

!test

jjsjann123 marked this pull request as ready for review March 22, 2025 00:07

jjsjann123 (Collaborator, Author) commented:

[benchmark results screenshot]

This PR fixed the vectorization factor to 4 and improved the performance of our index_select benchmark.

jjsjann123 requested a review from naoyam (March 22, 2025 00:09)
Review thread on csrc/scheduler/pointwise.cpp:

    @@ -295,10 +295,17 @@ std::unique_ptr<PointwiseParams> getPointwiseHeuristics(
              largest_out, true, true));
        });

      int64_t max_dtype_size_for_vectorization = 1;
      for (auto inp : vectorizable_inputs_outputs_entry.get()) {
Collaborator commented:

This seems to be an obvious mistake. It seems the transpose scheduler is fine, but the reduction scheduler is not. Can you fix that as well?
