fixing max_vect_factor initialization #4076

Open · jjsjann123 wants to merge 4 commits into main

Conversation

jjsjann123 (Collaborator) commented Mar 14, 2025

This fixes the performance issue observed in embedding fwd, where an int64_t index input that didn't participate in vectorization ended up capping our max_vect_factor at 2.

This restores the performance mentioned in #4048.

Embedding forward nvfuser design doc.
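
As a back-of-the-envelope illustration of the description above (standalone C++, not nvfuser code; the byte sizes are simply those of the tensors in this scenario), the dtype-size bound on max_vect_factor changes like this once only vectorizable inputs/outputs are considered:

    #include <cstdint>
    #include <iostream>

    int main() {
      constexpr int64_t kSixteen = 16; // bytes per vectorized access

      // Before the fix: the bound used the largest dtype among all inputs, so
      // the 8-byte int64_t index tensor (never vectorized) capped it at 2.
      int64_t max_input_dtype_size = 8;
      std::cout << kSixteen / max_input_dtype_size << "\n"; // prints 2

      // After the fix: only vectorizable inputs/outputs are considered, so the
      // 4-byte float tensors dominate and the bound becomes 4.
      int64_t max_dtype_size_for_vectorization = 4;
      std::cout << kSixteen / max_dtype_size_for_vectorization << "\n"; // prints 4
      return 0;
    }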

jjsjann123 (Collaborator, Author) commented:

!test --pybench

jjsjann123 requested a review from liqiangxl (March 14, 2025 01:59)

github-actions bot commented Mar 14, 2025

Review updated until commit 3896fc7

Description

  • Fixes max_vect_factor initialization for better performance

  • Updates test cases to reflect new vectorization factor


Changes walkthrough 📝

Relevant files

Enhancement
  csrc/scheduler/pointwise.cpp — Improve max_vect_factor calculation (+8/-1)
    • Added calculation of max_dtype_size_for_vectorization
    • Updated max_vect_factor calculation to use max_dtype_size_for_vectorization

Tests
  tests/cpp/test_index_select.cpp — Update test vectorization checks (+3/-3)
    • Updated vectorization checks in test cases to expect a factor of 4

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Performance Impact

Ensure that the new calculation for max_vect_factor does not introduce performance regressions for other data types or scenarios.

    int64_t max_dtype_size_for_vectorization = 1;
    for (auto inp : vectorizable_inputs_outputs_entry.get()) {
      max_dtype_size_for_vectorization = std::max(
          max_dtype_size_for_vectorization,
          (int64_t)dataTypeSize(inp->getDataType().value(), index_type));
    }
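
Regarding the "other data types" concern: the max above still runs over every vectorizable input and output, so a wide vectorizable dtype keeps lowering the factor. A minimal sketch (plain C++ with hypothetical tensor dtypes, not the scheduler code):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
      constexpr int64_t kSixteen = 16; // bytes per vectorized access

      // Hypothetical vectorizable tensors: a float input (4 B) and a double
      // output (8 B). A non-vectorizable int64_t index would be excluded.
      std::vector<int64_t> vectorizable_dtype_sizes = {4, 8};

      int64_t max_dtype_size_for_vectorization = 1;
      for (int64_t s : vectorizable_dtype_sizes) {
        max_dtype_size_for_vectorization =
            std::max(max_dtype_size_for_vectorization, s);
      }
      // The 8-byte vectorizable output still caps the bound at 16 / 8 = 2.
      std::cout << kSixteen / max_dtype_size_for_vectorization << "\n"; // prints 2
      return 0;
    }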
    
Test Coverage

Verify that the updated test cases cover a variety of scenarios and data types to ensure the fix is robust.

      checkIndexSelectVectorization(executor_cache, 4, true, false);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    TEST_F(NVFuserTest, IndexSelectVectorizationLookupTensorCase1) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeContigTensor(2);
      fusion.addInput(tv0);
      auto tv1 = makeContigTensor(1, DataType::Int);
      fusion.addInput(tv1);
    
      // slicing on fastest dimension
      // This will map to vectorized load on lookup tensor tv0
      // But it shouldn't have any vectorized load on lookup tensor
      auto tv2 = indexSelect(tv0, 1, tv1);
      fusion.addOutput(tv2);
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_i = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
    
      std::vector<int64_t> shape1({1028, 1024});
      std::vector<int64_t> shape2({512});
      auto t0 = at::randn(shape1, options);
      auto t1 = at::randint(0, shape1[1], shape2, options_i);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      // lookup tv [ 1028, 1024 ]
      // index  tv [ 512]
      // output tv [ 1028, 512 ]
      // output tv and index tv share the innermost dimension 512. We'll have
      // vectorized store and load on index tv
      checkIndexSelectVectorization(executor_cache, 2, false, true);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    TEST_F(NVFuserTest, IndexSelectVectorizationIndexTensor) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeContigTensor(2);
      fusion.addInput(tv0);
      auto tv1 = makeContigTensor(1, DataType::Int);
      fusion.addInput(tv1);
    
      auto tv2 = indexSelect(tv0, 0, tv1);
      fusion.addOutput(tv2);
    
      // Indices dimension as the fastest dimension
      // This will map to vectorized load on indices tensor tv1.
      tv2->setAllocationDomain({tv2->axis(1), tv2->axis(0)}, true);
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_i = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
      std::vector<int64_t> shape1({1024, 1024});
      std::vector<int64_t> shape2({768});
      auto t0 = at::randn(shape1, options);
      auto t1 = at::randint(0, shape1[0], shape2, options_i);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      // lookup tv [ 1024, 1024 ]
      // index  tv [ 768 ]
      // output tv [ 768,  1024 ] (stride [1, 768])
      // output tv and index tv share the innermost dimension 768. We'll have
      // vectorized store and load on index tv
      checkIndexSelectVectorization(executor_cache, 2, false, true);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    TEST_F(NVFuserTest, IndexSelectVectorizationIndexTensorNoBroadcast) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeContigTensor(2);
      fusion.addInput(tv0);
      auto tv1 = TensorViewBuilder()
                     .ndims(2)
                     .shape({-1, 1})
                     .dtype(DataType::Int)
                     .contiguity({true, std::nullopt})
                     .build();
      fusion.addInput(tv1);
    
      auto tv2 = indexSelect(tv0, 0, tv1);
      fusion.addOutput(tv2);
    
      // Indices dimension as the fastest dimension
      // This will map to vectorized load on indices tensor tv1.
      tv2->setAllocationDomain({tv2->axis(1), tv2->axis(0)}, true);
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_i = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
      std::vector<int64_t> shape1({1024, 1024});
      std::vector<int64_t> shape2({768, 1});
      auto t0 = at::randn(shape1, options);
      auto t1 = at::randint(0, shape1[0], shape2, options_i);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      // lookup tv [ 1024, 1024 ]
      // index  tv [ 768, 1 ]
      // output tv [ 768,  1024 ] (stride [1, 768])
      // output tv and index tv share the innermost dimension 768. We'll have
      // vectorized store and load on index tv
      checkIndexSelectVectorization(executor_cache, 2, false, true);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    TEST_F(NVFuserTest, IndexSelectVectorization3DCase0) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeContigTensor(3);
      fusion.addInput(tv0);
      auto tv1 = makeContigTensor(1, DataType::Int);
      fusion.addInput(tv1);
    
      auto tv2 = indexSelect(tv0, 0, tv1);
      fusion.addOutput(tv2);
    
      tv2->setAllocationDomain({tv2->axis(0), tv2->axis(2), tv2->axis(1)}, true);
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_i = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
    
      std::vector<int64_t> shape1({1024, 256, 4});
      std::vector<int64_t> shape2({768});
      auto t0 = at::randn(shape1, options);
      auto t1 = at::randint(0, shape1[0], shape2, options_i);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      // lookup tv [ 1024, 256, 4 ]
      // index  tv [ 768 ]
      // output tv [ 768,  256, 4 ] (stride [ 1024, 1, 256 ])
      // output tv doesn't share the innermost dimension with inputs. We'll have
      // vectorized store only
      checkIndexSelectVectorization(executor_cache, 4, false, false);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    TEST_F(NVFuserTest, IndexSelectVectorization3DCase1) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = TensorViewBuilder()
                     .ndims(3)
                     .contiguity({true, true, true})
                     .strideOrder({2, 0, 1})
                     .build();
      fusion.addInput(tv0);
      auto tv1 = makeContigTensor(1, DataType::Int);
      fusion.addInput(tv1);
    
      auto tv2 = indexSelect(tv0, 0, tv1);
      fusion.addOutput(tv2);
    
      tv2->setAllocationDomain({tv2->axis(0), tv2->axis(2), tv2->axis(1)}, true);
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_i = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
    
      std::vector<int64_t> shape1({1024, 256, 4});
      std::vector<int64_t> shape2({768});
      auto t0 = at::randn(shape1, options).as_strided(shape1, {256 * 4, 1, 256});
      auto t1 = at::randint(0, shape1[0], shape2, options_i);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      // lookup tv [ 1024, 256, 4 ] (stride [ 1024, 1, 256 ])
      // index  tv [ 768 ]
      // output tv [ 768,  256, 4 ] (stride [ 1024, 1, 256 ])
      // output tv and lookup tv share the innermost dimension 1024. We'll have
      // vectorized store and load on lookup tv
      checkIndexSelectVectorization(executor_cache, 4, true, false);
      testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);
    }

Review thread on csrc/scheduler/pointwise.cpp:

      constexpr int64_t kSixteen = 16; // clang tidy
      auto max_vect_factor = ceilDiv(
          // Available vectorization based on size of data type
    -     (int64_t)kSixteen / max_input_dtype_size,
    +     (int64_t)kSixteen / max_dtype_size_for_vectorization,
jjsjann123 (Collaborator, Author) commented:

I think I should use this only for max_vect_factor?

I see max_input_dtype_size is also used for unroll, so I should leave that as-is and not replace it with max_dtype_size_for_vectorization.
cc'ing @liqiangxl

Collaborator commented:

I looked at some parts of the code, such as getEmpiricalUnrollFactor. It seems it shouldn't use the max size, but rather the size of each individual vectorized input.
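
For illustration, the difference between the two choices can be sketched with a toy example (the byte budget and dtype sizes below are hypothetical; this is not nvfuser's actual getEmpiricalUnrollFactor logic):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
      constexpr int64_t kBudgetBytes = 32;       // hypothetical per-thread byte budget
      std::vector<int64_t> dtype_sizes = {2, 8}; // e.g. a half input and an int64_t input

      // Using the single largest dtype size yields one conservative factor.
      int64_t max_size = 1;
      for (int64_t s : dtype_sizes) {
        max_size = std::max(max_size, s);
      }
      std::cout << "max-based unroll: " << kBudgetBytes / max_size << "\n"; // 4

      // Using each input's own dtype size yields a per-input factor instead.
      for (int64_t s : dtype_sizes) {
        std::cout << "per-input unroll (" << s << " B): " << kBudgetBytes / s
                  << "\n"; // 16 and 4
      }
      return 0;
    }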

jjsjann123 (Collaborator, Author) commented:

!test

jjsjann123 (Collaborator, Author) commented:

!test

jjsjann123 marked this pull request as ready for review March 22, 2025 00:07

jjsjann123 (Collaborator, Author) commented:

[benchmark results screenshot]

This PR fixed the vectorization factor to 4 and improved the performance of our index_select benchmark.

jjsjann123 requested a review from naoyam (March 22, 2025 00:09)
Review thread on csrc/scheduler/pointwise.cpp:

    @@ -295,10 +295,17 @@ std::unique_ptr<PointwiseParams> getPointwiseHeuristics(
              largest_out, true, true));
        });

      int64_t max_dtype_size_for_vectorization = 1;
      for (auto inp : vectorizable_inputs_outputs_entry.get()) {
Collaborator commented:

This seems to be an obvious mistake. It seems the transpose scheduler is fine, but the reduction scheduler is not. Can you fix that as well?
