
[WIP] Basic latency tests #4053

Draft · wants to merge 3 commits into main
Conversation

csarofeen
Collaborator

No description provided.


github-actions bot commented Mar 9, 2025

Description

  • Refactor MatmulOp::evaluate to simplify output handling.

  • Rename DistributedTensor to Sharding and update related methods.

  • Modify FusionDefinition::execute to return output shardings.

  • Add latency tests for expression evaluation.


Changes walkthrough 📝

Relevant files

Enhancement (14 files):

  • nodes.cpp — Simplify MatmulOp::evaluate output handling (+16/-16)
  • distributed_tensor.cpp — Rename DistributedTensor to Sharding (+2/-3)
  • fusion_definition.cpp — Modify execute to return output shardings (+61/-41)
  • multidevice_bindings.cpp — Update bindings for Sharding (+3/-6)
  • python_bindings.cpp — Update execute method to handle output shardings (+16/-6)
  • __init__.py — Update execute method signature (+11/-6)
  • test_communication.py — Update execute method calls (+8/-8)
  • test_dtensor.py — Update execute method calls and handle output shardings (+14/-17)
  • test_multidevice.py — Update execute method calls and handle output shardings (+35/-37)
  • instrumentation.h — Remove event tracing methods (+3/-16)
  • distributed_tensor.h — Rename DistributedTensor to Sharding (+6/-17)
  • fusion_definition.h — Update execute method signature (+10/-5)
  • executor_kernel_arg.h — Add vector method to KernelArgumentHolder (+4/-0)
  • fusion_kernel_runtime.h — Add NVF_API to FusionKernelRuntime class (+1/-1)

Tests (2 files):

  • test_matmul_perf.cpp — Add latency tests for expression evaluation (+301/-0)
  • CMakeLists.txt — Add test_matmul_perf.cpp to JIT_TEST_SRCS (+1/-0)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Code Comment

The commented-out code in MatmulOp::evaluate might be useful for future development or debugging. Consider keeping it or documenting why it was removed.

// if (const auto rfactor_did_idx = getRFactorDeviceDimensionIndex(out());
//     rfactor_did_idx != -1) {
//   matmul_out = matmul_out.unsqueeze(rfactor_did_idx);
// }

// const auto& [sizes, strides] = inferShapeOfOutput(out(), ee);
// auto meta_out = at::detail::empty_strided_meta(sizes, strides, a.dtype());

// if (meta_out.is_contiguous()) {
//   return {matmul_out};
// }

// auto strided_matmul_out = at::empty_strided(sizes, strides, a.options());
// strided_matmul_out = strided_matmul_out.copy_(matmul_out);
// return {strided_matmul_out};
Performance Concern

The performance test in tests/cpp/test_matmul_perf.cpp should be evaluated against a baseline to ensure that the changes do not introduce performance regressions.

      os << "\n";
    }
  }
  os << std::endl;
}

namespace {
// Returns the output shardings of the given fusion. As a shortcut, if none of
// the outputs has a device mesh, returns an empty vector indicating single-GPU
// execution.
std::vector<Sharding> getOutputShardings(Fusion* fusion) {
  std::vector<Sharding> output_shardings;
  if (std::none_of(
          fusion->outputs().begin(), fusion->outputs().end(), [](Val* v) {
            if (auto* tv = dynamic_cast<TensorView*>(v)) {
              return tv->hasDeviceMesh();
            }
            return false;
          })) {
    return output_shardings;
  }

  output_shardings.reserve(fusion->outputs().size());
  for (Val* out_val : fusion->outputs()) {
    if (auto* out_tv = dynamic_cast<TensorView*>(out_val)) {
      if (fusion->getOutputAlias(out_tv).hide_output) {
        continue;
      }
      const DeviceMesh& mesh = out_tv->getDeviceMesh();
      Sharding& output_sharding = output_shardings.emplace_back(mesh);
      if (mesh.size() > 0) {
        for (const ParallelType parallel_type : kParallelTypeDIDs) {
          if (const auto axis = getShardedLogicalAxis(out_tv, parallel_type);
              axis != -1) {
            output_sharding.setAxisIsShardedOn(axis, parallel_type);
          }
        }
      }
    } else {
      output_shardings.emplace_back(DeviceMesh());
    }
  }

  return output_shardings;
}
} // namespace

std::pair<KernelArgumentHolder, std::vector<Sharding>> FusionDefinition::
    execute(
        KernelArgumentHolder args,
        std::optional<int8_t> selected_device,
        bool override_user_schedule,
        bool capture_debug_output,
        bool profile,
        std::vector<std::string> _enable_options,
        std::vector<std::string> _disable_options) const {
  debug_output_ = std::nullopt;
  std::stringstream debug_ss;
  DebugStreamGuard dsg(capture_debug_output ? debug_ss : std::cout);
  args.setDeviceIndex(selected_device);
  NVF_CHECK(id().has_value(), "Valid fusion schedule is not available!");

  auto scheds = fusionCache()->queryFusionSchedules(id().value());

  if (profile) {
    ProfilerOptionsGuard::getCurOptions().set(ProfilerOption::Enable);
  }

  EnableOptionsGuard enable_opt_guard;
  for (const auto& _enable_option : _enable_options) {
    std::optional<EnableOption> opt = stringToEnableOption(_enable_option);
    NVF_CHECK(opt.has_value(), "Unrecognized enable_option: ", _enable_option);
    EnableOptionsGuard::getCurOptions().set(opt.value());
  }

  DisableOptionsGuard disable_opt_guard;
  for (const auto& _disable_option : _disable_options) {
    std::optional<DisableOption> opt = stringToDisableOption(_disable_option);
    NVF_CHECK(
        opt.has_value(), "Unrecognized disable_option: ", _disable_option);
    DisableOptionsGuard::getCurOptions().set(opt.value());
  }

  auto find_user_schedule = [&]() -> const UserSchedule* {
    if (override_user_schedule) {
      return nullptr;
    }

    auto user_sched_id = fusionCache()->queryUserScheduleId(scheds, args);
    if (!user_sched_id.has_value()) {
      return nullptr;
    }

    NVF_CHECK(
        args.empty() || args.getDeviceIndex() > -1,
        "Inputs are not all on the same device or don't match selection!");
    const UserSchedule& user_sched = fusionCache()->queryUserSchedule(
        scheds, user_sched_id.value(), args.getDeviceIndex());
    return &user_sched;
  };
  const auto* user_sched = find_user_schedule();

  KernelArgumentHolder outputs;
  if (user_sched == nullptr) {
    scheds->createExecutorIfNotExists();
    outputs = scheds->auto_gen_schedules->runFusionWithInputs(
        args, std::nullopt, args.getDeviceIndex());
  } else {
    if (isProfilerEnabledWithCupti()) {
      FusionProfiler::start();
      FusionProfiler::createSegments(1);
    }

    scheds->last_user_def_scheduled_ir = user_sched->scheduled_fusion.get();
    scheds->last_user_def_executor = user_sched->executor.get();

    if (user_sched->heuristic_params == nullptr) {
      // Manual schedule
      if (!user_sched->executor->isCompiled()) {
        user_sched->executor->compile(user_sched->scheduled_fusion.get(), args);
      }
      outputs = user_sched->executor->run(args);
    } else {
      // Automatic scheduler was used for UserSchedule.
      // Pass launch and compile params to compileFusion and runFusion.
      if (!user_sched->executor->isCompiled()) {
        user_sched->executor->compile(
            user_sched->scheduled_fusion.get(),
            args,
            user_sched->heuristic_params->lparams,
            user_sched->heuristic_params->cparams,
            user_sched->heuristic_params->scheduler_type);
      }
      outputs = user_sched->executor->run(
          args,
          {},
          user_sched->heuristic_params->lparams,
          user_sched->heuristic_params->cparams);
    }

    if (isProfilerEnabledWithCupti()) {
      FusionProfiler::segment(0).scheduler("user");
      FusionProfiler::stop();
      if (isProfilerPrintingEnabled()) {
        debug() << FusionProfiler::profile();
      }
    }
  }

  if (profile) {
    ProfilerOptionsGuard::getCurOptions().unset(ProfilerOption::Enable);
  }
Functionality Loss

The beginEvent and endEvent methods in the Trace class are now empty, which may remove important tracing functionality. Ensure this is intentional and that tracing is handled elsewhere if necessary.

public:
  using Clock = std::chrono::steady_clock;
  int64_t times_called_ = 0;

csarofeen changed the title from "Basic latency tests" to "[WIP] Basic latency tests" on Mar 10, 2025
Labels: none yet
Projects: none yet
1 participant