
LLVM lowering #4581


Open · wants to merge 33 commits into main

Conversation

wolfcomos
Collaborator

Goal: Improve host IR latency by replacing functions that currently use expression evaluators with LLVM ORC JIT-compiled code. Follow-up to #4431.

Currently:
- Naive support of output tensor shape inference
- Naive support of output stride inference

@jjsjann123 jjsjann123 self-requested a review June 5, 2025 18:17
@wolfcomos
Collaborator Author

!test

1 similar comment
@wolfcomos
Collaborator Author

!test

@jjsjann123
Collaborator

I'm seeing that workflows from a forked repo require maintainer approval... I didn't know that was a thing.

I have already added wolfcomos as a developer in the repo. Do we need to close this and open a new PR from a branch inside this repo in order to trigger CI? Asking @xwang233.

@xwang233
Collaborator

xwang233 commented Jun 5, 2025

> I'm seeing that workflows from a forked repo require maintainer approval... I didn't know that was a thing.
>
> I have already added wolfcomos as a developer in the repo. Do we need to close this and open a new PR from a branch inside this repo in order to trigger CI? Asking @xwang233.

Yes, the PR has to be initiated from a branch in this repo for CI and AI review to work. Sorry about that!

@jjsjann123
Collaborator

> > I'm seeing that workflows from a forked repo require maintainer approval... I didn't know that was a thing.
> > I have already added wolfcomos as a developer in the repo. Do we need to close this and open a new PR from a branch inside this repo in order to trigger CI? Asking @xwang233.
>
> Yes, the PR has to be initiated from a branch in this repo for CI and AI review to work. Sorry about that!

Oh, it's not your fault. I should have known that. @wolfcomos, I'm closing this PR. Would you mind pushing the branch directly to the nvfuser repo and starting the PR from there instead?

@jjsjann123 jjsjann123 closed this Jun 5, 2025
@wolfcomos
Collaborator Author

!test

@xwang233 xwang233 reopened this Jun 5, 2025
@xwang233
Collaborator

xwang233 commented Jun 6, 2025

!test

@jjsjann123
Collaborator

!test

@jjsjann123 jjsjann123 left a comment

Note to myself: need to continue reviewing the transforms and utils.

CMakeLists.txt Outdated
set(BUILD_LLVM ON)
if(BUILD_LLVM)
# Add LLVM JIT related dependencies
set(LLVM_DIR "/usr/lib/llvm-18/lib/cmake/llvm")
Collaborator

Note to myself: do we need to force LLVM_DIR? It's under /usr/lib, so I think it's unnecessary.

Collaborator

Nevertheless, this probably should have been passed as a -D option instead.

Collaborator

If you have to specify the folder, query llvm-config instead of hardcoding it:

$ llvm-config --libdir
/usr/lib/llvm-18/lib

CMakeLists.txt Outdated
if(BUILD_LLVM)
target_include_directories(test_host_ir PRIVATE ${LLVM_INCLUDE_DIRS})
target_compile_definitions(test_host_ir PRIVATE ${LLVM_DEFINITIONS})
target_link_libraries(test_host_ir PUBLIC ${LLVM_LIBS})
Collaborator

My naive question is: why the mismatch between the handling of the two targets?

For codegen_internal, we only added library dependencies, but here we also need the new include dir and compile definitions?

// }


// run with the following command: NVFUSER_ENABLE=host_ir_lowering ./bin/test_host_ir --gtest_filter=HostIrEvaluatorTest.LaunchKernel2
Collaborator

Setting the enable option should be done once for the whole test suite:

class HostIrIntegrationTest : public NVFuserTest {
 protected:
  HostIrIntegrationTest() {
    EnableOptionsGuard::getCurOptions().set(EnableOption::HostIrLowering);
  }
};

@@ -0,0 +1,143 @@

Collaborator

license header.

@@ -0,0 +1,704 @@
#include <host_ir/lower_to_llvm.h>
Collaborator

ditto

print_compare_tensor(output_tensor, t0);
}

// TEST_F(HostIrEvaluatorTest, LaunchKernel3) {
Collaborator

Why is this test commented out?

Comment on lines 48 to 49
tv1->setAllocationDomain({tv1->axis(0), tv1->axis(1), tv1->axis(2)}, {true, true, true});
tv1->printTransforms();
Collaborator

Suggested change
- tv1->setAllocationDomain({tv1->axis(0), tv1->axis(1), tv1->axis(2)}, {true, true, true});
- tv1->printTransforms();
+ tv1->setAllocationDomain(tv1->getLoopDomain(), true);

auto allocation_domain = tv1->getAllocationDomain();
auto logical_domain = tv1->getLogicalDomain();
std::unique_ptr<llvm::orc::LLJIT> JIT = llvm_jit_init(4);
llvm_jit_compile_shape_infer(JIT, fusion, logical_domain, logical_domain);
Collaborator

Shouldn't the third argument be the logical_domain of the input, i.e. tv0->getLogicalDomain()?

llvm_jit_compile_stride_infer(JIT, fusion, allocation_domain, logical_domain);

auto func_infer_shape = ExitOnErr(JIT->lookup("infer_shape"));
auto func_infer_stride = ExitOnErr(JIT->lookup("infer_stride"));
Collaborator

Note: I'm curious how our benchmark was done when we were comparing the vanilla HostIr allocation/execution vs the compiled code path.

Collaborator

nitpick: we would want to encapsulate error checking
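
For example, a tiny helper could wrap the llvm::Expected returned by LLJIT::lookup so call sites don't repeat the check. A minimal sketch (the helper name is hypothetical; NVF_ERROR is the existing nvfuser macro):

#include <llvm/Support/Error.h>

// Hypothetical helper, not part of this PR: unwrap an llvm::Expected or
// raise an nvfuser error carrying the LLVM error message.
template <typename T>
T unwrapOrThrow(llvm::Expected<T> maybe_value, const char* context) {
  NVF_ERROR(
      static_cast<bool>(maybe_value),
      context,
      ": ",
      llvm::toString(maybe_value.takeError()));
  return std::move(*maybe_value);
}

// Call sites then shrink to:
// auto func_infer_shape =
//     unwrapOrThrow(JIT->lookup("infer_shape"), "JIT symbol lookup failed");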

// Generate shape infer llvm module
llvm::orc::ThreadSafeModule generate_infer_shape_module(std::vector<IterDomain*>& input_logical_domain, std::vector<IterDomain*>& output_logical_domain, Fusion& fusion);

/*
Collaborator

We only need the API from this line below. We should hide everything above it in the .cpp file if we don't need to expose it.

llvm_jit_compile_stride_infer(JIT, fusion, allocation_domain, logical_domain);

auto func_infer_shape = ExitOnErr(JIT->lookup("infer_shape"));
auto func_infer_stride = ExitOnErr(JIT->lookup("infer_stride"));
Collaborator

nitpick: we would want to encapsulate error checking

#include <llvm/ExecutionEngine/Orc/IRCompileLayer.h>
#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
#include <llvm/ExecutionEngine/JITLink/JITLink.h>
#include "llvm/Support/Error.h"
Collaborator

We should also be able to hide the majority of the includes, as long as we push the implementation internals into the .cpp file.


// shape infer compile
void llvm_jit_compile_shape_infer(std::unique_ptr<llvm::orc::LLJIT>& JIT, Fusion& fusion, std::vector<IterDomain*>& input_domain, std::vector<IterDomain*>& output_domain){
std::cout << "llvm_jit_compile shape infer" << std::endl;
Collaborator

Nitpick: let's remove the debug print.

If you want to keep these for debugging purposes, you can add a dump option and put the prints inside an if block.
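
For example (sketch only; the dump option name is hypothetical, and this assumes nvfuser's usual isDebugDumpEnabled/debug() gating utilities):

// Gate the print behind a dump option instead of printing unconditionally.
if (isDebugDumpEnabled(DebugDumpOption::HostIrJit)) {
  debug() << "llvm_jit_compile shape infer" << std::endl;
}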

std::cout << "llvm_jit_compile shape infer" << std::endl;
auto TSM_shape = generate_infer_shape_module(input_domain, output_domain, fusion);
if (auto Err = JIT->addIRModule(std::move(TSM_shape))) {
llvm::errs() << "Error adding module to JIT: " << llvm::toString(std::move(Err)) << "\n";
Collaborator

nitpick: do we need to use llvm::errs()? There's NVF_ERROR for that.
https://github.com/NVIDIA/Fuser/blob/main/csrc/exceptions.h

}

// shape infer compile
void llvm_jit_compile_shape_infer(std::unique_ptr<llvm::orc::LLJIT>& JIT, Fusion& fusion, std::vector<IterDomain*>& input_domain, std::vector<IterDomain*>& output_domain){
Collaborator

Is this supposed to be the concatenated full set of IDs across all inputs?

Nit: name it inputs_domain instead.

llvm::Value* input_ptr = &*arg_it;
llvm::Value* output_ptr = &*arg_it+2;
IdModel id_model(&fusion);
const ValGraph& graph = id_model.buildExactGraph();
Collaborator

We don't want to repetitively build the same IdModel. Let's start thinking about a refactor so we can reuse it across a given fusion.

Collaborator

same here. But we can leave it as an optimization later.
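
One possible shape for that refactor (sketch; the extra ValGraph parameter is a hypothetical signature change, not what the PR currently has): build the IdModel once at the call site and thread the exact graph through the generators.

// Build once per fusion...
IdModel id_model(&fusion);
const ValGraph& exact_graph = id_model.buildExactGraph();

// ...and pass the already-built graph into each generator so none of them
// constructs its own IdModel internally.
auto shape_module = generate_infer_shape_module(
    input_logical_domain, output_logical_domain, fusion, exact_graph);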

IdModel id_model(&fusion);
const ValGraph& graph = id_model.buildExactGraph();
auto exprs = traverse_expr_group(graph, input_logical_domain, output_logical_domain);
for(long unsigned int i = 0; i < input_logical_domain.size(); i++){
Collaborator

Nit: we already have the exact graph, so we don't need to load everything; we only need to load the unique values.

Collaborator

I think from this point on we should treat all the expressions at the granularity of ValGroup and ExprGroup, which should get rid of some repetitive expressions. Since LLVM IR isn't aware of the mapping, it won't be able to CSE those on its own.

};

// Dependency graph entry for the shape inference
class dependency_graph{
Collaborator

IIUC, you just wanted a topo order traversal?

We should revisit this... I think we have utils for sorting already.

Collaborator

If we don't, we should add that 😉
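
For reference, the dependency_graph class seems to boil down to something like Kahn's algorithm. A standalone sketch (illustrative only, not an existing nvfuser utility):

#include <queue>
#include <unordered_map>
#include <vector>

// Kahn's algorithm over an adjacency list: users[n] lists the nodes that
// depend on n. Returns nodes in dependency (topological) order.
template <typename Node>
std::vector<Node> topoSort(
    const std::unordered_map<Node, std::vector<Node>>& users) {
  std::unordered_map<Node, int> in_degree;
  for (const auto& [node, outs] : users) {
    in_degree.try_emplace(node, 0);
    for (const Node& out : outs) {
      ++in_degree[out];
    }
  }
  std::queue<Node> ready;
  for (const auto& [node, degree] : in_degree) {
    if (degree == 0) {
      ready.push(node);
    }
  }
  std::vector<Node> order;
  while (!ready.empty()) {
    Node node = ready.front();
    ready.pop();
    order.push_back(node);
    auto it = users.find(node);
    if (it == users.end()) {
      continue;
    }
    for (const Node& out : it->second) {
      if (--in_degree[out] == 0) {
        ready.push(out);
      }
    }
  }
  // order.size() < in_degree.size() here would indicate a cycle.
  return order;
}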

}

// Allocate the output tensor based on the shape and stride inference
at::Tensor aten_output_allocation(FuncType shape_infer_func, FuncType stride_infer_func, const at::Tensor& input, int64_t output_tensor_dim) {
Collaborator

nit: we should refactor this to encapsulate these compiled functions.

}

for(auto* val : output_vals){
generate_shape_llvm_ir(val, val2graph, builder);
Collaborator

I think we should rethink the traversal part. Let's take this offline.

@wolfcomos wolfcomos requested a review from jjsjann123 June 12, 2025 18:32
at::Tensor output_tensor = jit.allocateOutputTensor({t0});

// Print Output Tensor Info
print_tensor_info(output_tensor);
Collaborator

nit: debug print should be removed.

// [N, H, W, C]
tv1->merge(1);
// [N, H*W, C]
tv1->setAllocationDomain(tv1->getLoopDomain(), {true, true, true});
Collaborator

Is this a legal case?

Since we are re-ordering the split last dimension, can we still merge it back? This might need to trigger an assert.

// [N * H, W * C/4, 2, 2]
TensorView* tv1 = set(tv0);

tv1->setAllocationDomain(tv1->getLoopDomain(), {true, true, true, true});
Collaborator

tv1 is not used?

// [N * H, W, C/4, 4]
tv0->split(3,2);
// [N * H, W, C/4, 2, 2]
tv0->merge(1,2);
Collaborator

QQ: tv0 transformation doesn't really do anything, since it's done on inputs. Is there anything I missed?

TensorView* tv2 = makeContigConcreteTensor({N, H, W});
fusion.addInput(tv2);
auto tv3 = broadcast(tv2, {false, false, false, true});
auto tv4 = add(tv2, tv3);
Collaborator

What's happening with tv4 here?

Shouldn't it be of shape (N, H, W, 1)?

Comment on lines +773 to +785
// Map the output values to the input values if they are the same
for(auto* val : output_values){
auto index = mapToInputDomain(boundary_vals, val, graph);
if(index != -1){
val2llvm_val[graph.toGroup(val)] = val2llvm_val[graph.toGroup(boundary_vals[index])];
}
}

// Store the output values to the preallocated output buffer
for(size_t i = 0; i < output_values.size(); i++){
auto* output_i_ptr = builder.CreateGEP(int64Ty, output_ptr, builder.getInt64(i), "ptr");
builder.CreateStore(val2llvm_val[graph.toGroup(output_values[i])], output_i_ptr);
}
Collaborator

these two for loops could be merged.
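
Something along these lines (rough sketch reusing the snippet's names; it assumes nothing else relies on the aliased map entries the first loop used to write):

for (size_t i = 0; i < output_values.size(); i++) {
  auto* val = output_values[i];
  // Resolve to the mapped input value when one exists, otherwise use the
  // value already computed for this group.
  auto index = mapToInputDomain(boundary_vals, val, graph);
  llvm::Value* llvm_val = (index != -1)
      ? val2llvm_val[graph.toGroup(boundary_vals[index])]
      : val2llvm_val[graph.toGroup(val)];
  auto* output_i_ptr =
      builder.CreateGEP(int64Ty, output_ptr, builder.getInt64(i), "ptr");
  builder.CreateStore(llvm_val, output_i_ptr);
}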

By default, we assume it is in topological order, which means input values are ready to use

*/
void generate_shape_llvm_ir(Expr* expr, llvm::IRBuilder<>& builder, std::unordered_map<ValGroup,llvm::Value*>& val2llvm, std::unordered_map<int, Val*>& boundary_vals, const ValGraph& graph) {
Collaborator

Why do we need boundary_vals? Is this an optimization for simpler math, i.e. just use input shapes whenever possible?

}
else{
input_outer_llvm_val = val2llvm[graph.toGroup(merge_input_outer_val)];
}
Collaborator

I'm seeing this pattern a lot vvv

auto* v = ...;
int index = mapToInputDomain(...);
llvm::Value* val = nullptr;
if (index != -1) {
  val = val2llvm[graph.toGroup(boundary_vals[index])];
} else {
  val = val2llvm[graph.toGroup(v)];
}

Maybe make it a lambda to keep the code easier to read.
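
For example (untested sketch, same names as above):

auto lookupLlvmVal = [&](Val* v) -> llvm::Value* {
  // Prefer the mapped input-domain value when one exists.
  int index = mapToInputDomain(boundary_vals, v, graph);
  return (index != -1) ? val2llvm[graph.toGroup(boundary_vals[index])]
                       : val2llvm[graph.toGroup(v)];
};

// Call sites then collapse to:
// input_outer_llvm_val = lookupLlvmVal(merge_input_outer_val);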

}
}
// outer = input + 1
llvm::Value* minus_1 = builder.CreateSub(input_llvm_val, builder.getInt64(1), "minus_1");
Collaborator

The comment says + 1 but the code is minus_1?

Collaborator

Looks like we are trying to do a ceil_div? Looks like the existing comment is just wrong.
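
For reference, the usual integer ceil-division of the input extent by the split factor is (extent + factor - 1) / factor. A minimal IRBuilder sketch with illustrative variable names:

llvm::Value* factor_minus_one =
    builder.CreateSub(factor, builder.getInt64(1), "factor_minus_one");
llvm::Value* numerator =
    builder.CreateAdd(extent, factor_minus_one, "numerator");
// outer = ceilDiv(extent, factor)
llvm::Value* outer_extent =
    builder.CreateUDiv(numerator, factor, "outer_extent");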

// outer = (input + 1 + inner) / inner
val2llvm[graph.toGroup(split_output_outer_val)] = builder.CreateUDiv(sum_ab, val2llvm[graph.toGroup(split_output_inner_val)], split_output_outer_val->as<IterDomain>()->extent()->toString());
}
else{
Collaborator

Nitpick: this looks like the exact same logic as the inner split. Let's merge the two paths.

@@ -834,7 +864,15 @@ if(BUILD_TEST)
${NVFUSER_ROOT}/tests/cpp/test_host_ir_integration.cpp
${NVFUSER_ROOT}/tests/cpp/test_host_ir_stream_lowering.cpp
)
if(BUILD_LLVM)
list(APPEND HOSTIR_TEST_SRCS
${NVFUSER_ROOT}/tests/cpp/test_host_ir_llvm_lowering.cpp
Collaborator

Suggested change
- ${NVFUSER_ROOT}/tests/cpp/test_host_ir_llvm_lowering.cpp
+ ${NVFUSER_ROOT}/tests/cpp/test_host_ir_compilation.cpp

"LLVM lowering" is also correct, but terminology-wise I tend to use "lowering" for Fusion IR => host IR and "compilation" for host IR => LLVM.

Collaborator

Maybe compile_to_llvm.cpp?

Comment on lines +45 to +46
using ShapeInferFunc = void (*)(const int64_t* input_tensor_shape, int64_t input_tensor_shape_buffer_size,
int64_t* output_tensor_shape, int64_t output_tensor_shape_buffer_size);
Collaborator

Suggested change
- using ShapeInferFunc = void (*)(const int64_t* input_tensor_shape, int64_t input_tensor_shape_buffer_size,
-     int64_t* output_tensor_shape, int64_t output_tensor_shape_buffer_size);
+ using ShapeInferFunc = std::function<void(const int64_t*, int64_t, int64_t*, int64_t)>;
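
If the alias becomes a std::function, the looked-up JIT symbol can still be wrapped after the fact. A sketch assuming LLVM 18's LLJIT, where lookup returns an ExecutorAddr:

// Raw signature of the JIT-compiled symbol.
using RawShapeInferFn = void (*)(const int64_t*, int64_t, int64_t*, int64_t);

auto addr = ExitOnErr(JIT->lookup("infer_shape"));
ShapeInferFunc infer_shape = addr.toPtr<RawShapeInferFn>();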

// Dependency graph entry for the stride inference
struct StrideInfo {
public:
llvm::Value* llvm_extent = nullptr; // LLVM Value for the extent of this IterDomain
Collaborator

Can you run lintrunner -a?

template <typename T>
T ExitOnErr(llvm::Expected<T> &&E) {
if (!E) {
NVF_ERROR(false, "LLVM JIT Initialization Error: " + llvm::toString(E.takeError()));
Collaborator

Suggested change
- NVF_ERROR(false, "LLVM JIT Initialization Error: " + llvm::toString(E.takeError()));
+ NVF_ERROR(false, "LLVM JIT Initialization Error: ", llvm::toString(E.takeError()));

Comment on lines +30 to +33
void compile(TensorView* output_tv);

// Execute the compiled functions to allocate and return an output tensor.
at::Tensor allocateOutputTensor(const std::vector<at::Tensor>& input_tensors);
Collaborator

Is HostIrLlvmJit per TensorView? What happens when I call compile(tv0) and then compile(tv1) on the same object? The function compiled out of tv0 is overwritten?

Collaborator

I'm asking this because the end state is unclear to me.

While I'm happy to take experimental/intermediate code, what do you think about the following?

class HostIrLlvmJit {
 public:
  struct CompileOptions {
    int num_threads;
  };

  HostIrLlvmJit(HostIrContainer* container, CompileOptions options);

  // Used for the first stage, where we use LLVMJIT only for fast allocation.
  at::Tensor allocate(kir::Allocate* allocate);

  // Used for the second stage, where we use LLVMJIT to run the entire
  // HostIrContainer. At that point it even makes sense for HostIrLlvmJit to
  // take ownership of the HostIrContainer, and therefore
  // HostIrLlvmJit(std::unique_ptr<HostIrContainer> container, ...).
  KernelArgumentHolder run(const KernelArgumentHolder& inputs);
};

Collaborator

It definitely shouldn't be per TensorView. We should be able to reuse it across a given HostIrContainer at least. The sketch above looks good to me.
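
To make the intended flow concrete, hypothetical usage of the interface sketched above (every name follows the proposal; none of it is existing API):

HostIrLlvmJit::CompileOptions options{/*num_threads=*/4};
HostIrLlvmJit jit(container.get(), options);

// Stage 1: JIT only the allocation fast path.
at::Tensor out = jit.allocate(allocate_node);

// Stage 2: hand the whole HostIrContainer to the JIT.
KernelArgumentHolder outputs = jit.run(inputs);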

if (!E) {
NVF_ERROR(false, "LLVM JIT Initialization Error: " + llvm::toString(E.takeError()));
llvm::errs() << llvm::toString(E.takeError()) << "\n";
exit(1);
Collaborator

Throw an exception instead? It's not necessarily the end of the world when compilation fails; we could always fall back to interpretation.
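
A sketch of what that fallback could look like at a call site (the fallback helper is hypothetical; this assumes NVF_ERROR throws, so the exit(1) can simply go away):

at::Tensor output;
try {
  // Compiled fast path.
  output = jit.allocateOutputTensor({t0});
} catch (const std::exception& e) {
  // Compilation or symbol lookup failed; fall back to the interpreted
  // HostIrEvaluator-based allocation instead of aborting the process.
  output = allocateViaEvaluator(/*...*/); // hypothetical fallback
}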

std::string op_string = std::string(expr->getOpString());

// Perform the merge -> mul transformation
if(op_string == "Merge"){
Collaborator

Suggested change
- if(op_string == "Merge"){
+ if (auto* merge = dynamic_cast<Merge*>(expr)) {

boundary_vals[i] = input_vals[i];
}

IdModel id_model(&fusion);
Collaborator

I suspect IdModel buys you little. For top-level expressions (e.g. allocation), we mostly care about extents, which are mapped across ops by construction, not indices. I also doubt we exercised IdModel at all on non-SSA IR (e.g. host IR and kernel IR).

Collaborator

Let's discuss this one in our meeting offline.

logical_sharded_shape_result.size());

// Create the output tensor with the computed shape and strides
at::Tensor allocated_tensor = at::empty_strided(logical_sharded_shape_result, logical_stride_result, input_tensors[0].options());
Collaborator

This is far from what I had in mind. I expect the LLVM JIT to take a TensorView* and generate an LLVM IR function like the following:

at::Tensor allocate_tv1(int64_t i0, int64_t i1, ...) { // args are the logical domain extents of tv1
  // calculate local tensor sizes (e.g. for multi-GPU)
  // calculate strides
  return at::empty_strided(local_tensor_sizes, strides);
}

logical_shape_infer_fn and logical_stride_infer_fn shouldn't even exist. They should be inlined into the above allocate_tv1 function for performance and simplicity. at::empty_strided should be called by allocate_tv1 rather than by the JIT.

I'm less worried about this particular IR than what we'll try to deliver by the end of the internship. I'm happy to discuss this in person.

Collaborator

I think we are on the same page, i.e. eventually we want better encapsulation.

I'm trying to limit the scope of this PR so that functional pieces get merged in first. We should refactor this in follow-up PRs during integration.

}

TEST_F(HostIrLLVMTest, AllocationMergeSplit1) {
Fusion fusion;
Collaborator

This looks wrong. HostIrLlvmJit, similar to HostIrEvaluator, should take HostIrContainer, not Fusion.

Collaborator

I think this goes back to the topic of how we expect the inference to be done. The only reason I suggested going with Fusion directly is that I feel more comfortable about IdModel running on it.
