LLVM lowering #4581
Conversation
!test
!test
I'm seeing that workflows from a forked repo require maintainer approval... I didn't know that was a thing. I have already added wolfcomos as a developer in the repo. Do we need to close this and open a new PR from a branch inside this repo in order to trigger CI? Asking @xwang233
Yes, the PR has to be initiated from a branch in this repo in order for
Oh, it's not your fault. I should have known that. @wolfcomos I'm closing this PR. Would you mind pushing the branch directly to the nvfuser repo and starting a PR there instead?
!test |
!test |
!test |
Note for myself: need to continue reviewing the transform and utils.
CMakeLists.txt
Outdated
set(BUILD_LLVM ON)
if(BUILD_LLVM)
  # Add LLVM JIT related dependencies
  set(LLVM_DIR "/usr/lib/llvm-18/lib/cmake/llvm")
Note for myself: do we need to force an LLVM_DIR? It's under /usr/lib, so I think it's unnecessary.
Nevertheless, this probably should have been a -D option passed at CMake configure time instead.
If you have to specify the folder, query it instead of hardcoding it:
$ llvm-config --libdir
/usr/lib/llvm-18/lib
CMakeLists.txt
Outdated
if(BUILD_LLVM)
  target_include_directories(test_host_ir PRIVATE ${LLVM_INCLUDE_DIRS})
  target_compile_definitions(test_host_ir PRIVATE ${LLVM_DEFINITIONS})
  target_link_libraries(test_host_ir PUBLIC ${LLVM_LIBS})
My naive question is: why the mismatch between the handling of the two targets? For codegen_internal we only added lib dependencies, but here we also need the new include dir and compile options?
// }
// run with the following command: NVFUSER_ENABLE=host_ir_lowering ./bin/test_host_ir --gtest_filter=HostIrEvaluatorTest.LaunchKernel2 |
Setting the enable option should be done for the given test suite, e.g.:
Fuser/tests/cpp/test_host_ir_integration.cpp
Lines 75 to 80 in 43c9e13
class HostIrIntegrationTest : public NVFuserTest {
 protected:
  HostIrIntegrationTest() {
    EnableOptionsGuard::getCurOptions().set(EnableOption::HostIrLowering);
  }
};
csrc/host_ir/lower_to_llvm.h
Outdated
@@ -0,0 +1,143 @@
Missing license header.
@@ -0,0 +1,704 @@
#include <host_ir/lower_to_llvm.h> |
ditto
print_compare_tensor(output_tensor, t0);
}
// TEST_F(HostIrEvaluatorTest, LaunchKernel3) { |
Why is this test commented out?
tv1->setAllocationDomain({tv1->axis(0), tv1->axis(1), tv1->axis(2)}, {true, true, true});
tv1->printTransforms();
Suggested change:
tv1->setAllocationDomain({tv1->axis(0), tv1->axis(1), tv1->axis(2)}, {true, true, true});
tv1->printTransforms();
tv1->setAllocationDomain(tv1->getLoopDomain(), true);
auto allocation_domain = tv1->getAllocationDomain();
auto logical_domain = tv1->getLogicalDomain();
std::unique_ptr<llvm::orc::LLJIT> JIT = llvm_jit_init(4);
llvm_jit_compile_shape_infer(JIT, fusion, logical_domain, logical_domain);
Shouldn't the third argument be the logical domain of the input, i.e. tv0->getLogicalDomain()?
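For illustration, a minimal sketch of the corrected call under that suggestion (tv0 is the fusion input from the test above; the local copy is only there so the non-const reference parameter can bind):
// Use the input's logical domain for the input argument, not the output's.
auto input_logical_domain = tv0->getLogicalDomain();
llvm_jit_compile_shape_infer(JIT, fusion, input_logical_domain, logical_domain);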
llvm_jit_compile_stride_infer(JIT, fusion, allocation_domain, logical_domain);

auto func_infer_shape = ExitOnErr(JIT->lookup("infer_shape"));
auto func_infer_stride = ExitOnErr(JIT->lookup("infer_stride"));
Note: I'm curious how our benchmark was done when we were comparing the vanilla HostIr allocation/execution vs the compiled code path.
nitpick: we would want to encapsulate error checking
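A minimal sketch of such an encapsulation, assuming LLVM 18's LLJIT::lookup returning llvm::Expected; the helper name lookupOrThrow is made up for illustration:
// Wraps JIT->lookup() and converts a failed llvm::Expected into an NVF_ERROR.
auto lookupOrThrow = [](llvm::orc::LLJIT& jit, llvm::StringRef name) {
  auto sym = jit.lookup(name);
  NVF_ERROR(sym, "LLVM JIT lookup failed for ", name.str(), ": ", llvm::toString(sym.takeError()));
  return *sym;
};
// Usage: auto func_infer_shape = lookupOrThrow(*JIT, "infer_shape");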
csrc/host_ir/lower_to_llvm.h
Outdated
// Generate shape infer llvm module
llvm::orc::ThreadSafeModule generate_infer_shape_module(std::vector<IterDomain*>& input_logical_domain, std::vector<IterDomain*>& output_logical_domain, Fusion& fusion);

/*
We only need the API from this line down. We should hide everything above in the cpp file if it doesn't need to be exposed.
csrc/host_ir/lower_to_llvm.h
Outdated
#include <llvm/ExecutionEngine/Orc/IRCompileLayer.h>
#include <llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h>
#include <llvm/ExecutionEngine/JITLink/JITLink.h>
#include "llvm/Support/Error.h"
We should also be able to hide the majority of the includes, as long as we push the implementation internals into the cpp file instead.
csrc/host_ir/lower_to_llvm.cpp
Outdated
// shape infer compile
void llvm_jit_compile_shape_infer(std::unique_ptr<llvm::orc::LLJIT>& JIT, Fusion& fusion, std::vector<IterDomain*>& input_domain, std::vector<IterDomain*>& output_domain){
  std::cout << "llvm_jit_compile shape infer" << std::endl;
nitpick: let's remove the debug print. If you want to keep these for debugging purposes, you can add a dump option and put them inside an if block.
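A minimal sketch of that gating; the DebugDumpOption::HostIrLlvm value is hypothetical (not in the PR), while the isDebugDumpEnabled/debug() pattern follows what the codebase uses elsewhere:
// Emit the trace only when a (hypothetical) dump option is enabled.
if (isDebugDumpEnabled(DebugDumpOption::HostIrLlvm)) {
  debug() << "llvm_jit_compile shape infer" << std::endl;
}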
csrc/host_ir/lower_to_llvm.cpp
Outdated
std::cout << "llvm_jit_compile shape infer" << std::endl;
auto TSM_shape = generate_infer_shape_module(input_domain, output_domain, fusion);
if (auto Err = JIT->addIRModule(std::move(TSM_shape))) {
  llvm::errs() << "Error adding module to JIT: " << llvm::toString(std::move(Err)) << "\n";
nitpick: do we need to use llvm::errs()? There's NVF_ERROR for that: https://github.com/NVIDIA/Fuser/blob/main/csrc/exceptions.h
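For example, the error path above could become (a sketch based on the snippet, using NVF_ERROR's variadic message arguments):
if (auto Err = JIT->addIRModule(std::move(TSM_shape))) {
  // NVF_ERROR throws, so no separate print-and-continue path is needed.
  NVF_ERROR(false, "Error adding module to JIT: ", llvm::toString(std::move(Err)));
}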
csrc/host_ir/lower_to_llvm.cpp
Outdated
}

// shape infer compile
void llvm_jit_compile_shape_infer(std::unique_ptr<llvm::orc::LLJIT>& JIT, Fusion& fusion, std::vector<IterDomain*>& input_domain, std::vector<IterDomain*>& output_domain){
Is this supposed to be a concatenated full ID list over all inputs? Nit: name it inputs_domain instead.
llvm::Value* input_ptr = &*arg_it;
llvm::Value* output_ptr = &*arg_it+2;
IdModel id_model(&fusion);
const ValGraph& graph = id_model.buildExactGraph();
We don't want to repeatedly build the same id_model. Let's start thinking about a refactor so we can reuse it across a given fusion.
Same here, but we can leave it as an optimization for later.
csrc/host_ir/lower_to_llvm.cpp
Outdated
IdModel id_model(&fusion);
const ValGraph& graph = id_model.buildExactGraph();
auto exprs = traverse_expr_group(graph, input_logical_domain, output_logical_domain);
for(long unsigned int i = 0; i < input_logical_domain.size(); i++){
nit: We already have the exact graph, so we don't need to load everything; we only need to load unique values.
I think from this point on we should treat all the expressions at the granularity of ValGroup and ExprGroup, which should get rid of some repeated expressions; since LLVM IR isn't aware of the mapping, it won't be able to CSE those.
csrc/host_ir/lower_to_llvm.h
Outdated
};

// Dependency graph entry for the shape inference
class dependency_graph{
IIUC, you just want a topological-order traversal? We should revisit this... I think we already have utils for sorting.
If we don't, we should add that 😉
csrc/host_ir/lower_to_llvm.cpp
Outdated
}

// Allocate the output tensor based on the shape and stride inference
at::Tensor aten_output_allocation(FuncType shape_infer_func, FuncType stride_infer_func, const at::Tensor& input, int64_t output_tensor_dim) {
nit: we should refactor this to encapsulate these compiled functions.
csrc/host_ir/lower_to_llvm.cpp
Outdated
}

for(auto* val : output_vals){
  generate_shape_llvm_ir(val, val2graph, builder);
I think we should rethink the traversal part. Let's take this offline.
…stride calculation
Llvm lowering dev
Llvm lowering dev
Llvm lowering dev
Llvm lowering dev
Llvm lowering dev
at::Tensor output_tensor = jit.allocateOutputTensor({t0});

// Print Output Tensor Info
print_tensor_info(output_tensor);
nit: debug print should be removed.
// [N, H, W, C]
tv1->merge(1);
// [N, H*W, C]
tv1->setAllocationDomain(tv1->getLoopDomain(), {true, true, true});
Is this a legal case?
Since we are re-ordering the split last dimension, can we still merge it back? This might need to trigger an assert.
// [N * H, W * C/4, 2, 2]
TensorView* tv1 = set(tv0);

tv1->setAllocationDomain(tv1->getLoopDomain(), {true, true, true, true});
tv1 is not used?
// [N * H, W, C/4, 4]
tv0->split(3,2);
// [N * H, W, C/4, 2, 2]
tv0->merge(1,2);
QQ: tv0 transformation doesn't really do anything, since it's done on inputs. Is there anything I missed?
TensorView* tv2 = makeContigConcreteTensor({N, H, W});
fusion.addInput(tv2);
auto tv3 = broadcast(tv2, {false, false, false, true});
auto tv4 = add(tv2, tv3);
what's happening with tv4 here?
it should be of shape (N, H, W, 1)?
// Map the output values to the input values if they are the same
for(auto* val : output_values){
  auto index = mapToInputDomain(boundary_vals, val, graph);
  if(index != -1){
    val2llvm_val[graph.toGroup(val)] = val2llvm_val[graph.toGroup(boundary_vals[index])];
  }
}

// Store the output values to the preallocated output buffer
for(size_t i = 0; i < output_values.size(); i++){
  auto* output_i_ptr = builder.CreateGEP(int64Ty, output_ptr, builder.getInt64(i), "ptr");
  builder.CreateStore(val2llvm_val[graph.toGroup(output_values[i])], output_i_ptr);
}
these two for loops could be merged.
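A minimal sketch of the merged loop, reusing the names from the snippet above:
// Map each output value to its matching input (boundary) value, if any, and store it
// into the preallocated output buffer in a single pass.
for (size_t i = 0; i < output_values.size(); i++) {
  auto* val = output_values[i];
  auto index = mapToInputDomain(boundary_vals, val, graph);
  if (index != -1) {
    val2llvm_val[graph.toGroup(val)] = val2llvm_val[graph.toGroup(boundary_vals[index])];
  }
  auto* output_i_ptr = builder.CreateGEP(int64Ty, output_ptr, builder.getInt64(i), "ptr");
  builder.CreateStore(val2llvm_val[graph.toGroup(val)], output_i_ptr);
}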
By default, we assume it is in topological order, which means input values are ready to use

*/
void generate_shape_llvm_ir(Expr* expr, llvm::IRBuilder<>& builder, std::unordered_map<ValGroup,llvm::Value*>& val2llvm, std::unordered_map<int, Val*>& boundary_vals, const ValGraph& graph) {
Why do we need boundary_vals? Is this an optimization for simpler math, i.e. just using input shapes whenever possible?
}
else{
  input_outer_llvm_val = val2llvm[graph.toGroup(merge_input_outer_val)];
}
I'm seeing this pattern a lot:
auto* v = ...;
int index = mapToInputDomain(...);
llvm::Value* val = nullptr;
if (index != -1) {
val = val2llvm[graph.toGroup(boundary_vals[index])];
} else {
val = val2llvm[graph.toGroup(v)];
}
Maybe make it a lambda to keep the code easier to read.
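A minimal sketch of that lambda, using the same names as the snippet above:
// Resolve a Val to its LLVM value, preferring the mapped boundary (input) value when one exists.
auto getLlvmVal = [&](Val* v) -> llvm::Value* {
  int index = mapToInputDomain(boundary_vals, v, graph);
  if (index != -1) {
    return val2llvm[graph.toGroup(boundary_vals[index])];
  }
  return val2llvm[graph.toGroup(v)];
};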
}
}
// outer = input + 1
llvm::Value* minus_1 = builder.CreateSub(input_llvm_val, builder.getInt64(1), "minus_1");
The comment says + 1 but the code is minus_1?
Looks like we are trying to do a ceil_div? Looks like the comment is just wrong.
// outer = (input + 1 + inner) / inner
val2llvm[graph.toGroup(split_output_outer_val)] = builder.CreateUDiv(sum_ab, val2llvm[graph.toGroup(split_output_inner_val)], split_output_outer_val->as<IterDomain>()->extent()->toString());
}
else{
nitpick: this looks like the exact same logic as the inner split. Let's merge the logic together.
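A minimal sketch of a shared ceil-div helper that both the inner and outer split branches could call (the helper name and argument names are illustrative, not from the PR):
// outer = ceilDiv(input, inner) = (input + inner - 1) / inner
auto createCeilDiv = [&](llvm::Value* input, llvm::Value* inner, const std::string& name) {
  llvm::Value* inner_minus_1 = builder.CreateSub(inner, builder.getInt64(1), "inner_minus_1");
  llvm::Value* sum = builder.CreateAdd(input, inner_minus_1, "sum");
  return builder.CreateUDiv(sum, inner, name);
};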
@@ -834,7 +864,15 @@ if(BUILD_TEST)
    ${NVFUSER_ROOT}/tests/cpp/test_host_ir_integration.cpp
    ${NVFUSER_ROOT}/tests/cpp/test_host_ir_stream_lowering.cpp
  )
  if(BUILD_LLVM)
    list(APPEND HOSTIR_TEST_SRCS
      ${NVFUSER_ROOT}/tests/cpp/test_host_ir_llvm_lowering.cpp
Suggested change:
${NVFUSER_ROOT}/tests/cpp/test_host_ir_llvm_lowering.cpp
${NVFUSER_ROOT}/tests/cpp/test_host_ir_compilation.cpp
LLVM lowering is also correct, but terminology-wise I tend to use "lowering" for fusion IR => host IR, and "compilation" for host IR to LLVM.
Maybe compile_to_llvm.cpp?
using ShapeInferFunc = void (*)(const int64_t* input_tensor_shape, int64_t input_tensor_shape_buffer_size,
    int64_t* output_tensor_shape, int64_t output_tensor_shape_buffer_size);
Suggested change:
using ShapeInferFunc = void (*)(const int64_t* input_tensor_shape, int64_t input_tensor_shape_buffer_size,
    int64_t* output_tensor_shape, int64_t output_tensor_shape_buffer_size);
using ShapeInferFunc = std::function<void(const int64_t*, int64_t, int64_t*, int64_t)>;
// Dependency graph entry for the stride inference
struct StrideInfo {
 public:
  llvm::Value* llvm_extent = nullptr; // LLVM Value for the extent of this IterDomain
Can you run lintrunner -a?
template <typename T>
T ExitOnErr(llvm::Expected<T> &&E) {
  if (!E) {
    NVF_ERROR(false, "LLVM JIT Initialization Error: " + llvm::toString(E.takeError()));
Suggested change:
NVF_ERROR(false, "LLVM JIT Initialization Error: " + llvm::toString(E.takeError()));
NVF_ERROR(false, "LLVM JIT Initialization Error: ", llvm::toString(E.takeError()));
void compile(TensorView* output_tv);

// Execute the compiled functions to allocate and return an output tensor.
at::Tensor allocateOutputTensor(const std::vector<at::Tensor>& input_tensors);
Is HostIrLlvmJit per TensorView? What happens when I call compile(tv0) and then compile(tv1) on the same object? Is the function compiled out of tv0 overwritten?
I'm asking this because the end state is unclear to me.
While I'm happy to take experimental/intermediate code, what do you think about the following?
class HostIrLlvmJit {
 public:
  struct CompileOptions {
    int num_threads;
  };
  HostIrLlvmJit(HostIrContainer* container, CompileOptions options);
  // Used for the first stage, where we use LLVM JIT only for fast allocation.
  at::Tensor allocate(kir::Allocate* allocate);
  // Used for the second stage, where we use LLVM JIT to run the entire HostIrContainer. At that
  // point it even makes sense for HostIrLlvmJit to take ownership of the HostIrContainer, and
  // therefore HostIrLlvmJit(std::unique_ptr<HostIrContainer> container, ...).
  KernelArgumentHolder run(const KernelArgumentHolder& inputs);
};
It definitely shouldn't be per TensorView. We should be able to reuse it across a given HostIrContainer at least. The above looks good to me.
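A minimal usage sketch of that proposed interface (hypothetical; it reflects the suggestion above, not what the PR currently implements):
// One JIT per HostIrContainer, reused for every allocation in that container.
HostIrLlvmJit jit(container.get(), HostIrLlvmJit::CompileOptions{/*num_threads=*/4});
for (auto* expr : container->topLevelExprs()) {
  if (auto* alloc = dynamic_cast<kir::Allocate*>(expr)) {
    at::Tensor buffer = jit.allocate(alloc);
    // ... hand the buffer to the evaluator / kernel launch path ...
  }
}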
if (!E) {
  NVF_ERROR(false, "LLVM JIT Initialization Error: " + llvm::toString(E.takeError()));
  llvm::errs() << llvm::toString(E.takeError()) << "\n";
  exit(1);
Throw an exception? It's not necessarily the end of the world when compilation fails; we could always fall back to interpretation.
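A minimal sketch of what that could look like at the call site, assuming compile() throws on failure; allocateWithExpressionEvaluator is a hypothetical stand-in for the existing interpreted path:
bool use_jit = true;
try {
  jit.compile(output_tv); // may throw if LLVM compilation fails
} catch (const std::exception& e) {
  // Not fatal: fall back to interpretation via the expression evaluator.
  use_jit = false;
}
at::Tensor output = use_jit ? jit.allocateOutputTensor(input_tensors)
                            : allocateWithExpressionEvaluator(input_tensors); // hypothetical fallback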
std::string op_string = std::string(expr->getOpString());

// Perform the merge -> mul transformation
if(op_string == "Merge"){
Suggested change:
if(op_string == "Merge"){
if (auto* merge = dynamic_cast<Merge*>(expr)) {
boundary_vals[i] = input_vals[i];
}

IdModel id_model(&fusion);
I suspect IdModel buys you little. For top-level expressions (e.g. allocation), we mostly care about extents, which are mapped across ops by construction, not indices. I also doubt we exercised IdModel at all on non-SSA IR (e.g. host IR and kernel IR).
Let's discuss this one in our meeting offline.
  logical_sharded_shape_result.size());

// Create the output tensor with the computed shape and strides
at::Tensor allocated_tensor = at::empty_strided(logical_sharded_shape_result, logical_stride_result, input_tensors[0].options());
This is far from what I had in mind. I expect the LLVM JIT to take a TensorView* and generate an LLVM IR function like the following:
at::Tensor allocate_tv1(int64_t i0, int64_t i1, ...) { // args represent logical domain extents of tv1
  calculate local tensor sizes (e.g. for multi-GPU)
  calculate strides
  return at::empty_strided(local_tensor_sizes, strides)
}
logical_shape_infer_fn and logical_stride_infer_fn shouldn't even exist. They should be inlined into the above allocate_tv1 function for performance and simplicity. at::empty_strided should be called by allocate_tv1 rather than by the JIT.
I'm less worried about this particular IR than what we'll try to deliver by the end of the internship. I'm happy to discuss this in person.
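For illustration, a host-side sketch of how such a compiled function might be looked up and invoked (the name allocate_tv1 and its two-extent signature follow the pseudocode above and are illustrative only):
// Look up the per-TensorView allocator in the JIT and call it with the
// logical-domain extents evaluated on the host.
using AllocateTv1Fn = at::Tensor (*)(int64_t i0, int64_t i1);
auto addr = ExitOnErr(JIT->lookup("allocate_tv1"));
auto allocate_tv1 = addr.toPtr<AllocateTv1Fn>();
at::Tensor tv1_buffer = allocate_tv1(/*i0=*/N, /*i1=*/H * W);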
I think we are on the same page, i.e. eventually we would want better encapsulation. I'm trying to limit the scope of the PR to get functional things merged in first. We should refactor this in follow-up PRs during integration.
}

TEST_F(HostIrLLVMTest, AllocationMergeSplit1) {
  Fusion fusion;
This looks wrong. HostIrLlvmJit, similar to HostIrEvaluator, should take HostIrContainer, not Fusion.
I think this goes back to the topic of how we expect the inference to be done. The only reason I suggested going with Fusion directly is that I feel more comfortable about IdModel running on that.
Goal: improve host IR latency by replacing functions that use expression evaluators with LLVM ORC JIT. Follow-up to #4431.
Currently:
Naive support of output tensor shape inference
Naive support of output stride inference