This tutorial logically follows the Using Dynamatic tutorial, and as such requires that you are already familiar with the concepts touched on in the latter. In this tutorial, we will write a small compiler optimization pass in C++ that will transform dataflow muxes into merges in an attempt to optimize our circuits' area and throughput. While we will write a little bit of C++ in this tutorial, it does not require much knowledge in the language.
Below are some technical details about this tutorial.
- All resources are located in the repository's
tutorials/Introduction/
folder. Data exclusive to this chapter is located in theCh2
subfolder, but we will also reuse data from the previous chapter. - All relative paths mentionned throughout the tutorial are assumed to start at Dynamatic's top-level folder.
- We assume that you have already built Dynamatic from source using the instructions in the top-level README or that you have access to a Docker container that has a pre-built version of Dynamatic .
This tutorial is divided into the following sections.
- Spotting an optimization opportunity | We take another look at the circuit from the previous tutorial and spot something that looks optimizable.
- Writing a small compiler pass | We implement the optimization as a compiler pass, and add it the compilation script to use it.
- Testing our pass | We test our pass to make sure it works as intended, and find out that it may not.
- A problem, and a solution! | After identifying a problem in one of our circuits, we implement a quick-and-dirty fix to make the circuit correct again.
- Conclusion | We reflect on everything we just accomplished.
Let's start by re-considering the [same loop_multiply
kernel](tutorials/Introduction/Ch1/loop_multiply.c
from the previous tutorial. See its definition below.
// The kernel under consideration
unsigned loop_multiply(in_int_t a[N]) {
unsigned x = 2;
for (unsigned i = 0; i < N; ++i) {
if (a[i] == 0)
x = x * x;
}
return x;
}
This simple kernel multiplies a number by itself at each iteration of a simple loop from 0 to any number N
where the corresponding element of an array equals 0. The function returns the calculated value after the loop exits.
If you have deleted the data generated by the synthesis flow on this kernel, you can regenerate it fully using the loop-multiply.dyn
frontend script that has already been written for you. Just run the following command from Dynamatic's top-level folder.
./bin/dynamatic --run tutorials/Introduction/Ch2/loop-multiply.dyn
This will compile the C kernel, functionally verify the generated VHDL, and re-open the dataflow visualizer. Note the [INFO] Simulation succeeded
message in the output (after the simulate
command), indicating that outputs of the VHDL design matched those of the original C kernel. All output files are generated in tutorials/Introduction/usingDynamatic/out
.
Tip
Identify all muxes in the circuit and derive their purpose in this circuit. Remember that muxes have an arbitrary number of data
inputs (here it is always 2) and one select
input, which selects which valid data
input gets forwarded to the output. Note that, in general, the select
input of muxes if generated by the index
output of the same block's control merge.
Another dataflow component that is similar to the mux in purpose is the merge. Identically to the mux, the merge has an arbitrary number of data
inputs, one of which gets forwarded to the output when it is valid. However, the two dataflow components have two key differences.
- The merge does not have a
select
input. Instead, at any given cycle, if any of its data input is valid and if its data output is ready, it will transfer a token to the output. - The merge does not provide any guarantee on input consumption order if at any given cycle multiple of its inputs are valid and its data output if ready. In those situations, it will simply transfer one of its input tokens to its output.
Due to this "simpler" interface, a merge is generally smaller in area than a corresponding mux with the same number of data
inputs. Replacing a mux with a merge may also speed up circuit execution since the merge does not have to wait for the arrival of a valid select
token to transfer one of its data
inputs to its output.
Let's try to make this circuit smaller by writing a compiler pass that will automatically replace all muxes with equivalent merges then!
In this section, we will add a small transformation pass that achieves the optimization opportunity we identified in the previous section. We will not go into much details into how C++ or MLIR works, our focus will be instead in writing something minimal that accomplishes the job cleanly. For a more complete tutorial on pass-writing, feel free to go through the "Creating Passes" tutorial after completing this one.
Creating this pass will involve creating 2 new source files and making minor editions to 3 existing source files. In order, we will
- Declare the pass in TableGen (a LLVM/MLIR language that eventually transpiles to C++).
- Write a minimal C++ header for the pass.
- Implement the pass in C++.
- Make new source file we created part of Dynamatic's build process.
- Edit a generic header to make our pass visible to Dynamatic's optimizer.
The first thing we need to do is declare our pass somewhere. In LLVM/MLIR, this happens in the TableGen language, a declarative format that ultimately transpiles to C++ during the build process to automatically generate a lot of boilerplate C++ code.
Open the include/dynamatic/Transforms/Passes.td file and copy-and-paste the following snippet anywhere below the include lines at the top of the file.
def HandshakeMuxToMerge : DynamaticPass<"handshake-mux-to-merge"> {
let summary = "Transform all muxes into merges.";
let description = [{
Transform all muxes within the IR into merges with identical data operands.
}];
let constructor = "dynamatic::createHandshakeMuxToMerge()";
}
This declares a compiler pass whose C++ class name will be based on HandshakeMuxToMerge
and which can be called using the --handshake-mux-to-merge
flag from Dynamatic's optimizer (we will go into more details into using Dynamatic's optimizer in the "Testing our pass" section). The summary
and description
fields are optional but useful to describe the pass's purpose. Finally, the constructor
field indicates the name of a C++ function that should returns an instance of our pass. We will declare and then define this method in the next two subsections.
We now need to write a small C++ header for our new pass. Each pass has one, and they are for the large part always structured in the same exact way. Create a file in include/dynamatic/Transforms
called HandshakeMuxToMerge.h
and paste the following chunk of code into it.
/// Classical C-style header guard
#ifndef DYNAMATIC_TRANSFORMS_HANDSHAKEMUXTOMERGE_H
#define DYNAMATIC_TRANSFORMS_HANDSHAKEMUXTOMERGE_H
/// Include some basic headers
#include "dynamatic/Support/DynamaticPass.h"
#include "dynamatic/Support/LLVM.h"
#include "mlir/Pass/Pass.h"
namespace dynamatic {
/// The following include file is autogenerated by LLVM/MLIR during the build
/// process from the Passes.td file we just edited. We only want to include the
/// part of the file that refers to our pass (it contains delcaration code for
/// all transformation passes), which we select using the two macros below.
#define GEN_PASS_DECL_HANDSHAKEMUXTOMERGE
#define GEN_PASS_DEF_HANDSHAKEMUXTOMERGE
#include "dynamatic/Transforms/Passes.h.inc"
/// The pass constructor, with the same name we specified in TableGen in the
/// previous subsection.
std::unique_ptr<dynamatic::DynamaticPass> createHandshakeMuxToMerge();
} // namespace dynamatic
#endif // DYNAMATIC_TRANSFORMS_HANDSHAKEMUXTOMERGE_H
This file does two important things:
- It includes C++ code auto-generated from the
Passes.td
file we just edited. - It declares the pass header that we announced in the pass's TableGen declaration.
Now that all declarations are made, it is time to actually implement our IR transformation!
Create a file in lib/Transforms
called HandshakeMuxToMerge.cpp
and in which we will implement our pass. Paste the following code into it.
/// Include the header we just created.
#include "dynamatic/Transforms/HandshakeMuxToMerge.h"
/// Include some other useful headers.
#include "dynamatic/Dialect/Handshake/HandshakeOps.h"
#include "dynamatic/Support/CFG.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
using namespace dynamatic;
namespace {
/// Simple driver for the pass that replaces all muxes with merges.
struct HandshakeMuxToMergePass
: public dynamatic::impl::HandshakeMuxToMergeBase<HandshakeMuxToMergePass> {
void runDynamaticPass() override {
// This is the top-level operation in all MLIR files. All the IR is nested
// within it
mlir::ModuleOp mod = getOperation();
MLIRContext *ctx = &getContext();
// Define the set of rewrite patterns we want to apply to the IR
RewritePatternSet patterns(ctx);
// Run a greedy pattern rewriter on the entire IR under the top-level module
// operation
mlir::GreedyRewriteConfig config;
if (failed(applyPatternsAndFoldGreedily(mod, std::move(patterns), config))) {
// If the greedy pattern rewriter fails, the pass must also fail
return signalPassFailure();
}
};
};
}; // namespace
/// Implementation of our pass constructor, which just returns an instance of
/// the `HandshakeMuxToMergePass` struct.
std::unique_ptr<dynamatic::DynamaticPass>
dynamatic::createHandshakeMuxToMerge() {
return std::make_unique<HandshakeMuxToMergePass>();
}
This file, at the botton, implements the pass constructor we declared in the header. This constuctor returns an instance of a struct defined just above---do not mind the slightly convoluted struct declaration, which showcases the curiously recurring template pattern C++ idiom that is used extensively throuhgout MLIR/Dynamatic---whose single method runDynamaticPass
defines what happens when the pass is called. In our case, we want to leverage MLIR's greedy pattern rewriter infrastructure to match on all muxes in the IR and replace them with merges with identical data inputs. If you would like to know more about how greedy pattern rewriting works, feel free to check out MLIR's official documentation on the subject. For this simple pass, you do not need to understand exactly how it works, just that it can match and try to rewrite certain operations inside the IR based on a set of user-provided rewrite patterns. Speaking of rewrite patterns, let's add our own to the file just above the HandshakeMuxToMergePass
struct definition. Paste the following into the file.
/// Rewrite pattern that will match on all muxes in the IR and replace each of
/// them with a merge taking the same inputs (except the `select` input which
/// merges do not have due to their undeterministic nature).
struct ReplaceMuxWithMerge : public OpRewritePattern<handshake::MuxOp> {
using OpRewritePattern<handshake::MuxOp>::OpRewritePattern;
LogicalResult matchAndRewrite(handshake::MuxOp muxOp,
PatternRewriter &rewriter) const override {
// Retrieve all mux inputs except the `select`
ValueRange dataOperands = muxOp.getDataOperands();
// Create a merge in the IR at the mux's position and with the same data
// inputs (or operands, in MLIR jargon)
handshake::MergeOp mergeOp =
rewriter.create<handshake::MergeOp>(muxOp.getLoc(), dataOperands);
// Make the merge part of the same basic block (BB) as the mux
inheritBB(muxOp, mergeOp);
// Retrieve the merge's output (or result, in MLIR jargon)
Value mergeResult = mergeOp.getResult();
// Replace usages of the mux's output with the new merge's output
rewriter.replaceOp(muxOp, mergeResult);
// Signal that the pattern succeeded in rewriting the mux
return success();
}
};
This rewrite pattern, called ReplaceMuxWithMerge
, matches on operations of type handshake::MuxOp
(the mux operation is part of the Handshake dialect) as indicated by its declaration. Eahc time the greedy pattern rewriter finds a mux in the IR, it will call the pattern's matchAndRewrite
method, providing it with the particular operation it matched on as well as with a PatternRewriter
object to allow us to modify the IR. For this simple pass, we want to transform all muxes into merges so the rewrite pattern is very short:
- First, we extract the mux's data inputs.
- Then, we create a merge operation at the same location in the IR and with the same data inputs.
- Finally, we tell the rewriter to replace the mux with the merge. This "rewires" the IR by making users of the mux's output channel use the merge's output channel instead, and deleted the original mux.
To complete the pass implementation, we simply have to provide the rewrite pattern to the greedy pattern rewriter. Just add the following call to patterns.add
inside runDynamaticPass
after declaring the pattern set.
RewritePatternSet patterns(ctx);
patterns.add<ReplaceMuxWithMerge>(ctx);
Congratulations! You have now implemented your first Dynamatic pass. We just have two simple file edits to make before we can start using it.
We need to make the build process aware of the new source file we just wrote. Navigate to lib/Transforms/CMakeLists.txt
and add the name of thefile you created in the previous section next to other .cpp
files in the add_dynamatic_library
statement.
add_dynamatic_library(DynamaticTransforms
HandshakeMuxToMerge.cpp # Add this line
ArithReduceStrength.cpp
... # other .cpp files
DEPENDS
...
)
Finally, we need to make Dynamatic's optimizer aware of our new pass. Navigate to include/dynamatic/Transforms/Passes.h
and add the header you wrote a couple of subsections ago to the list of include files.
#ifndef DYNAMATIC_TRANSFORMS_PASSES_H
#define DYNAMATIC_TRANSFORMS_PASSES_H
#include "dynamatic/Transforms/HandshakeMuxToMerge.h" // Add this line
... // other include files
Now that the pass is part of the project's source code, we just have to partially re-build Dynamatic to use it. Simply navigate to the top-level build
directory from the terminal and run ninja
.
cd build && ninja && cd ..
If you see Build successfull
printed on the terminal, then everything worked and the pass is now part of Dynamatic. Let's go modify our compilation script---which is called by the frontend's compile
command---to run it as part of the normal synthesis flow.
Open tools/dynamatic/scripts/compile.sh
and locate the following call to Dynamatic's optimizer.
# handshake transformations
"$DYNAMATIC_OPT_BIN" "$F_HANDSHAKE" \
--handshake-minimize-lsq-usage \
--handshake-concretize-index-type="width=32" \
--handshake-minimize-cst-width --handshake-optimize-bitwidths="legacy" \
--handshake-materialize --handshake-infer-basic-blocks \
> "$F_HANDSHAKE_TRANSFORMED"
exit_on_fail "Failed to apply transformations to handshake" \
"Applied transformations to handshake"
This is a compilation step where we apply a number of optimizations/transformation to our Handshake-level IR for performance and correctness, and is thus a perfect place to insert our new pass. Remember that we declared our pass in Tablegen to be associated with the --handshake-mux-to-merge
optimizer flag. We just have to add the flag to the optimizer call to run our new pass.
# handshake transformations
"$DYNAMATIC_OPT_BIN" "$F_HANDSHAKE" \
--handshake-mux-to-merge \
--handshake-minimize-lsq-usage \
--handshake-concretize-index-type="width=32" \
--handshake-minimize-cst-width --handshake-optimize-bitwidths="legacy" \
--handshake-materialize --handshake-infer-basic-blocks \
> "$F_HANDSHAKE_TRANSFORMED"
exit_on_fail "Failed to apply transformations to handshake" \
"Applied transformations to handshake"
Done! Now you can re-run the same frontend script as earlier (./bin/dynamatic --run tutorials/Introduction/Ch2/loop-multiply.dyn
) to see the results of your work! Note that the circuit still functionally verifies during the simulate
step as the frontend prints [INFO] Simulation succeeded
.
Tip
Notice that all muxes have been turned into merges. Also observe that there are no control merges left in the circuit. Indeed, a control merge is just a merge with an additional index
output indicating which valid data
input was selected. The IR no longer uses any of these index
outputs since muxes have been deleted, so Dynamatic automatically downgraded all control merges to simpler and cheaper merges to save on circuit area.
Surely this will work on all circuits, which will from now on all be smaller than before, right?
Just to be sure, let's try our optimization on a different yet similar C kernel called loop_store
.
// The number of loop iterations
#define N 8
// The kernel under consideration
void loop_store(inout_int_t a[N]) {
for (unsigned i = 0; i < N; ++i) {
unsigned x = i;
if (a[i] == 0)
x = x * x;
a[i] = x;
}
}
You can find the source code of this function in tutorials/Introduction/Ch2/loop_store.c
.
This has the same rough structure as our previous example, except that now the kernel stores the squared iteration index in the array at each iteration where the corresponding array element is 0; otherwise it stores the index itself.
Now run the tutorials/Introduction/Ch2/loop-store.dyn
frontend script. It is almost identical to the previous frontend script we used; its only difference is that it synthesizes loop_store.c
instead of loop_multiply.c
.
./bin/dynamatic --run tutorials/Introduction/Ch2/loop-store.dyn
Observe the frontend's output when running simulate
. You should see the following.
dynamatic> simulate
[INFO] Built kernel for IO gen.
[INFO] Ran kernel for IO gen.
[INFO] Launching Modelsim simulation
[ERROR COMPARE] Token mismatch: [0x00000000] and [0x00000001] are not equal (at transaction id 0).
[FATAL] Simulation failed
That's bad! It means that the content of the kernel's input array a
was different after exceution of the C code and after simulation of the generated VHDL design for it. Our optimization broke something in the dataflow circuit, yielding an incorrect result.
Tip
If you would like, you can make sure that it is indeed our new pass that broke the circuit by removing the --handshake-mux-to-merge
flag from the compile.sh
script and re-running the loop-store.dyn
frontend script. You will see that the frontend prints [INFO] Sumulation succeeded
instead of the failure message we just saw.
Let's go check the simulate
command's output folder to see the content of the array a
before and after the kernel. First, open the file tutorials/Introduction/Ch2/out/sim/INPUT_VECTORS/input_a.dat
. This contains the initial content of array a
before the kernel executes. Each line between the [[transation]]
tags represent one element of the array, in order. As you can see, elements at even indices have value 0 whereas elements at odd indices have value 1.
[[[runtime]]]
[[transaction]] 0
0x00000000
0x00000001
0x00000000
0x00000001
0x00000000
0x00000001
0x00000000
0x00000001
[[/transaction]]
[[[/runtime]]]
Looking back at our C kernel, we then should expect that every element at an even index becomes the square of its index, whereas elements at at odd index become their index. This is indeed what we see in tutorials/Introduction/Ch2/out/sim/C_OUT/output_a.dat
, which stores the array's content after kernel execution.
[[[runtime]]]
[[transaction]] 0
0x00000000
0x00000001
0x00000004
0x00000003
0x00000010
0x00000005
0x00000024
0x00000007
[[/transaction]]
[[[/runtime]]]
Tip
Let's now see what the array a
looks like after simulation of our dataflow circuit. Open tutorials/Introduction/Ch2/out/sim/VHDL_OUT/output_a.dat
and compare it with the C output.
[[[runtime]]]
[[transaction]] 0
0x00000001
0x00000000
0x00000003
0x00000004
0x00000005
0x00000010
0x00000007
0x00000024
[[/transaction]]
[[[/runtime]]]
This is significantly different! It looks like elements are shuffled compared to the expected output, as if they were being reordered by the circuit. Let's look at the dataflow visualizer on this new dataflow circuit and try to find out what happened.
Tip
As the simulation's output indicates, the array's content is wrong even at the first iteration. We expect 0 to be stored in the array but instead we get a 1. To debug this problem, iterate through the simulation's cycles and locate the first time that the store port (mc_store0
) transfers a token to the memory controller (mem_controller0
). Then, from the circuit's structure, infer which input to the mc_store0
node is the store address, and which is the store data.
We are especially interested in the store's data input, since it is the one feeding incorrect tokens into the array.
Tip
Once you have identified the store's data input and the first cycle at which it transfers a token to the memory controller, backtrack through cycles to see where the data token came from. You should notice something that should not be happening there. Remember that this is the first time the store transmits to the memory so the data token is supposed to come from the multiplier (mul1
) since a[0] := 0
at the beginning. Also remember that the issue must ultimately come from a merge, since those are the only components we modified with our pass.
By replacing the mux previosuly in the place of merge10
, we caused data tokens to arrive reordered at the store port, hence creating incorrect writes to memory! This is due to the fact that the loop's throughput is much higher when the if branch is not taken, since the multiplier has a latency of 4 cycles while most of our other components have 0 sequential latency.
Let's go verify that we are correct by modifying manually the IR that ultimately gets transformed into the dataflow circuit and re-simulating. Open the tutorials/Introduction/Ch2/out/comp/handshake_export.mlir
MLIR file. It contains the last version of MLIR-formatted IR that gets transformed into a Graphviz-formatted file and then in a VHDL design. While the syntax may be a bit daunting at first, do not worry as we will only modify two lines to "revert" the transformation of the mux into merge10
. The tutorial's goal is not to teach you MLIR syntax, so we will not go into details into how the IR is formatted in text. To give you an idea, the syntax of an operation is usually as follows.
<SSA results> = <operation name> <SSA operands> {<operation attributes>} : <return types>
Back to our faulty IR; on line 31, you should see the following.
%23 = merge %22, %16 {bb = 3 : ui32, name = #handshake.name<"merge10">} : i10
As the name
operation attribute indicates, this is the faulty merge10
we identified in the visualizer. Replace the entire line with an equivalent mux.
%23 = mux %muxIndex [%22, %16] {bb = 3 : ui32, name = #handshake.name<"my_mux">} : i1, i10
Before the square brackets is the mux's select
operand: %muxIndex
. This SSA value currently does not exist in the IR, since it used to come from block 3's control merge that has since then been downgraded to a simple merge due to its index
output becoming unused. Let's upgrade it again, it is located on line 40.
%32 = merge %trueResult_2, %falseResult_3 {bb = 3 : ui32, name = #handshake.name<"merge2">} : none
Replace it with
%32, %muxIndex = control_merge %trueResult_2, %falseResult_3 {bb = 3 : ui32, name = #handshake.name<"my_control_merge">} : none, i1
And you are done! For convenience we provide a little shell script that will only run the part of the synthesis flow that comes after this file is generated. It will regenerate the VHDL design from the MLIR file, simulate it, and open the visualizer. From Dynamatic's top-level folder, run the provided shell script
./tutorials/Introduction/Ch2/partial-flow.sh
You should now see that simulation succeeds!
Tip
Study the fixed circuit in the visualizer to confirm that a mux is indeed necessary to ensure proper ordering of data tokens to the store port.
As we just saw, our pass does not work in every situation. While it is possible to replace some muxes by merges when there is no risk of token re-ordering, this is not true in general for all merges. You would need to design a proper strategy to identify which muxes can be transformed into simpler merges and which are necessary to ensure correct circuit behavior. If you ever design such an algorithm, please consider making a pull request to Dynamatic! We accept contibutions ;)