Describe the bug
A linear layer or activation function should allocate at most one memory block, and buffers can potentially be reused across layers. However, the current builder allocates a separate memory block for each intermediate operation such as transpose or broadcast. Presumably this overhead can be eliminated by fusing `linalg.transpose` with `linalg.fill` and `linalg.matmul`. I tried rewriting `mlp.py` using for loops and succeeded in using only 2 allocations.
To Reproduce
Run `mlp.py` with `monitor_memory` enabled and without `enable_tensor`. The total number of allocations is ten.
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
| name | shape | dtype | mem(bits) | BRAM(18K) | store counts | data storage |
+===========+==========+=========+=============+=============+================+============================================================================+
| %alloc | [30, 30] | f32 | 28800 | 1.6384e+06 | 1 | %4 = memref.load %0[%arg1, %arg2] : memref<30x30xf32> |
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
| %alloc_3 | [30, 30] | f32 | 28800 | 1.6384e+06 | 1 | %8 = arith.addf %6, %7 : f32 |
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
| %alloc_10 | [30, 30] | f32 | 28800 | 1.6384e+06 | 1 | %4 = memref.load %1[%arg2] : memref<30xf32> |
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
| %alloc_14 | [30, 30] | f32 | 28800 | 1.6384e+06 | 1 | %6 = arith.addf %4, %5 : f32 |
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
| %alloc_21 | [30, 30] | f32 | 28800 | 1.6384e+06 | 1 | %4 = memref.load %2[%arg1, %arg2] : memref<30x30xf32> |
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
| %alloc_28 | [30, 30] | f32 | 28800 | 1.6384e+06 | 1 | %8 = arith.addf %6, %7 : f32 |
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
| %alloc_35 | [30, 30] | f32 | 28800 | 1.6384e+06 | 1 | %4 = memref.load %3[%arg2] : memref<30xf32> |
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
| %alloc_39 | [30, 30] | f32 | 28800 | 1.6384e+06 | 1 | %6 = arith.addf %4, %5 : f32 |
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
| %alloc_46 | [30, 30] | f32 | 28800 | 1.6384e+06 | 1 | %6 = arith.maxf %4, %5 : f32 |
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
| %alloc_50 | [30, 30] | f32 | 28800 | 1.6384e+06 | 0 | |
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
| Total(10) | | | 288000 | 1.6384e+07 | | *data storage: data stored into an allocated memory. Doesn't include init. |
+-----------+----------+---------+-------------+-------------+----------------+----------------------------------------------------------------------------+
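As a quick sanity check on the mem(bits) column, the per-buffer and total sizes can be recomputed from the shape and dtype. This is only a sketch: `buffer_bits` is a hypothetical helper, not part of the memory monitor, and the figure for the two-allocation case assumes both remaining buffers would also be 30x30 f32.

```python
def buffer_bits(shape, bits_per_element):
    """Size in bits of one dense buffer with the given shape and element width."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bits_per_element

# One [30, 30] f32 buffer, as in every row of the table.
per_buffer = buffer_bits((30, 30), 32)

# Ten allocations reported by the monitor.
total_reported = 10 * per_buffer

# Assumption: a two-allocation version keeps two 30x30 f32 buffers.
total_two_buffers = 2 * per_buffer

print(per_buffer, total_reported, total_two_buffers)
```

Under that assumption, eight of the ten buffers are intermediates for transpose, broadcast, and elementwise results that a fused implementation would not need to materialize.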
Expected behavior
If the FFN is rewritten with for loops, there are only 2 allocations, which means the current builder is far from optimal.
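A minimal plain-Python sketch of how a loop-based FFN can get by with two buffers. It assumes the FFN is two linear layers followed by a ReLU (inferred from the table, not confirmed against `mlp.py`), and the names `alloc`, `linear_into`, and `ffn_two_buffers` are hypothetical. It is not written against the builder's API; it only illustrates that indexing the weight as `w[j][k]` and the bias as `b[j]` inside the loop removes the need for separate transpose and broadcast buffers.

```python
def alloc(rows, cols):
    """Stand-in for one memory block (a single buffer allocation)."""
    return [[0.0] * cols for _ in range(rows)]

def linear_into(x, w, b, out):
    """Compute out = x @ w^T + b, writing directly into the preallocated `out`."""
    for i in range(len(x)):
        for j in range(len(w)):
            acc = b[j]                    # bias read by index: no broadcast buffer
            for k in range(len(w[j])):
                acc += x[i][k] * w[j][k]  # w read as w[j][k]: no transpose buffer
            out[i][j] = acc

def ffn_two_buffers(x, w1, b1, w2, b2):
    """Two linear layers followed by ReLU, using exactly two buffers."""
    hidden = alloc(len(x), len(w1))   # allocation 1: output of the first layer
    out = alloc(len(x), len(w2))      # allocation 2: output of the second layer
    linear_into(x, w1, b1, hidden)
    linear_into(hidden, w2, b2, out)
    for row in out:                   # ReLU applied in place, no extra buffer
        for j in range(len(row)):
            row[j] = max(row[j], 0.0)
    return out

# Small usage example with hand-checkable values.
result = ffn_two_buffers(
    [[1.0, 2.0]],
    [[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5],
    [[1.0, 1.0], [-1.0, -1.0]], [0.0, 0.0],
)
print(result)
```

The accumulator starts from the bias and the activation overwrites the output buffer, so each layer needs only its own output block. This matches the expectation of at most one memory block per layer.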