T.tile.reduce_max(out: Buffer, buffer: Buffer, tmp: Buffer, dim: int)
The T.tile.reduce_xxx primitives require a tmp buffer to be passed in so that they can function properly.
Currently, the shape allocated for tmp varies across the examples, and it is unclear how the appropriate shape should be determined.
For example:
In examples/normalization/layer_norm.py
tmp_ub = T.alloc_ub([3 * DataType(dtype).bits // 8 * block_M // VEC_NUM * block_N], "uint8")
In examples/softmax/example_online_softmax.py
tmp = T.alloc_ub([2 * sub_block_M * block_N], "uint8")
In examples/lightning_indexer/example_lightning_indexer.py
mm_res_ub_uint8 = T.alloc_ub((VECTOR_BASEG, VECTOR_BASEN), "uint8")
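To make the inconsistency concrete, here is a rough sketch in plain Python that evaluates the three allocation expressions above for the same tile. The helper names and all concrete values (a 16-bit dtype, block_M = 64, VEC_NUM = 2, a 32 x 128 tile) are assumptions chosen only for illustration. They are not taken from the examples, and the sketch does not call any Tilelang-Ascend API.

```python
# Byte sizes of the tmp buffers quoted above, evaluated for one illustrative tile.
# All concrete numbers below are assumptions, not values from the examples.

def layer_norm_tmp_bytes(dtype_bits, block_M, vec_num, block_N):
    # Mirrors: 3 * DataType(dtype).bits // 8 * block_M // VEC_NUM * block_N
    return 3 * dtype_bits // 8 * block_M // vec_num * block_N

def online_softmax_tmp_bytes(sub_block_M, block_N):
    # Mirrors: 2 * sub_block_M * block_N
    return 2 * sub_block_M * block_N

def lightning_indexer_tmp_bytes(vector_baseg, vector_basen):
    # Mirrors: (VECTOR_BASEG, VECTOR_BASEN) with dtype uint8
    return vector_baseg * vector_basen

rows, cols = 32, 128  # assumed tile: 32 rows x 128 columns
sizes = {
    # block_M // VEC_NUM == rows for the assumed values 64 and 2
    "layer_norm": layer_norm_tmp_bytes(16, 64, 2, cols),
    "online_softmax": online_softmax_tmp_bytes(rows, cols),
    "lightning_indexer": lightning_indexer_tmp_bytes(rows, cols),
}

for name, nbytes in sizes.items():
    per_position = nbytes / (rows * cols)
    print(f"{name}: {nbytes} bytes ({per_position:.0f} bytes per tile position)")
```

Under these assumed values the three expressions give 24576, 8192, and 4096 bytes for the same 32 x 128 tile, i.e. 6, 2, and 1 bytes per tile position, so the examples do not appear to follow a common sizing rule.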
AscendC provides the GetReduceSumMaxMinTmpSize API for estimating the tmp size that must be allocated for the AscendC::ReduceXXX APIs.
However, this API is currently not accessible from Tilelang-Ascend.