- Support FP32 and BF16.
- Block layout first, then extend to plain layout.
- Default config to cover shapes used in LLAMA2.