ThunderKittens Educational GEMM Kernels

This folder builds up an H100 GEMM kernel piece by piece. It is for educational purposes only.

Set the LEVEL field in the Makefile to a value from 01 to 08, then run `make clean && make run`.

Level 01: Simple for loop (float). This is faster than the bf16 version in Level 02 because, on CUDA cores, bf16 operands are implicitly converted to float before the arithmetic.
Level 02: Simple for loop (bf16)
Level 03: Use shared memory
Level 04: Use tensor cores (WMMA)
Level 05: Use tensor cores (WGMMA)
Level 06: Use the Tensor Memory Accelerator (TMA) and double buffering
Level 07: Use work partitioning
Level 08: Use multiple consumer warpgroups
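
To make the first steps of this progression concrete, here is a minimal sketch of the ideas behind Level 01 (one thread per output element, plain for loop) and Level 03 (staging tiles in shared memory). It is not the code from level_01.cu or level_03.cu: the kernel names `gemm_naive` and `gemm_tiled`, the helper `run_check`, the tile size of 16, the row-major layout, and the matrix sizes are all assumptions chosen for illustration, and it uses plain CUDA in float rather than ThunderKittens types.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile size, not taken from the repo

// Level 01 idea: each thread computes one element of C with a plain for loop.
// A is M x K, B is K x N, C is M x N, all row-major (an assumption).
__global__ void gemm_naive(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Level 03 idea: each block stages TILE x TILE tiles of A and B in shared
// memory, so every global value is loaded once per block instead of once
// per thread. Must be launched with a TILE x TILE thread block.
__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int k0 = 0; k0 < K; k0 += TILE) {
        int ak = k0 + threadIdx.x;  // column of A loaded by this thread
        int bk = k0 + threadIdx.y;  // row of B loaded by this thread
        // Out-of-range entries are zero-padded so edge tiles stay correct.
        As[threadIdx.y][threadIdx.x] = (row < M && ak < K) ? A[row * K + ak] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bk < K && col < N) ? B[bk * N + col] : 0.0f;
        __syncthreads();  // tiles fully written before anyone reads them
        for (int k = 0; k < TILE; ++k) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all reads done before the next tiles overwrite
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Runs both kernels on the same inputs and returns the largest absolute
// difference between their outputs, or -1 if a CUDA call failed.
float run_check() {
    const int M = 128, N = 96, K = 80;  // K is deliberately not a multiple of TILE
    std::vector<float> hA(M * K), hB(K * N), hNaive(M * N), hTiled(M * N);
    for (int i = 0; i < M * K; ++i) hA[i] = float(i % 7) - 3.0f;
    for (int i = 0; i < K * N; ++i) hB[i] = float(i % 5) - 2.0f;

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hNaive.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);

    gemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(hNaive.data(), dC, hNaive.size() * sizeof(float), cudaMemcpyDeviceToHost);
    gemm_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(hTiled.data(), dC, hTiled.size() * sizeof(float), cudaMemcpyDeviceToHost);

    bool ok = (cudaGetLastError() == cudaSuccess);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (!ok) return -1.0f;

    float max_diff = 0.0f;
    for (int i = 0; i < M * N; ++i)
        max_diff = std::fmax(max_diff, std::fabs(hNaive[i] - hTiled[i]));
    return max_diff;
}

int main() {
    printf("max |naive - tiled| = %g\n", run_check());
    return 0;
}
```

Assuming the sketch is saved as sketch.cu (a hypothetical file name), it should build with `nvcc -o sketch sketch.cu` on any CUDA-capable GPU. The later levels (WMMA, WGMMA, TMA, work partitioning, multiple consumer warpgroups) rely on Hopper-specific features and ThunderKittens abstractions, so they are best read directly in the level files.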

What would you want to add next for peak performance?
