CUDA Progress
Day | Code Summary |
---|---|
Day 1 | CUDA set up and kernel that prints "Hello World" |
Day 2 | CUDA kernel that adds two vectors |
Day 3 | Adding matrices |
Day 4 | Vector addition using cuBLAS |
Day 5 | Naive matmul |
Day 6 | Tiled matmul using shared memory (sketch after the table) |
Day 7 | Naive 1D convolution with boundary checks |
Day 8 | Matrix multiplication using cuBLAS |
Day 9 | Matrix Transpose |
Day 10 🥳 | Naive Softmax |
Day 11 | Softmax using shared memory and reductions |
Day 12 | Softmax using warp shuffle functions (sketch after the table) |
Day 13 | 1D complex-to-complex Fourier transform using cuFFT |
Day 14 | Naive layer normalization |
Day 15 | Optimizing layer norm using shared memory |
Day 16 | Optimizing layer norm using warp shuffle functions |
Day 17 | Optimizing layer norm using vectorized loads |
Day 18 | Tiled 1D convolution with halo cells |
Day 19 | 1D convolution using L2 cache |
Day 20 🥳 | Blog Post: Optimizing Layer Normalization with CUDA |
Day 21 | Simple self attention |
Day 22 | Optimizing self attention |
Day 23 | Causal attention with masking |
Day 24 | Causal attention + torch binding |
Day 25 | Multi-head attention |
Day 26 | Parallel prefix sum using the Kogge-Stone algorithm (sketch after the table) |
Day 27 | MHA debug |
Day 28 | Flash Attention 1 (algorithm 1) Forward pass |
Day 29 | Flash Attention 1 (algorithm 1) Forward pass continued |
Day 30 🥳 | Flash Attention 1 (algorithm 1) Forward pass |
Day 31 | HGEMV matvec using fp16 |
Day 32 | HGEMV matvec using bfloat16 |
Day 33 | Matmul using Tensor cores |
Day 34 | Swizzle patterns on matrix transpose |
Day 35 | Swizzled matrix transpose using Tensor Memory Accelerators |
Day 36 | Brent-Kung parallel scan algorithm |
Day 37 | Matvec using integer fixed-point arithmetic |
Day 38 | Transferred a 1D array from gmem->smem->gmem using TMA |
Day 39 | Memory-coalesced layer norm + revisited Flash Attention |
Day 40 🥳 | Revisited Flash Attention 1 |
Day 41 | Flash Attention 1 |
Day 42 | Flash Attention 1 |
Day 43 | ReLU activation - FP32, FP32x4, FP16, FP16x2 vectorized (sketch after the table) |
Day 44 | Overlapping data transfers using CUDA Streams (Vector add) |
Day 45 | ReLU using CUDA Streams + benchmarked |
Day 46 | Packed 128 bit ReLU FP16x8 kernel |
Day 47 | Sparse matrix-vector mul (SpMV) |
Day 48 | Sparse padded matrix-vector mul |
Day 49 | RoPE Kernel: Rotary Position Embedding naive fp32 |
Day 50 🥳 | Optimized RoPE using vectorized loads and half precision (18x speedup) |
Day 51 | Flash Attention 2 Forward |
Day 52 | Flash Attention 2 Forward |
Day 53 | Flash Attention 2 Forward |
Day 54 | Gaussian Elimination |
Day 55 | PTX vector add kernel |
Day 56 | GELU activation naive fp32 kernel |
Day 57 | GELU activation vectorized |
Day 58 | Backward pass kernel for ReLU activation |
Day 59 | Backward pass kernel for GELU activation |
Day 60 🥳 | LeetGPU challenge - reduction |
Day 61 | Optimized + benchmarked GELU kernels |
Day 62 | Micrograd in CUDA |
Day 63 | Micrograd in CUDA |
Day 64 | Micrograd in CUDA |
Day 65 | Micrograd in CUDA |
Day 66 | Optimized Sigmoid activation |
Day 67 - Day 70 🥳 | Micrograd in CUDA |
Day 71 | Sigmoid with half precision |
Day 72 | Sigmoid with fp16 vectorized |
Day 73 | Swish kernel |
Day 74 | Swish kernel vectorized |
Day 75 | AMD HIP kernel intro + vector add kernel |
Day 76 | Revisiting GEMM optimizations |
Day 77 | Coalesced GEMM |
Day 78 | fp16 Swish |
Day 79 | AMD competition fp8 GEMM & Swish optimizations |
Day 80 🥳 | AMD competition fp8 GEMM optimizations |
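
Kernel sketches

A few simplified sketches of kernels mentioned above. These are illustrative re-creations, not the repository's actual code; kernel names, tile sizes, and launch configurations are assumptions.

First, a minimal version of the Day 6 tiled matmul: each block stages a TILE x TILE sub-tile of A and B in shared memory, accumulates partial products, then slides along the K dimension.

```cuda
#include <cuda_runtime.h>

#define TILE 16

// C = A * B with A (M x K), B (K x N), C (M x N), all row-major fp32.
// Launch: dim3 block(TILE, TILE); dim3 grid((N+TILE-1)/TILE, (M+TILE-1)/TILE);
__global__ void tiled_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Guarded loads so sizes need not be multiples of TILE.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```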
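
Next, a warp-per-row softmax in the spirit of Day 12: each 32-thread warp reduces the row max and the row sum with `__shfl_xor_sync`, so the reductions need no shared memory or `__syncthreads()`.

```cuda
#include <cuda_runtime.h>
#include <math.h>

__device__ __forceinline__ float warp_reduce_max(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v = fmaxf(v, __shfl_xor_sync(0xffffffff, v, offset));
    return v;
}

__device__ __forceinline__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset);
    return v;
}

// One warp per row. Launch: softmax_warp<<<rows, 32>>>(in, out, rows, cols);
__global__ void softmax_warp(const float* in, float* out, int rows, int cols) {
    int row = blockIdx.x;
    if (row >= rows) return;
    const float* x = in + (size_t)row * cols;
    float* y = out + (size_t)row * cols;

    // 1) row max for numerical stability
    float m = -INFINITY;
    for (int c = threadIdx.x; c < cols; c += 32)
        m = fmaxf(m, x[c]);
    m = warp_reduce_max(m);

    // 2) sum of exponentials
    float s = 0.0f;
    for (int c = threadIdx.x; c < cols; c += 32)
        s += __expf(x[c] - m);
    s = warp_reduce_sum(s);

    // 3) normalize
    for (int c = threadIdx.x; c < cols; c += 32)
        y[c] = __expf(x[c] - m) / s;
}
```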
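
A single-block Kogge-Stone inclusive scan in the spirit of Day 26: at each step every thread adds the element `stride` positions to its left, doubling `stride` until it spans the block. This sketch only produces per-block prefix sums; combining blocks would need a second pass.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256

// Launch: kogge_stone_scan<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(in, out, n);
__global__ void kogge_stone_scan(const float* in, float* out, int n) {
    __shared__ float buf[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        float v = 0.0f;
        if (threadIdx.x >= stride)
            v = buf[threadIdx.x - stride];
        __syncthreads();            // finish all reads before any writes
        if (threadIdx.x >= stride)
            buf[threadIdx.x] += v;
        __syncthreads();            // finish all writes before the next round
    }

    if (i < n)
        out[i] = buf[threadIdx.x];  // inclusive prefix sum within this block
}
```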
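
Finally, a vectorized fp16x2 ReLU in the spirit of Day 43: each thread handles two half values packed in a `__half2`, masking out negatives with `__hgt2` (on sm_80+ the `__hmax2` intrinsic would be a more direct choice).

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// n2 is the number of __half2 elements, i.e. half the number of scalars.
// Launch: relu_half2<<<(n2 + 255) / 256, 256>>>(in, out, n2);
__global__ void relu_half2(const __half2* in, __half2* out, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        __half2 x = in[i];
        __half2 zero = __float2half2_rn(0.0f);
        // __hgt2 yields 1.0 in lanes where x > 0 and 0.0 elsewhere;
        // multiplying by that mask zeroes the negative lanes.
        out[i] = __hmul2(x, __hgt2(x, zero));
    }
}
```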