Release Less Slow v0.6: Thrust → CUDA → PTX → SASS 🏋️‍♂️🏋️‍♀️ · ashvardanian/less_slow.cpp

It's almost impossible to imagine modern High-Performance Computing without GPUs. Yet, there are surprisingly few "full stack" demos out there for folks wanting to build intuition around CUDA C++, PTX Intermediate Representations, SASS Assembly, and higher-level libraries like Thrust, CUB, or the various cuBLAS flavors. This new release of Less Slow covers all of those! 🥳

Tensor Cores

The main highlight is an in-depth look at Tensor Core designs, from their extensive type system to the complexity of tile shapes—notoriously under-documented and confusing areas. These capabilities differ across Volta, Turing, Ampere, Ada, and Hopper GPUs, mapping to different PTX intrinsics (like wmma, binary bmma, or warp-group wgmma) and culminating in yet another shape at the SASS level with instructions such as multiple HMMA.884.F32.F32.STEPx instructions for each wmma.mma.sync.aligned.row.col.m16n16k16.f32.f32 intrinsic on Volta. And if you believe that instruction is long... be warned 😅

__global__ void tops_f16f16_sm70tc_16x16x16_1024unroll_cuda_kernel() {
    using namespace nvcuda;
    wmma::fragment<wmma::matrix_a, 16,16,16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16,16,16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16,16,16, half> c_frag;
    for (int i = 0; i < 1024; ++i)
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
}

$ cuobjdump -sass less_slow_from_cu.cubin | grep -i mma
# e.g. HMMA.884.F32.F32.STEP2 ...

This indicates the 8×8×4 shape actually used by the hardware on Volta.

PTX vs SASS

I've also hand-written PTX kernels, that may look like:

.visible .entry tops_f16f16_sm70tc_16x16x16_1024loop_ptx_kernel()
{
  // ...
  loop_start:
    // A single wmma instruction
    wmma.mma.sync.aligned.row.col.m16n16k16.f16.f16
      { %f0, %f1, %f2, %f3 }, // output accumulators
      { %f4, ... },          // A
      { %f12, ... },         // B
      { %f0, %f1, %f2, %f3 }; // input accumulators
    // ...
  bra loop_start;
}

Using the provided scripts, you can see for yourself just how different manually written vs. machine-generated PTX can be and how to invoke kernels directly from C++ in various ways — whether through the CUDA Runtime API or the CUDA Driver API — loading and JIT-compiling bits of PTX on the fly!

cuInit(0);
CUdevice dev; cuDeviceGet(&dev, 0);
CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
CUmodule mod; cuModuleLoad(&mod, "less_slow.ptx");
CUfunction fun; cuModuleGetFunction(&fun, mod, "tops_f16f16_sm70tc_16x16x16_1024loop_ptx_kernel");

void* args[] = { /* kernel parameters here */ };
cuLaunchKernel(fun,
               1, 1, 1,  // gridDim
               256, 1, 1,// blockDim
               0, nullptr, args, nullptr);
cuCtxSynchronize();
cuModuleUnload(mod);
cuCtxDestroy(ctx);

cuBLAS on Practice

I've also included theoretical throughput benchmarks alongside real matrix multiplications via cuBLAS in case you want to compare actual performance to the raw theoretical numbers. One important observation here may be the lack of low-resolution numeric types:

if constexpr (std::is_same_v<scalar_type_, float>) {
    scalar_type_ alpha = 1, beta = 0;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a.begin(), lda, b.begin(), ldb, &beta, c.begin(), ldc);
} else if constexpr (std::is_same_v<scalar_type_, double>) {
    scalar_type_ alpha = 1, beta = 0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a.begin(), lda, b.begin(), ldb, &beta, c.begin(), ldc);
} else if constexpr (std::is_same_v<scalar_type_, __half>) {
    scalar_type_ alpha = 1, beta = 0;
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a.begin(), lda, b.begin(), ldb, &beta, c.begin(), ldc);
} else if constexpr (std::is_same_v<scalar_type_, int8_t>) {
    int32_t alpha_int = 1, beta_int = 0;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha_int, a.begin(), CUDA_R_8I, lda, b.begin(), CUDA_R_8I, ldb, &beta_int, c.begin(), CUDA_R_32I, ldc, CUDA_R_32I, CUBLAS_GEMM_DEFAULT);
}

Even integer kernels have a different signature, requiring $Alpha$ and $Beta$ to match the accumulator type, rather than the inputs. Very few libraries have adaptations for binary matrices and or sub-byte representations.

Beyond Linear Algebra

Since GPUs obviously go beyond linear algebra, Thrust and CUB are perfect for exploring other domains in heterogeneous computing. I’ve added snippets that mostly revolve around sorting algorithms, showcasing the differences in memory management between Thrust and CUB and explaining why CUB calls often come in pairs, like:

size_t temp_size = 0;
void *d_temp = nullptr;
cub::DeviceRadixSort::SortKeys(nullptr, temp_size, d_in_keys, d_out_keys, count);
cudaMalloc(&d_temp, temp_size);
cub::DeviceRadixSort::SortKeys(d_temp, temp_size, d_in_keys, d_out_keys, count);

This was also a good place to show how Thrust and CUB operations can be scheduled together on the same asynchronous streams and profiled with GPU time instead of CPU time to avoid unnecessary blocking ⏲️

Enjoy exploring, and happy GPU hacking! I’ll keep adding to this project (and other related ones) as we go along!

Changelog

Add: Binary BMMA kernels for GPU (6a609a0)
Add: Tensor Core intrinsic benchmarks (1bdb5df)
Add: cuBLAS benchmarks (2f791fe)
Add: Precompiled CUDA C++ kernels (c1a6f3e)
Add: Using CUDA Driver API to JIT .ptx (82cb684)
Add: PTX and .cuh kernels (824e473)
Add: Sorting with thrust and cub (df3b2c1)
Add: Thrust, CUB, CUDA sorting (551402d)
Add: Thrust, CUB, CUDA sorting (8481114)
Make: Drop OpenBLAS (3c92c36)
Fix: Use f16 MMA (141d285)
Fix: Lower PTX version for JIT (eff3854)
Fix: Working PTX kernel (514db0f)
Docs: Introduce Warp-Group-MMA on Hopper (400f294)
Make: Build CUDA for multiple platforms (3283ab0)
Fix: Avoid optimizing-out SASS code (986b8bc)
Fix: Compiling cuBLAS calls (312409a)
Make: Don't compile PTX (53202e6)
Make: Silence NVCC warnings (a6cdc74)
Fix: NVCC compilation issues (494e705)
Make: Upgrade fmt for NVCC builds (88277bf)
Fix: Ranges require constexpr on NVCC (c1d7b2f)
Make: Switch to CUDA Toolkit for GPU libs (2589a40)
Make: Options for CUDA & TBB in CMake (4d03c08)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Less Slow v0.6: Thrust → CUDA → PTX → SASS 🏋️‍♂️🏋️‍♀️