Thrust → CUDA → PTX → SASS (#25)
* Add: Thrust, CUB, CUDA sorting
This is a draft. It still lacks manual
timing and async scheduling.
* Make: Options for CUDA & TBB in CMake
* Make: Switch to CUDA Toolkit for GPU libs
* Fix: Ranges require `constexpr` on NVCC
* Make: Upgrade `fmt` for NVCC builds
fmtlib/fmt#4297
* Fix: NVCC compilation issues
* Make: Silence NVCC warnings
* Add: Sorting with `thrust` and `cub`
* Add: PTX and `.cuh` kernels
* Make: Don't compile PTX
* Add: Using CUDA Driver API to JIT `.ptx`
* Add: Precompiled CUDA C++ kernels
* Add: cuBLAS benchmarks
* Fix: Compiling `cuBLAS` calls
* Fix: Avoid optimizing-out SASS code
Unless we put an impossible condition with
a `wmma::store_matrix_sync` the result of
fragment multiplication is optimized out.
* Add: Tensor Core intrinsic benchmarks
Targeting `f16`, `bf16`, `tf32`, `f32`, `f64`
on Volta, Turing, and Ampere.
* Make: Build CUDA for multiple platforms
Currently covering Volta, Turing, Ampere,
Ada Lovelace, and Hopper.
* Add: Binary BMMA kernels for GPU
XOR variant for Turing+.
AND variant for Ampere+.
* Docs: Introduce Warp-Group-MMA on Hopper
* Fix: Working PTX kernel
* Fix: Lower PTX version for JIT
* Fix: Use `f16` MMA
* Make: Drop OpenBLAS
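
Note on "Fix: Avoid optimizing-out SASS code": a minimal sketch of the guard described there, not the exact kernel from this change. The kernel name, the `sink` parameter, and the `repetitions` parameter are hypothetical. It assumes the guard depends on a runtime kernel argument, so the compiler cannot prove the store is dead and has to keep the `wmma::mma_sync` chain in the generated SASS. Launching with `sink == nullptr` means the store never executes.

```cuda
#include <cuda_fp16.h> // `half`, `__float2half`
#include <mma.h>       // `nvcuda::wmma`, requires sm_70 or newer

// Hypothetical benchmark kernel: one warp repeats a 16x16x16 `f16` MMA.
// Launch with at least one full warp and `sink == nullptr`.
__global__ void wmma_f16_guarded_sketch(half *sink, int repetitions) {
    using namespace nvcuda;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c;

    wmma::fill_fragment(a, __float2half(1.0f));
    wmma::fill_fragment(b, __float2half(1.0f));
    wmma::fill_fragment(c, __float2half(0.0f));

    // The work being measured: a dependent chain of tensor-core multiplications.
    for (int i = 0; i != repetitions; ++i) wmma::mma_sync(c, a, b, c);

    // Never taken when launched with `sink == nullptr`, but the compiler
    // cannot know that, so `c` stays observable and the loop above survives.
    // The condition is uniform across the warp, as `store_matrix_sync` requires.
    if (sink != nullptr) wmma::store_matrix_sync(sink, c, 16, wmma::mem_row_major);
}
```

Without the guarded store, nothing reads `c`, and the whole loop can be dropped as dead code. Inspecting the output of `cuobjdump -sass` is one way to confirm the MMA instructions are still present.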