cuBLASDx Library - API Examples

All examples, including the more advanced ones, are shipped with the cuBLASDx package.

This folder demonstrates how to use the cuBLASDx APIs.

You may specify CUBLASDX_CUDA_ARCHITECTURES to limit the CUDA architectures used for compilation (see CMake: CUDA_ARCHITECTURES).
Set mathdx_ROOT to the path of the mathDx package, where XX.Y is the version of the package.

mkdir build && cd build
cmake -DCUBLASDX_CUDA_ARCHITECTURES=70-real -Dmathdx_ROOT=/opt/nvidia/mathdx/XX.Y ..
make
# Run
ctest

For detailed descriptions of the examples, see the Examples section of the cuBLASDx documentation.

| Group | Subgroup | Example | Description |
|---|---|---|---|
| Introduction Examples | | introduction_example | cuBLASDx API introduction example |
| Simple GEMM Examples | Basic Example | simple_gemm_fp32 | Performs fp32 GEMM |
| Simple GEMM Examples | Basic Example | simple_gemm_cfp16 | Performs complex fp16 GEMM |
| Simple GEMM Examples | Extra Examples | simple_gemm_leading_dimensions | Performs GEMM with non-default leading dimensions |
| Simple GEMM Examples | Extra Examples | simple_gemm_std_complex_fp32 | Performs GEMM with cuda::std::complex as data type |
| NVRTC Examples | | nvrtc_gemm | Performs GEMM; the kernel is compiled using NVRTC |
| GEMM Performance | | single_gemm_performance | Benchmark for a single GEMM |
| GEMM Performance | | fused_gemm_performance | Benchmark for 2 GEMMs fused into a single kernel |
| Advanced Examples | Fusion | gemm_fusion | Performs 2 GEMMs in a single kernel |
| Advanced Examples | Fusion | gemm_fft | Performs GEMM and FFT in a single kernel |
| Advanced Examples | Fusion | gemm_fft_fp16 | Performs GEMM and FFT in a single kernel (half-precision complex type) |
| Advanced Examples | Fusion | gemm_fft_performance | Benchmark for GEMM and FFT fused into a single kernel |
| Advanced Examples | Deep Learning | scaled_dot_prod_attn | Scaled dot product attention using cuBLASDx |
| Advanced Examples | Deep Learning | scaled_dot_prod_attn_batched | Multi-head attention using cuBLASDx |
| Advanced Examples | Other | multiblock_gemm | Proof of concept for a single large GEMM using multiple CUDA blocks |
| Advanced Examples | Other | batched_gemm_fp64 | Manual batching in a single CUDA block |
| Advanced Examples | Other | blockdim_gemm_fp16 | BLAS execution with different block dimensions |
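
Most of the examples above compute a GEMM of the form C = alpha * A * B + beta * C, and simple_gemm_leading_dimensions does so with non-default leading dimensions. As a rough illustration of what a leading dimension means, here is a minimal host-side sketch in plain C++. It is not cuBLASDx code and is not taken from reference.hpp; the function name `reference_gemm`, its signature, and the column-major storage are assumptions made for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, NOT a cuBLASDx API.
// Computes C = alpha * A * B + beta * C, assuming column-major storage:
//   A is m x k with leading dimension lda (lda >= m)
//   B is k x n with leading dimension ldb (ldb >= k)
//   C is m x n with leading dimension ldc (ldc >= m)
// Element (row, col) of a matrix with leading dimension ld is stored at
// index row + col * ld, so ld > rows leaves unused padding in each column.
void reference_gemm(std::size_t m, std::size_t n, std::size_t k,
                    float alpha,
                    const std::vector<float>& a, std::size_t lda,
                    const std::vector<float>& b, std::size_t ldb,
                    float beta,
                    std::vector<float>& c, std::size_t ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    assert(a.size() >= lda * k && b.size() >= ldb * n && c.size() >= ldc * n);

    for (std::size_t col = 0; col < n; ++col) {
        for (std::size_t row = 0; row < m; ++row) {
            float acc = 0.0f;
            // Dot product of row `row` of A with column `col` of B.
            for (std::size_t i = 0; i < k; ++i) {
                acc += a[row + i * lda] * b[i + col * ldb];
            }
            c[row + col * ldc] = alpha * acc + beta * c[row + col * ldc];
        }
    }
}
```

With lda equal to m, the matrix is densely packed. A larger lda skips padding elements at the end of each column, which the loop above never reads.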