
CUDA Programming 🦕🌊


My CUDA Programming learning journey.

Basics

The basic flow of a CUDA program is:

  1. Initialize data on the CPU.
  2. Transfer data from CPU to GPU.
  3. Launch the necessary kernel executions on the data.
  4. Transfer the results from GPU to CPU.
  5. Release memory on the CPU and GPU.
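
A minimal sketch of this flow, assuming a toy kernel named scale that doubles each element (names and sizes are illustrative):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: doubles every element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // 1. Initialize data on the CPU.
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // 2. Transfer data from CPU to GPU.
    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel on the data.
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    scale<<<grid, block>>>(d_data, n);

    // 4. Transfer the result from GPU back to CPU.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);   // expect 2.0

    // 5. Release memory on GPU and CPU.
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```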

Terms usage

Term | Meaning
SISD | Single Instruction, Single Data
SIMD | Single Instruction, Multiple Data
MISD | Multiple Instruction, Single Data
MIMD | Multiple Instruction, Multiple Data
SIMT | Single Instruction, Multiple Threads

Term   | Meaning
Host   | CPU
Device | GPU
SM     | Streaming Multiprocessor

General Explanation

  • Essentially, Host runs sequential operations and Device runs parallel operations.
  • The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors. Refer thebeardsage/cuda-streaming-multiprocessors for detailed explanation.
  • Multiple thread blocks can execute on the same Streaming Multiprocessor, but one thread block cannot be split across multiple Streaming Multiprocessors.
  • The maximum x, y and z dimensions of a block are 1024, 1024 and 64, and a block must be configured such that x × y × z ≤ 1024, the maximum number of threads per block. Blocks can be organized into one-, two- or three-dimensional grids of up to 2^31-1, 65,535 and 65,535 blocks in the x, y and z dimensions respectively. Unlike the maximum threads per block, there is no separate blocks-per-grid limit beyond the maximum grid dimensions.
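
For example, a 2D launch configuration that respects these limits might look like the following sketch (the kernel name, data layout and tile size are illustrative):

```cuda
__global__ void myKernel(float *data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) data[y * width + x] += 1.0f;
}

void launch(float *d_data, int width, int height) {
    // 16 x 16 = 256 threads per block; x * y * z must not exceed 1024.
    dim3 block(16, 16);
    // Enough blocks in each grid dimension to cover the width x height domain.
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    myKernel<<<grid, block>>>(d_data, width, height);
}
```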

Warps

  • A warp is the basic unit of execution in a CUDA program. A warp is a set of 32 threads within a thread block, and all the threads in a warp execute the same instruction. Threads are grouped into warps consecutively by thread index on the Streaming Multiprocessor.
  • If a subset of threads executes a different instruction than the other threads of a warp, warp divergence occurs. This can reduce the performance of the CUDA program.
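
A sketch contrasting a divergent branch with a warp-uniform one (kernel names are illustrative):

```cuda
// Divergent: odd and even threads of the same warp take different branches,
// so the warp executes both paths with half of its threads masked off each time.
__global__ void divergentBranch(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) out[tid] = 1.0f;
    else              out[tid] = 2.0f;
}

// Not divergent: the condition is uniform within each warp, because all 32
// threads of a warp compute the same warp index.
__global__ void uniformBranch(float *out) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = tid / warpSize;
    if (warpId % 2 == 0) out[tid] = 1.0f;
    else                 out[tid] = 2.0f;
}
```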

Lane

  • A lane is a thread's position within its warp. Lanes are indexed 0 to 31; a lane index is unique within a warp, but multiple threads in a thread block can share the same lane index. (In a 1D block, lane_id = threadIdx.x % 32.)
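
A small sketch that prints each thread's warp and lane index for a 1D block:

```cuda
#include <cstdio>

__global__ void laneInfo() {
    int warpId = threadIdx.x / warpSize;   // which warp within the block
    int laneId = threadIdx.x % warpSize;   // position (0..31) within that warp
    printf("block %d, thread %d -> warp %d, lane %d\n",
           blockIdx.x, threadIdx.x, warpId, laneId);
}
```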

Dynamic Parallelism


  • Early CUDA programs were designed such that the GPU workload was completely under the control of the host thread. Programs had to perform a sequence of kernel launches, and for best performance each kernel had to expose enough parallelism to use the GPU efficiently.
  • CUDA 5.0 introduced Dynamic Parallelism, which makes it possible to launch kernels from threads running on the device; threads can launch more threads. An application can launch a coarse-grained kernel which in turn launches finer-grained kernels to do work where needed. This avoids unwanted computations while capturing all interesting details.
  • This reduces the need to transfer control and data between host and GPU device.
  • Kernel executions are classified into Parent and Child grids. A parent grid starts execution and dispatches some of its workload to child grids; the parent grid's execution ends when its kernel execution is complete. A child grid inherits certain attributes and limits from the parent grid, such as the L1 cache / shared memory configuration and stack size.
  • A grid launched by a device thread is visible to all threads in that thread block. Execution of a thread block is not complete until all child grids created by its threads are complete.
  • Grids launched with dynamic parallelism are fully nested. This means that child grids always complete before the parent grids that launch them, even if there is no explicit synchronization.
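
A minimal sketch of a parent grid launching a child grid (kernel names are illustrative; dynamic parallelism requires compute capability 3.5+ and relocatable device code, e.g. nvcc -rdc=true -lcudadevrt):

```cuda
__global__ void childKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

__global__ void parentKernel(int *data, int n) {
    // A single parent thread dispatches a finer-grained child grid.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }
    // Because launches are fully nested, the child grid completes before
    // this parent grid is considered complete.
}
```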

CUDA Memory Management

Refer CUDA Runtime API documentation for details.

Registers (On-Chip)


  • Registers are fast on-chip memories that are used to store operands for the operations executed by the computing cores.
  • In general, scalar variables defined in kernel code are stored in registers.
  • Registers are local to a thread, and each thread has exclusive access to its own registers. Values in registers cannot be accessed by other threads, even from the same block, and are not available for the host. Registers are also not permanent, therefore data stored in registers is only available during the execution of a thread.
  • Register Spills: If a kernel uses more registers than the hardware limit, the excess registers will spill over to local memory causing performance deterioration.
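
Register usage (and spill loads/stores) per kernel can be inspected by compiling with nvcc -Xptxas -v. The sketch below uses __launch_bounds__ to give the compiler a register budget hint; the values are illustrative:

```cuda
// Hint: this kernel is launched with at most 256 threads per block and we
// want at least 2 resident blocks per SM. The compiler may limit register
// usage to meet this, spilling to local memory if the kernel needs more
// registers than the resulting cap allows.
__global__ void __launch_bounds__(256, 2)
boundedKernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // scalar temporaries live in registers
}
```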

Shared Memory (On-Chip)

[Figures: Shared Memory, Shared Memory conflict]

  • On-chip memory partitioned among thread blocks. Its lifetime is the lifetime of the thread block.
  • Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory. Threads can access data in shared memory loaded from global memory by other threads within the same thread block.
  • It's only useful when data needs to be accessed more than once, either within the same thread or from different threads within the same block.
  • Shared memory requests are issued per warp. Shared memory is divided into 32 equally sized memory banks, matching the warp size of 32 threads.
  • Shared memory supports 32-bit and 64-bit bank access modes.
  • Issue: shared memory bank conflicts can occur if accesses are not managed properly, i.e. if multiple threads of a warp access different addresses that map to the same memory bank.
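
A classic use is staging a tile in shared memory for a transpose; the padding below is one common way to avoid bank conflicts when the tile is read column-wise. This sketch assumes a square matrix whose side is a multiple of the tile size:

```cuda
#define TILE 32

__global__ void transposeShared(float *out, const float *in, int width) {
    // The +1 padding shifts each row by one bank, so the column-wise reads
    // in the second phase do not all map to the same bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced load from global memory into the shared-memory tile.
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();   // every thread must finish loading before anyone reads

    // Write the transposed tile back, reading shared memory column-wise.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}
```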

Local Memory


  • Variables that cannot be stored in register space are stored in local memory. Memory whose size or indexing cannot be resolved at compile time is also placed in local memory.
  • Memory can also be statically allocated from within a kernel, and according to the CUDA programming model such memory is not global but local memory.
  • Local memory is only visible, and therefore accessible, by the thread allocating it. So all threads executing a kernel will have their own privately allocated local memory.
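
A sketch of a variable that typically ends up in local memory: a per-thread array indexed with a value only known at run time (nvcc -Xptxas -v reports the local memory used):

```cuda
__global__ void localMemExample(float *out, const int *indices, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Per-thread scratch array; the dynamic index below usually prevents the
    // compiler from keeping it in registers, so it is placed in local memory.
    float scratch[64];
    for (int i = 0; i < 64; ++i) scratch[i] = (float)i;

    out[tid] = scratch[indices[tid] % 64];
}
```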

Constant Memory


  • Constant memory is used for storing data that will not change over the course of kernel execution. It supports short-latency, high-bandwidth, read-only access by the device when all threads simultaneously access the same location.
  • It is better used when all threads in a warp access the same memory location. Lifetime is lifetime of the program.
  • Access latency to constant memory is considerably lower than to global memory because constant memory is cached. Unlike global memory, however, constant memory cannot be written to from within a kernel.
  • Since the device can only read constant memory, the data must be initialized from the host, i.e. declared as a global-scope variable and filled before kernel launch.
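
A minimal sketch of declaring constant memory at global scope, filling it from the host with cudaMemcpyToSymbol, and reading it in a kernel (names and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

// Declared at global (file) scope; readable by all kernels, written only by the host.
__constant__ float coeffs[16];

__global__ void applyCoeff(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // All threads of a warp read the same constant location: the ideal pattern.
    if (i < n) data[i] *= coeffs[0];
}

void setCoeffs(const float *hostCoeffs) {
    // The host initializes constant memory through cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));
}
```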

There are other memory types, such as Global and Texture memory. Refer CUDA Memory Model for details.


Pinned memory

[Figure: Memory Paging]

  • CUDA uses DMA to transfer pinned memory to the GPU device. Pageable memory cannot be transferred directly to the device; it is first copied to pinned (page-locked) memory and then copied to the GPU device.
  • Pinned to Device transfer is faster than Pageable to Device transfer.
  • cudaMallocHost and cudaFreeHost functions can be used to allocate pinned memory directly.
  • Refer Nvidia blog for details.
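
A minimal sketch of allocating pinned host memory with cudaMallocHost; pinned buffers also allow asynchronous copies on a stream (sizes are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned (page-locked) host allocation: eligible for direct DMA transfers.
    float *h_pinned;
    cudaMallocHost((void **)&h_pinned, bytes);
    for (int i = 0; i < n; ++i) h_pinned[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, bytes);

    // Asynchronous copy on a stream; only possible because the source is pinned.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_pinned);   // pinned memory is released with cudaFreeHost
    return 0;
}
```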

Global Memory Access Patterns

  • Refer Nvidia blog.
  • Refer Medium article.
  • A global memory load first goes through the L1 cache (termed a cached memory access). If the L1 cache misses, the request is sent to the L2 cache; if the L2 cache also misses, the request is sent to DRAM. Loads that bypass the L1 cache are referred to as un-cached memory accesses.
  • If the L1 cache line is used, memory is served in 128-byte segments.
  • If only the L2 cache is used, memory is served in 32-byte segments.
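
A sketch contrasting a coalesced load pattern with a strided (un-coalesced) one:

```cuda
// Coalesced: consecutive threads read consecutive addresses, so a warp's
// 32 loads fall into a small number of 128-byte (or 32-byte) segments.
__global__ void coalescedCopy(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses 'stride' elements apart, so a
// warp's loads touch many separate segments and need more transactions.
__global__ void stridedCopy(float *out, const float *in, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```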

Global Memory Store

  • Memory writes use only the L2 cache and are serviced in 32-byte segments.

Array of Structures (AOS) vs. Structure of Arrays (SOA)

  • Refer Wiki for explanation on AOS and SOA.
  • In CUDA programming, SoA is preferred over AoS for global memory efficiency. This is because, with SoA, neighboring threads access contiguous elements, so the accesses coalesce into fewer memory transactions.
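
A sketch of the two layouts for the same data (the struct and field names are illustrative):

```cuda
// Array of Structures: thread i reads points[i].x, so neighboring threads
// touch addresses three floats apart and the accesses coalesce poorly.
struct PointAoS { float x, y, z; };

__global__ void scaleX_AoS(PointAoS *points, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) points[i].x *= 2.0f;
}

// Structure of Arrays: thread i reads x[i], so neighboring threads touch
// consecutive addresses and the accesses coalesce into few transactions.
struct PointsSoA { float *x, *y, *z; };

__global__ void scaleX_SoA(PointsSoA points, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) points.x[i] *= 2.0f;
}
```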

Partition Camping


  • Partition camping occurs when global memory accesses are directed through a subset of partitions, causing requests to queue up at some partitions while other partitions go unused.
  • Since partition camping concerns how active thread blocks behave, the issue of how thread blocks are scheduled on multiprocessors is important.

Warp Shuffle


  • It is a mechanism for one thread to read another thread's register when both threads are within the same warp. This avoids an explicit copy between thread registers and global or shared memory.
  • This method does not consume extra memory to share/exchange data, and it is much faster than going through shared memory.
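
A sketch of a warp-level reduction built on __shfl_down_sync (assumes full warps and that *out is zero-initialized on the host):

```cuda
// Sums one value per thread across a warp without touching shared memory.
__device__ float warpReduceSum(float val) {
    // Each step adds the value held by the lane 'offset' positions higher.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 ends up holding the warp's sum
}

__global__ void reduceSum(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    v = warpReduceSum(v);
    if (threadIdx.x % warpSize == 0)
        atomicAdd(out, v);   // one atomic add per warp
}
```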

Images

Schematic

[Image: Schematic]

Software vs. Hardware perspective

[Image: Software vs. Hardware perspective]

Tools

  • NVIDIA Nsight Systems
    NVIDIA Nsight™ Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs, from large servers to our smallest system on a chip (SoC).
  • NVIDIA Nsight Compute
    NVIDIA® Nsight™ Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging via a user interface and command line tool. In addition, its baseline feature allows users to compare results within the tool. Nsight Compute provides a customizable and data-driven user interface and metric collection and can be extended with analysis scripts for post-processing results.

Repository Parts

Number | Repository | Description
1  | Hello World | Programmer's induction. Hello World from GPU.
2  | Print | Print ThreadIdx, BlockIdx, GridDim.
3  | Addition | Perform addition operation on GPU.
4  | Add Arrays | Perform addition of three arrays on GPU.
5  | Global Index | Calculate Global Index for any dimensional grid and any dimensional block.
6  | Device properties | Print some GPU device properties.
7a | Reduce Sum with Loop Unroll | Perform reduction sum operation with loop unroll in GPU kernel.
7b | Reduce Sum with Warp Unroll | Perform reduction sum operation with warp unroll in GPU kernel. Solution for warp divergence.
7c | Reduce Sum with Complete Unroll | Perform reduction sum operation with completely unrolled loop and Shared Memory in GPU kernel.
8  | Coalesced vs. Un-Coalesced memory pattern | TODO
9  | Matrix Transpose | Perform matrix transpose in different fashions: Row Major, Column Major, Unrolled Loop, Diagonal Coordinates, Shared Memory.
10 | Static and Dynamic Shared Memory | TODO
11 | Warp Shuffle | Perform Warp Shuffle on 1D data: __shfl_sync, __shfl_up_sync, __shfl_down_sync, __shfl_xor_sync, Reduce Sum with Warp Shuffle loop unrolling.
12 | TODO | TODO

Terms

Streaming Multiprocessor, Grid, Thread Block, Thread, Warp, Kernel, __syncthreads, Occupancy, Shared Memory, Registers, Dynamic Parallelism, Parallel Reduction, Parent, Child, Temporal Locality, Spatial Locality, Coalesced Memory Pattern, Un-Coalesced Memory Pattern, L1 Cache, L2 Cache, Constant Memory, Warp Shuffle

References

Happy Learning! 😄
