Multiple buffer input/output support. AMD TLB fix

-VkFFT now supports input/output data stored in multiple buffers, passed as a pointer to an array of buffers together with the number of buffers. VkFFT uses the memory consecutively, logically splitting the data into chunks of the smallest buffer size. This allows using data split between different memory allocations and mitigates the 4GB single allocation limit. Sample 10 shows how this works.
-VkFFT can now mitigate TLB misses on big sequences. These were detrimental to AMD GPU performance before; the fix gives up to 5x performance gains for big systems there. It is done by a logical split of the input/output buffer into 16KB chunks if a sequence spans more than 2MB. Two parameters, localPageSize and devicePageSize, control these two sizes. The fix can also be combined with the multiple buffers update.
-Updated performance plots.
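
The addressing described in the first two points can be sketched in C as follows. This is only an illustration of one reading of the commit message, not code taken from VkFFT: the type `chunk_location`, the functions `smallest_buffer_size`, `locate` and `page_chunk_count`, and the two constants are all hypothetical names. It assumes that each buffer contributes exactly one chunk of the smallest buffer size, and it uses the 16KB and 2MB figures quoted above directly as bytes rather than going through localPageSize and devicePageSize, whose units are not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration types and constants, not part of the VkFFT API. */
typedef struct {
    size_t   buffer_id; /* which allocation holds the byte */
    uint64_t offset;    /* byte offset inside that allocation */
} chunk_location;

#define SPLIT_CHUNK_BYTES     16384ULL   /* 16KB chunk from the commit message */
#define SPLIT_THRESHOLD_BYTES 2097152ULL /* 2MB threshold from the commit message */

/* The smallest buffer sets the logical chunk size. */
static uint64_t smallest_buffer_size(const uint64_t *sizes, size_t count) {
    assert(count > 0);
    uint64_t smallest = sizes[0];
    for (size_t i = 1; i < count; i++) {
        if (sizes[i] < smallest) smallest = sizes[i];
    }
    return smallest;
}

/* Map a global byte offset to (buffer, local offset).
 * Assumption: chunk k lives at the start of buffer k, so any space beyond
 * the smallest buffer size is left unused in the larger buffers. */
static chunk_location locate(uint64_t global_offset,
                             const uint64_t *sizes, size_t count) {
    uint64_t chunk = smallest_buffer_size(sizes, count);
    assert(chunk > 0);
    chunk_location loc;
    loc.buffer_id = (size_t)(global_offset / chunk);
    loc.offset    = global_offset % chunk;
    assert(loc.buffer_id < count); /* offset must fit in the buffers provided */
    return loc;
}

/* Number of 16KB chunks a sequence is split into, or 0 when the sequence
 * does not span more than 2MB and no split is applied. */
static uint64_t page_chunk_count(uint64_t sequence_bytes) {
    if (sequence_bytes <= SPLIT_THRESHOLD_BYTES) return 0;
    return (sequence_bytes + SPLIT_CHUNK_BYTES - 1) / SPLIT_CHUNK_BYTES;
}
```

With buffers of 4096, 1024 and 2048 bytes, the chunk size in this sketch is 1024 bytes, so global offset 2500 lands in the third buffer at local offset 452. How VkFFT actually reorders the 16KB chunks to avoid TLB misses is not described in this commit message, so the sketch only covers the split decision.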
DTolm · Nov 26, 2020 · 54c6a55 · 54c6a55
1 parent 95f69e9
commit 54c6a55
Showing 240 changed files with 14,758 additions and 7,641 deletions.
diff --git a/README.md b/README.md
@@ -21,6 +21,7 @@ VkFFT is an efficient GPU-accelerated multidimensional Fast Fourier Transform li
   - Native zero padding to model open systems (up to 2x faster than simply padding input array with zeros)
   - WHDCN layout - data is stored in the following order (sorted by increase in strides): the width, the height, the depth, the coordinate (the number of feature maps), the batch number
   - Multiple feature/batch convolutions - one input, multiple kernels
+  - Multiple input/output/temporary buffer split. Allows to use data split between different memory allocations and mitigate 4GB single allocation limit.
   - Works on Nvidia, AMD and Intel GPUs (tested on Nvidia RTX 3080, GTX 1660 Ti, AMD Radeon VII and Intel UHD 620)
   - Header-only (+precompiled shaders) library with Vulkan interface, which allows to append VkFFT directly to user's command buffer
 ## Future release plan
@@ -40,7 +41,7 @@ VkFFT has a command-line interface with the following set of commands:\
 -devices: print the list of available GPU devices\
 -d X: select GPU device (default 0)\
 -o NAME: specify output file path\
--vkfft X: launch VkFFT sample X (0-9)\
+-vkfft X: launch VkFFT sample X (0-10)\
 -cufft X: launch cuFFT sample X (0-3) (if enabled in CMakeLists.txt)\
 -test: (or no other keys) launch all VkFFT and cuFFT benchmarks\
 So, the command to launch single precision benchmark of VkFFT and cuFFT and save log to output.txt file on device 0 will look like this on Windows:\
@@ -51,16 +52,17 @@ VkFFT.h is a library which can append FFT, iFFT or convolution calculation to th
 VkFFT achieves striding by grouping nearby FFTs instead of transpositions.
 ![alt text](https://github.com/dtolm/VkFFT/blob/master/FFT_memory_layout.png?raw=true)
 ## Benchmark results in comparison to cuFFT
-To measure how Vulkan FFT implementation works in comparison to cuFFT, we will perform a number of 1D, 2D and 3D tests, ranging from the small systems to the big ones. The test will consist of performing C2C FFT and inverse C2C FFT consecutively multiple times to calculate average time required. The results are obtained on Nvidia RTX 3080 and AMD Radeon VII graphics cards with no other GPU load. Launching -test key from Vulkan_FFT.cpp performs VkFFT/cuFFT benchmark. The overall benchmark score is calculated as an averaged performance score over presented set of systems (the bigger - the better): sum(system_size/average_iteration_time) /num_benchmark_samples
+To measure how Vulkan FFT implementation works in comparison to cuFFT, we will perform a number of 1D, 2D and 3D tests, ranging from the small systems to the big ones. The test will consist of performing C2C FFT and inverse C2C FFT consecutively multiple times to calculate average time required. The results are obtained on Nvidia RTX 3080, AMD Radeon VII and AMD Radeon 6800XT graphics cards with no other GPU load. Launching -test key from Vulkan_FFT.cpp performs VkFFT/cuFFT benchmark. The overall benchmark score is calculated as an averaged performance score over presented set of systems (the bigger - the better): sum(system_size/average_iteration_time) /num_benchmark_samples
 
-The stable flat lines present on RTX 3080 graph indicate that time scales linearly with the system size on Nvidia GPUs, so the bigger the bandwidth the better the result will be. The stepwise drops occur once the amount of transfers increases from to 2x and to 3x when compute unit can't hold full sequence and splits it into combination of smaller ones. Radeon VII is faster than RTX 3080 below 2^17 due to it having HBM2 memory with a higher bandwidth, however, this GPU apparently has TLB miss problems on large buffer sizes. On RTX 3080, VkFFT is faster than cuFFT in single precision batched 1D FFTs on the whole range from 2^7 to 2^28:
+The stable flat lines present for small sequence lengths indicate that time scales linearly with the system size, so the bigger the bandwidth the better the result will be. The stepwise drops occur once the amount of transfers increases to 2x and to 3x, when the compute unit can't hold the full sequence and splits it into a combination of smaller ones. Radeon VII is faster than RTX 3080 below 2^18 (=2MB, the page file size on AMD) due to its HBM2 memory with a higher bandwidth; however, this GPU apparently has TLB miss problems on large buffer sizes. The TLB problems are solved with a logical split of the buffer into 16KB chunks if a sequence spans more than 2MB. Note: the 6800XT benchmarks were run on a version without the TLB fix. On RTX 3080, VkFFT is faster than cuFFT in single precision batched 1D FFTs on the whole range from 2^7 to 2^28:
 ![alt text](https://github.com/DTolm/VkFFT/blob/master/vkfft_benchmark_single.png?raw=true)
-In double precision Radeon VII is able to get advantage due to its high double precision core count:
+In double precision Radeon VII is able to get advantage due to its high double precision core count. Radeon RX 6800XT can store LUT in L3 cache and has higher double precision core count as well:
 ![alt text](https://github.com/DTolm/VkFFT/blob/master/vkfft_benchmark_double.png?raw=true)
 In half precision mode, VkFFT only uses it for data storage, all computations are performed in single.It still proves to be enough to get stable 2x performance gain on RTX 3080: 
 ![alt text](https://github.com/DTolm/VkFFT/blob/master/vkfft_benchmark_half.png?raw=true)
-Native support for zero-padding allows to transfer less data and get up to 3x performance boost in multidimensional FFTs:
-![alt text](https://github.com/DTolm/VkFFT/blob/master/vkfft_benchmark_zeropadding.png?raw=true)
+Multidimensional systems are optimized as well. Benchmark shows Radeon RX 6800XT can store systems up to 128MB in L3 cache for big performance gains. Native support for zero-padding allows to transfer less data and get up to 3x performance boost in multidimensional FFTs:
+![alt text](https://github.com/DTolm/VkFFT/blob/master/vkfft_benchmark_2d.png?raw=true)
+![alt text](https://github.com/DTolm/VkFFT/blob/master/vkfft_benchmark_3d.png?raw=true)
 ## Precision comparison of cuFFT/VkFFT/FFTW
 To measure how VkFFT (single/double precision) results compare to cuFFT (single/double precision) and FFTW (double precision), a set of ~50 systems covering full FFT range was filled with random complex data on the scale of [-1,1] and one C2C transform was performed on each system. The precision_cuFFT_VkFFT_FFTW.cu script in benchmark_precision_scripts folder contains the comparison code, which calculates for each value of the transformed system: