New register assignment logic
- Implemented radix codelets up to 47.
- Implemented composite radix codelets for arbitrary composite stage sizes.
- Implemented new register assignment logic, aimed at optimizing shared memory transfers, register usage and warp utilization.
- Performance improvements for all system sizes - please report regressions if they happen (especially for vendors other than Nvidia and AMD).
- All double pointers passed to VkFFT now make a local copy of their contents (#184, #185).
- Fixed locale setting for the code generator (vincefn/pyvkfft#38) - see the sketch below.
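For context on the locale item: code generators that print floating-point constants through the printf family emit ',' instead of '.' as the decimal separator under locales such as de_DE, which produces uncompilable kernel source. Below is a minimal sketch of the usual remedy, with a hypothetical emit_double_constant helper; this is not VkFFT's actual code, only the save-and-restore pattern behind this kind of fix.

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not VkFFT's actual code): print a floating-point
   constant into generated kernel source with a guaranteed '.' separator.
   setlocale returns a pointer to a static buffer, so the current locale
   string is duplicated before the locale is changed, then restored. */
static int emit_double_constant(char* out, size_t outSize, double value) {
	const char* current = setlocale(LC_NUMERIC, NULL);
	char* saved = current ? strdup(current) : NULL;

	setlocale(LC_NUMERIC, "C");  /* force '.' as the decimal separator */
	int written = snprintf(out, outSize, "%.17e", value);

	if (saved) {
		setlocale(LC_NUMERIC, saved);  /* restore the caller's locale */
		free(saved);
	}
	return written;
}

Note that setlocale is process-wide; on POSIX a thread-safe variant would use newlocale/uselocale instead. The sketch only illustrates the basic idea.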
DTolm committed Sep 23, 2024
1 parent 9a96811 commit ae94053
Showing 1 changed file with 9 additions and 8 deletions.
@@ -424,15 +424,16 @@ static inline VkFFTResult VkFFTOptimizeRaderFFTRegisters(VkFFTRaderContainer* ra
             }
         }
     }*/
-
-    for (pfINT j = 2; j < 68; j++) {
-        if (raderContainer[i].registers_per_thread_per_radix[j] != 0) {
-            double scaling = (raderContainer[i].containerFFTDim > raderContainer[i].registers_per_thread_per_radix[j]) ? pfceil(raderContainer[i].containerFFTDim / (double)raderContainer[i].registers_per_thread_per_radix[j]) : 1.0 / floor(raderContainer[i].registers_per_thread_per_radix[j] / (double)raderContainer[i].containerFFTDim);
-            while (((int)pfceil(fftDim / (double)min_registers_per_thread[0])) < (raderContainer[i].containerFFTNum * scaling)) {
-                raderContainer[i].registers_per_thread_per_radix[j] += (int)j;
-                scaling = (raderContainer[i].containerFFTDim > raderContainer[i].registers_per_thread_per_radix[j]) ? pfceil(raderContainer[i].containerFFTDim / (double)raderContainer[i].registers_per_thread_per_radix[j]) : 1.0 / floor(raderContainer[i].registers_per_thread_per_radix[j] / (double)raderContainer[i].containerFFTDim);
+    if (numRaderPrimes>1){
+        for (pfINT j = 2; j < 68; j++) {
+            if (raderContainer[i].registers_per_thread_per_radix[j] != 0) {
+                double scaling = (raderContainer[i].containerFFTDim > raderContainer[i].registers_per_thread_per_radix[j]) ? pfceil(raderContainer[i].containerFFTDim / (double)raderContainer[i].registers_per_thread_per_radix[j]) : 1.0 / floor(raderContainer[i].registers_per_thread_per_radix[j] / (double)raderContainer[i].containerFFTDim);
+                while (((int)pfceil(fftDim / (double)min_registers_per_thread[0])) < (raderContainer[i].containerFFTNum * scaling)) {
+                    raderContainer[i].registers_per_thread_per_radix[j] += (int)j;
+                    scaling = (raderContainer[i].containerFFTDim > raderContainer[i].registers_per_thread_per_radix[j]) ? pfceil(raderContainer[i].containerFFTDim / (double)raderContainer[i].registers_per_thread_per_radix[j]) : 1.0 / floor(raderContainer[i].registers_per_thread_per_radix[j] / (double)raderContainer[i].containerFFTDim);
+                }
+                if (raderContainer[i].registers_per_thread_per_radix[j] > raderContainer[i].registers_per_thread) raderContainer[i].registers_per_thread = raderContainer[i].registers_per_thread_per_radix[j];
             }
-            if (raderContainer[i].registers_per_thread_per_radix[j] > raderContainer[i].registers_per_thread) raderContainer[i].registers_per_thread = raderContainer[i].registers_per_thread_per_radix[j];
         }
     }
     if (raderContainer[i].registers_per_thread > registers_per_thread[0]) registers_per_thread[0] = raderContainer[i].registers_per_thread;
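For readers following the diff: the balancing heuristic itself is unchanged; the commit gates it behind numRaderPrimes > 1 so single-prime decompositions skip the extra pass. Pulled out of context, the per-radix scaling term it iterates on reads roughly as follows. This is a hypothetical standalone restatement with plain ints and math.h, not a helper from the library.

#include <math.h>

/* Hypothetical restatement of the scaling term from the loop above.
   If the container FFT dimension exceeds the registers assigned to this
   radix, each thread needs ceil(dim / regs) passes; otherwise one pass
   already covers floor(regs / dim) container FFTs, so scaling < 1. */
static double rader_scaling(int containerFFTDim, int registersPerThread) {
	if (containerFFTDim > registersPerThread)
		return ceil(containerFFTDim / (double)registersPerThread);
	return 1.0 / floor(registersPerThread / (double)containerFFTDim);
}

The while loop in the diff keeps adding j registers to radix j until ceil(fftDim / min_registers_per_thread[0]) reaches containerFFTNum * scaling, i.e. until the per-thread register budget covers the work of the whole container.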
