Histogram_with_CUDA

First, the basic histogram and the Kogge-Stone scan algorithm are executed. The basic histogram kernel does not use shared memory. atomicAdd is used because it performs the read-modify-write on a memory address as a single hardware instruction, so it prevents data races during parallel thread execution.

Each thread is then assigned to produce the contents of one scan element. The compile-time constant SECTION_SIZE defines the size of a section and is also used as the block size at kernel launch, so the number of threads equals the number of section elements. For large input sequences, final adjustments are then applied to these per-section scan results. All threads in a block cooperatively load elements of the histo[] array into a shared-memory array scan[]. Barrier synchronization ensures every thread completes its current additions before the next iteration starts. The minimum of the CDF (cdf min) is found in order to compute the histogram equalization. At the end of the kernel, each thread writes its result to its assigned position in the output array scanning[]. Then the histogram equalization is computed.
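The kernels below are a minimal sketch of these two steps, not the exact code in this repository; the kernel names histo_basic and koggeStoneScan, the NUM_BINS constant, and the grid-stride loops are assumptions for illustration.

```cuda
#define NUM_BINS     256   // assumes 8-bit input values
#define SECTION_SIZE 256   // section size == block size, as described above

// Basic histogram: no shared memory. atomicAdd serializes conflicting
// updates to the same bin as a single read-modify-write hardware
// instruction, preventing data races between parallel threads.
__global__ void histo_basic(const unsigned char *data, unsigned int n,
                            unsigned int *histo) {
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += blockDim.x * gridDim.x)
        atomicAdd(&histo[data[i]], 1u);
}

// Kogge-Stone inclusive scan of one SECTION_SIZE section. Each thread owns
// one element; __syncthreads() makes every thread finish the current
// iteration's additions before the next iteration starts.
__global__ void koggeStoneScan(const unsigned int *in, unsigned int *out,
                               unsigned int n) {
    __shared__ unsigned int scan[SECTION_SIZE];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    scan[threadIdx.x] = (i < n) ? in[i] : 0u;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        unsigned int v =
            (threadIdx.x >= stride) ? scan[threadIdx.x - stride] : 0u;
        __syncthreads();              // all reads finish before any write
        scan[threadIdx.x] += v;
    }
    if (i < n) out[i] = scan[threadIdx.x];
}
```

Given the scanned histogram (the CDF) and its minimum, the standard equalization mapping is equalized[v] = round((cdf[v] - cdf_min) / (n - cdf_min) * 255) for 256 gray levels, which corresponds to the cdf-min step described above.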

My device is a GeForce 940MX with a warp size of 32. My SECTION_SIZE is 256, so 256 / 32 = 8 warps are used.

Next, the privatized histogram and the Brent-Kung scan algorithm are executed. These two algorithms run faster than the previous ones. The privatized histogram uses shared memory, which gives much less contention and serialization when accessing both the private copies and the final copy, and therefore improves performance. Since the Brent-Kung algorithm always uses consecutive threads in each iteration, control divergence does not occur until the number of active threads falls below the warp size, which increases the efficiency of the algorithm. The minimum of the CDF is then found to compute the histogram equalization. At the end of the kernel, each thread writes its result to its assigned position in the output array scanning[]. Then the histogram equalization is computed.
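A sketch of the faster pair, under the same assumptions as above: the privatized histogram keeps a per-block copy in shared memory so atomics mostly contend within one block, and the Brent-Kung scan processes two elements per thread, so it is typically launched with SECTION_SIZE / 2 threads per block.

```cuda
// Privatized histogram: atomics hit a fast per-block shared-memory copy;
// only NUM_BINS global atomics per block merge it into the final histogram.
__global__ void histo_private(const unsigned char *data, unsigned int n,
                              unsigned int *histo) {
    __shared__ unsigned int histo_s[NUM_BINS];
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        histo_s[b] = 0u;
    __syncthreads();
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += blockDim.x * gridDim.x)
        atomicAdd(&histo_s[data[i]], 1u);
    __syncthreads();
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);
}

// Brent-Kung inclusive scan: two elements per thread, launched with
// SECTION_SIZE / 2 threads. Active threads stay consecutive, so control
// divergence appears only once fewer than a warp's worth remain active.
__global__ void brentKungScan(const unsigned int *in, unsigned int *out,
                              unsigned int n) {
    __shared__ unsigned int scan[SECTION_SIZE];
    unsigned int i = 2 * blockIdx.x * blockDim.x + threadIdx.x;
    scan[threadIdx.x] = (i < n) ? in[i] : 0u;
    scan[threadIdx.x + blockDim.x] =
        (i + blockDim.x < n) ? in[i + blockDim.x] : 0u;

    // Reduction tree phase.
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        unsigned int idx = (threadIdx.x + 1) * 2 * stride - 1;
        if (idx < SECTION_SIZE) scan[idx] += scan[idx - stride];
    }
    // Reverse (distribution) tree phase.
    for (unsigned int stride = SECTION_SIZE / 4; stride > 0; stride /= 2) {
        __syncthreads();
        unsigned int idx = (threadIdx.x + 1) * 2 * stride - 1;
        if (idx + stride < SECTION_SIZE) scan[idx + stride] += scan[idx];
    }
    __syncthreads();
    if (i < n) out[i] = scan[threadIdx.x];
    if (i + blockDim.x < n)
        out[i + blockDim.x] = scan[threadIdx.x + blockDim.x];
}
```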

So in the second part, using more efficient implementations of both the histogram and the scan accelerated the process.