I remember starting my first class in CUDA programming. I was really excited: so much new information about hardware, threads and grids … I really wanted to understand what's happening out there in the GPU cores, but unfortunately I had lots of other work. My first homework was to implement a SCAN operation on the GPU using CUDA-C. I laugh every time I remember what my code looked like; it was a nightmare … and it was everything but good, efficient and useful code. In fact, the GPU CUDA implementation was about 10 times slower than the CPU version …

Recently, I have been studying at udacity and, you know, I don't like homework … But I decided to even the score by implementing a good SCAN algorithm :) For those who have never met SCAN, it's an operation on an array: each output element is the sum of all input elements up to and including its own position (input = {0,1,2,3,4,5,6 …} –> output = {0,1,3,6,10,15,21 …}). There are two major versions of this algorithm: INCLUSIVE and EXCLUSIVE SCAN. Today, we will talk about the first one and how to implement it on the GPU using the Hillis-Steele algorithm. Let us see how it works:

At first, take a look at the number of steps. It is a log2(N) dependence, therefore we need 3 steps to calculate the SCAN of 8 elements. Now, how it works is easy. We begin with a variable "space" = 1. During each iteration, we multiply this variable by 2. This is a very useful variable, because if we have "N" threads, each thread can do the correct operation: if (threadIdx < space) –> just rewrite the last result for the next step. If (threadIdx >= space) –> add the element positioned at [threadIdx – space] to the current one. Continue to the next step.
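The per-step rule above can be simulated on the CPU with two buffers; the inner loop plays the role of the N parallel threads (this is my own illustration, not the GPU code):

```cpp
#include <cassert>
#include <vector>

// One Hillis-Steele pass per value of "space". Every step reads from the
// previous buffer and writes into a fresh one, exactly what the GPU
// threads do in parallel.
std::vector<int> hillisSteele(std::vector<int> in)
{
    int n = (int)in.size();
    std::vector<int> out(n);
    for (int space = 1; space < n; space *= 2) {
        for (int tid = 0; tid < n; ++tid) {           // each "tid" is one thread
            if (tid >= space)
                out[tid] = in[tid] + in[tid - space]; // add the partner element
            else
                out[tid] = in[tid];                   // just rewrite the last result
        }
        in.swap(out);   // result of this step becomes input of the next
    }
    return in;
}
```

For 8 elements the loop really runs exactly 3 times (space = 1, 2, 4), matching the log2(N) step count.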

Now this is really very easy, but note that we have to synchronize writing and reading from memory. This is a problem in many algorithms (FFT, Bitonic Sort, SCAN …). As you know, we can only synchronize within a block, so we can calculate a SCAN of up to 1024 elements without any problems:
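A sketch of what such a single-block kernel might look like (all identifiers are my own, not the original homework code); it keeps the data in shared memory and puts a __syncthreads() between every read and write so no thread overwrites an element its partner still needs:

```cuda
// Inclusive Hillis-Steele scan of one block (up to 1024 elements).
__global__ void scanBlock(const int *in, int *out, int n)
{
    __shared__ int temp[1024];
    int tid = threadIdx.x;

    if (tid < n) temp[tid] = in[tid];
    __syncthreads();

    for (int space = 1; space < n; space *= 2) {
        int val = 0;
        if (tid >= space && tid < n)
            val = temp[tid - space];   // read the partner element...
        __syncthreads();               // ...before anyone overwrites it
        if (tid >= space && tid < n)
            temp[tid] += val;
        __syncthreads();               // all writes done before the next step
    }

    if (tid < n) out[tid] = temp[tid];
}

// launch for a single block of n <= 1024 elements:
// scanBlock<<<1, 1024>>>(d_in, d_out, n);
```

Note that every thread reaches both __syncthreads() calls; only the reads and writes are guarded by the conditions.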

And of course, because accessing main memory takes too much time, we use shared memory instead to implement the problem more effectively :) There is still, however, one big problem: GPU algorithms are only efficient when they have huge amounts of data to process … And for sure, 1024 elements is simply not enough; at that size it is better to leave the calculation to the CPU. On the other hand, I like to use the GPU :) Therefore, I decided to implement this algorithm to handle these huge amounts of data, unlike in my previous post, where we were limited to 65536 elements :) The trick is in using a 2D grid, I mean, when we have both X and Y dimensions of the running blocks:

This is just a simple modification added to our host code; however, we actually need to know how to synchronize the memory reading/writing across blocks. I have been curious about this for a long time, but so far I haven't found anything better than simply using 2 (that's enough) or 3 memory arrays and calling a kernel multiple times with different arguments. This means something like "put the synchronization part on the host". It is a good method, I believe, and I have improved it since last time (FFT on long arrays), because I no longer need to copy the whole array back to the input after the work is finished. Instead, during each step we simply swap input and output (step 0: read from input, write to output; step 1: read from output, write to input …):
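The ping-pong idea from the paragraph above might be sketched like this on the host side (the kernel name, its signature and the buffer names are my assumptions, not the original code):

```cuda
// One cross-block step of the scan: each thread handles one element.
__global__ void scanStep(const int *in, int *out, int space, int n)
{
    int i = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = (i >= space) ? in[i] + in[i - space] : in[i];
}

// Host side: d_a and d_b are two device buffers of n ints, d_a holds the
// input. After each kernel call the roles of input and output swap, so
// no cudaMemcpy back into the input buffer is ever needed.
void scanLarge(int *d_a, int *d_b, int n, dim3 grid, dim3 block)
{
    int *d_in  = d_a;
    int *d_out = d_b;
    for (int space = 1; space < n; space *= 2) {
        scanStep<<<grid, block>>>(d_in, d_out, space, n);
        // launches on the same stream already run in order; an explicit
        // cudaDeviceSynchronize() here just makes the step boundary visible
        int *tmp = d_in; d_in = d_out; d_out = tmp;  // swap buffers
    }
    // the final result is in d_in (it was swapped after the last step)
}
```

Relaunching the kernel once per step is exactly the "synchronization on the host" trick: the end of one kernel is the only global barrier across all blocks.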

  •  We also need a minor modification to the host code as I have said already:


The most important thing in implementing an algorithm is to always draw a picture of how it works and think about it for a while :) That's really all. Have a good day and feel free to email me at [email protected] … You can alternatively join the Intro to Parallel Computing course at udacity or post some comments :)

EDIT: 18.4.2016

The old code is a mess and can easily be replaced with a modern version (see the last link, which implements the SCAN on large arrays). The major modifications include:

  • No additional variables during BLOCK/THREAD initialization
  • Automatic calculation of the number of steps
  • Kernel calls with parameters rather than memcpy(4*int)
  • CPU version of the same algorithm added for comparison
  • More user-friendly for-loops, result display and program exit
  • malloc has been replaced with the "new" operator