I remember starting my first class in CUDA programming. I was really excited: so much new information about hardware, threads, and grids … I really wanted to understand what was happening out there in the GPU cores, but unfortunately I had lots of other work. My first homework was to implement a SCAN operation on the GPU using CUDA C. I laugh every time I remember what my code looked like; it was a nightmare … and it was everything but good, efficient, and useful code. In fact, the GPU CUDA implementation was about 10 times slower than the CPU version …

Recently, I have been studying at Udacity and, you know, I don't like homework … but I decided to even the score by implementing a good SCAN algorithm :) For those who have never met SCAN, it is an operation on one array: each output element is the sum of all elements up to and including that position in the input array (input = {0,1,2,3,4,5,6 …} –> output = {0,1,3,6,10,15,21 …}). There are two major versions of this algorithm: INCLUSIVE and EXCLUSIVE SCAN. Today, we will talk about the first one and how to implement it on the GPU using the Hillis-Steele algorithm. Let us see how it works:

At first, take a look at the number of steps. It is a log2(N) dependence, therefore we need 3 steps to calculate the SCAN of 8 elements. Now, how it works is easy. We begin with a variable "space" = 1. During each iteration, we multiply this variable by 2. This is a very useful variable, because if we have "N" threads, each thread can do the correct operation: if ("threadIdx" < space) –> just copy the last result to the next step. If ("threadIdx" >= space) –> add the element positioned at [threadIdx – space] to the current one. Then continue to the next step.

Now, this is really easy, but note that we have to synchronize writing and reading from memory. This is a problem in many algorithms (FFT, Bitonic Sort, SCAN, …). As you know, we can only synchronize within a block, so we can calculate a SCAN of up to 1024 elements without any problems:

```
__global__ void Kernel(int *input, int *output)
{
    int tid = threadIdx.x;

    //USE SHARED MEMORY - COME ON, WE ARE EXPERIENCED PROGRAMMERS
    __shared__ int Cache[1024];
    Cache[tid] = input[tid];
    __syncthreads();

    int space = 1;

    //BEGIN: log2(1024) = 10 STEPS
    for (int i = 0; i < 10; i++) {
        int temp = Cache[tid];
        int neighbor = 0;
        if ((tid - space) >= 0) {
            neighbor = Cache[tid - space];
        }
        __syncthreads(); //EVERYBODY READS BEFORE ANYBODY WRITES

        if (tid >= space) {
            Cache[tid] = temp + neighbor;
        }
        __syncthreads(); //EVERYBODY WRITES BEFORE THE NEXT STEP

        space = space * 2;
    }

    //REWRITE RESULTS TO MAIN MEMORY
    output[tid] = Cache[tid];
}
```

And of course, because accessing main memory takes too much time, we use shared memory instead to implement the problem more effectively :) … There is still, however, one big problem: GPU algorithms only pay off when they have huge amounts of data to process … and 1024 elements is simply not enough, so it would be better to leave that calculation to the CPU. On the other hand, I like to use the GPU :) So I decided to implement this algorithm to be able to handle these huge amounts of data, unlike in my previous post, where we were limited to 65536 elements :) The trick is in using a 2D grid, i.e., running blocks in both the X and Y dimensions:

```
dim3 THREADS_PER_BLOCK(1024, 1, 1);
dim3 BLOCKS_PER_GRID;
if (N > 65536) {
    //64 BLOCKS * 1024 THREADS = 65536 THREADS PER GRID ROW
    BLOCKS_PER_GRID = dim3(64, N / 65536, 1);
} else {
    BLOCKS_PER_GRID = dim3(N / 1024, 1, 1);
}
```

This is just a simple modification to our host code; however, we still need to know how to synchronize the memory reads/writes across blocks. I have been curious about this for a long time, but I haven't found anything better than simply using 2 (that's enough) or 3 memory arrays and calling the kernel multiple times with different arguments. This means something like "put the synchronization part on the host". It is a good method, I believe, and I have improved it since last time (FFT on long arrays), because I no longer need to copy the whole array back to the input after the work is finished. Instead, during each step we simply swap input and output (Step 0: read from input, write to output. Step 1: read from output, write to input. …):

```
__global__ void Kernel(int *input, int *output, int *results, int space, int step, int steps, bool Direction)
{
    // space =     GAP BETWEEN NEIGHBORS
    // Direction = COPY FROM INPUT TO OUTPUT OR FROM OUTPUT TO INPUT
    // step =      CURRENT PROCESSING STEP
    // steps =     TOTAL AMOUNT OF STEPS NEEDED TO CALCULATE SCAN

    int tix = threadIdx.x + blockDim.x * blockIdx.x;
    int tiy = threadIdx.y + blockDim.y * blockIdx.y;
    int tid = tix + tiy * gridDim.x * blockDim.x;

    int res = 0;
    if (Direction) {
        //OUTPUT -> INPUT
        if (tid < space) {
            res = output[tid]; //ONLY REWRITE TO CORRECT MEMORY ADDRESS
        } else {
            res = output[tid] + output[tid - space];
        }
        input[tid] = res;
    } else {
        //INPUT -> OUTPUT
        if (tid < space) {
            res = input[tid]; //ONLY REWRITE TO CORRECT MEMORY ADDRESS
        } else {
            res = input[tid] + input[tid - space];
        }
        output[tid] = res;
    }

    //THE FINAL STEP: WRITE RESULTS INTO CORRECT LOCATION ON GPU
    if (step == (steps - 1)) {
        results[tid] = res;
    }
}
```
We also need a minor modification to the host code, as I have said already:
```
//LAUNCH KERNELS IN A LOOP
int space = 1;
int steps = static_cast<int>(log2(static_cast<float>(N)));
for (int step = 0; step < steps; step++) {
    //SWAP READ/WRITE BUFFERS EVERY STEP
    bool direction = (step % 2) != 0;
    Kernel<<<BLOCKS_PER_GRID, THREADS_PER_BLOCK>>>(dev_input, dev_output, dev_results, space, step, steps, direction);
    space = space * 2;
}
```

Note that the arguments must be passed in the order the kernel declares them (space, step, steps, direction); mixing up the integer and boolean parameters compiles fine but silently breaks the scan.

The most important thing in implementing an algorithm is to always draw a picture of how it works and think about it for a while :) That's really all. Have a good day, and feel free to email me at [email protected] … Alternatively, you can join the Intro to Parallel Computing course at Udacity, or post some comments :)

EDIT: 18.4.2016

The old code is a mess and can easily be replaced with a modern version (see the last link, which implements the SCAN on large arrays). The major modifications include: