I remember starting my first class in CUDA programming. I was really excited: so much new information about hardware, threads, and grids … I really wanted to understand what was happening out there in the GPU cores, but unfortunately I had lots of other work. My first homework was to implement a SCAN operation on the GPU using CUDA C. I laugh every time I remember what my code looked like; it was a nightmare … and everything but good, efficient, and useful code. In fact, the GPU CUDA implementation was about 10 times slower than the CPU version …

Recently, I have been studying at Udacity and, you know, I don't like homework … But I decided to even the score by implementing a good SCAN algorithm :) For those who have never met SCAN, it's an operation on one array: each output element is the sum of all the input elements up to and including its position (input = {0,1,2,3,4,5,6 …} –> output = {0,1,3,6,10,15,21 …}). There are two major versions of this algorithm: INCLUSIVE and EXCLUSIVE SCAN. Today we will talk about the first one and how to implement it on the GPU using the Hillis-Steele algorithm. Let us see how it works:
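
As a point of reference, an inclusive SCAN is trivial to express sequentially. Here is a minimal CPU sketch (the function and variable names are mine, not from the original code):

	//SEQUENTIAL INCLUSIVE SCAN: output[i] = input[0] + input[1] + ... + input[i]
	void ScanCPU(const int *input, int *output, int n)
	{
		int sum = 0;
		for (int i = 0; i < n; i++){
			sum += input[i]; //RUNNING TOTAL INCLUDES THE CURRENT ELEMENT
			output[i] = sum;
		}
	}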

At first, take a look at the number of steps. It is a log2(N) dependence, therefore we need 3 steps to calculate the SCAN of 8 elements. Now, how it works is easy. We begin with a variable "space" = 1. During each iteration, we multiply this variable by 2. This is a very useful variable, because if we have "N" threads, each thread can do the correct operation: if ("threadIdx" < space) –> just rewrite the last result for the next step. If ("threadIdx" >= space) –> add the element positioned at [threadIdx – space] to the current one. Continue to the next step.
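
To see what each step does, here is the trace for 8 elements, worked out by hand:

	input:     0  1  2  3   4   5   6   7
	space = 1: 0  1  3  5   7   9  11  13   (each element adds its neighbor 1 to the left)
	space = 2: 0  1  3  6  10  14  18  22   (adds the element 2 positions back)
	space = 4: 0  1  3  6  10  15  21  28   (adds the element 4 positions back)

After log2(8) = 3 steps, every position holds its inclusive prefix sum.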

Now this is really very easy, but note that we have to synchronize writing to and reading from memory. This is a problem in many algorithms (FFT, Bitonic Sort, SCAN …). As you know, we can only synchronize within a block, so we can calculate a SCAN of up to 1024 elements without any problems:

__global__ void Kernel(int *input, int *output)
{
	int tid = threadIdx.x;

	//USE SHARED MEMORY - COME ON, WE ARE EXPERIENCED PROGRAMMERS
	__shared__ int Cache[1024];
	Cache[tid] = input[tid];
	__syncthreads();
	int space = 1;

	//BEGIN: 10 = log2(1024) STEPS
	for(int i = 0; i < 10; i++){
		int temp = Cache[tid];
		int neighbor = 0;
		if((tid - space) >= 0){
			neighbor = Cache[tid - space];
		}
		__syncthreads(); //AFTER LOADING

		if(tid < space){
			//DO NOTHING: THIS ELEMENT IS ALREADY FINISHED
		}else{
			Cache[tid] = temp + neighbor;
		}

		space = space * 2;
		__syncthreads(); //BEFORE THE NEXT ITERATION READS
	}

	//REWRITE RESULTS TO MAIN MEMORY
	output[tid] = Cache[tid];
}
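
For completeness, launching this single-block kernel could look like the following. This is a minimal sketch assuming host_input and host_output are int arrays of length N on the host; the names and the surrounding setup are mine:

	int N = 1024;
	int *dev_input, *dev_output;
	cudaMalloc(&dev_input, N * sizeof(int));
	cudaMalloc(&dev_output, N * sizeof(int));
	cudaMemcpy(dev_input, host_input, N * sizeof(int), cudaMemcpyHostToDevice);

	//ONE BLOCK OF 1024 THREADS - ONE THREAD PER ELEMENT
	Kernel<<<1, N>>>(dev_input, dev_output);

	cudaMemcpy(host_output, dev_output, N * sizeof(int), cudaMemcpyDeviceToHost);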

And of course, because accessing main memory takes too much time, we use shared memory instead to implement the problem more effectively :) … There is still, however, one big problem: GPU algorithms only calculate effectively when they have huge amounts of data to process … And for sure, 1024 elements is simply not enough, so it would be better to leave such a small calculation to the CPU. On the other hand, I like to use the GPU :) Therefore, I decided to implement this algorithm to be able to handle these huge amounts of data, unlike in my previous post, where we were limited to 65536 elements :) The trick is in using a 2D grid, I mean, giving the running blocks both X and Y dimensions:

	dim3 THREADS_PER_BLOCK(1024, 1, 1);
	dim3 BLOCKS_PER_GRID;
	if(N > 65536){
		//64 BLOCKS x 1024 THREADS = 65536 ELEMENTS PER GRID ROW, THE REST GOES INTO Y
		BLOCKS_PER_GRID = dim3(64, N/65536, 1);
	}else{
		BLOCKS_PER_GRID = dim3(N/1024, 1, 1);
	}

This is just a simple modification added to our host code; however, we still need to know how to synchronize the memory reading/writing across blocks. I have been curious about this for a long time, but I have not yet found anything better than simply using 2 (that's enough) or 3 memory arrays and calling a kernel multiple times with different arguments. This means something like "put the synchronization part on the host". It is a good method, I believe, and I have improved it since last time (FFT on long arrays), because I no longer need to copy the whole array back to the input after the work is finished. Instead, during each step we simply swap input and output (step 0: read from input, write to output; step 1: read from output, write to input; …):

__global__ void Kernel(int *input, int *output, int *results, int space, int step, int steps, bool Direction)
{
	int tix = threadIdx.x + blockDim.x * blockIdx.x;
	int tiy = threadIdx.y + blockDim.y * blockIdx.y;
	int tid = tix + tiy*gridDim.x*blockDim.x;
	// space =      GAP BETWEEN NEIGHBORS
	// Direction =  COPY FROM INPUT TO OUTPUT OR FROM OUTPUT TO INPUT
	// step =       CURRENT PROCESSING STEP
	// steps =      TOTAL AMOUNT OF STEPS NEEDED TO CALCULATE SCAN

	int res = 0;
	if(Direction){
		//OUTPUT -> INPUT
		if(tid < space){
			res = output[tid]; //ONLY REWRITE TO THE CORRECT MEMORY ADDRESS
			input[tid] = res;
		}else{
			res = output[tid] + output[tid-space];
			input[tid] = res;
		}
	}else{
		//INPUT -> OUTPUT
		if(tid < space){
			res = input[tid]; //ONLY REWRITE TO THE CORRECT MEMORY ADDRESS
			output[tid] = res;
		}else{
			res = input[tid] + input[tid-space];
			output[tid] = res;
		}
	}

	//THE FINAL STEP: WRITE RESULTS INTO THE CORRECT LOCATION ON THE GPU
	if(step == (steps-1)){
		results[tid] = res;
	}
}
We also need a minor modification to the host code, as I have said already:
	//LAUNCH KERNELS IN A LOOP
	int space = 1;
	int steps = static_cast<int>(log2(static_cast<float>(N)));
	for (int step = 0; step < steps; step++){
		bool direction = (step % 2) != 0; //EVEN STEPS READ FROM INPUT, ODD STEPS READ FROM OUTPUT
		Kernel<<<BLOCKS_PER_GRID, THREADS_PER_BLOCK>>>(dev_input, dev_output, dev_results, space, step, steps, direction);
		space = space * 2;
	}
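
After the loop finishes, the final-step branch in the kernel has already written the complete SCAN into dev_results, regardless of which way the ping-pong was going. So only one copy back to the host remains; a minimal sketch, where host_results is my name for an int array of length N on the host:

	//THE LAST STEP WROTE EVERYTHING INTO dev_results, SO COPY ONLY THAT ARRAY BACK
	cudaMemcpy(host_results, dev_results, N * sizeof(int), cudaMemcpyDeviceToHost);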


The most important thing in implementing an algorithm is to always draw a picture of how it works and think about it for a while :) That's really all. Have a good day and feel free to email me at [email protected] … you can alternatively join the Intro to Parallel Computing course at Udacity or post some comments :)

EDIT: 18.4.2016

The old code is a mess and can easily be replaced with a modern version (see the last link, which implements the SCAN on large arrays). The major modifications include:

  • No additional variables during block/thread initialization
  • Automatic calculation of the number of steps
  • Kernel calls with parameters rather than memcpy(4*int)
  • A CPU version of the same algorithm added for comparison
  • More user-friendly for-loops, results display, and program exit
  • malloc has been replaced with the "new" operator