I remember starting my first class in CUDA programming. I was really excited: so much new information about hardware, threads and grids … I really wanted to understand what was happening out there in the GPU cores, but unfortunately I had lots of other work. My first homework was to implement a SCAN operation on the GPU using CUDA C. I still laugh when I remember what my code looked like; it was a nightmare, and anything but good, efficient, useful code. In fact, the CUDA implementation was about 10 times slower than the CPU version …
Recently I have been studying at Udacity, and you know, I don't like homework … But I decided to even the score by implementing a good SCAN algorithm :) For those who have never met SCAN: it is an operation on an array where each output element is the sum of all input elements up to and including that position (input = {0,1,2,3,4,5,6 …} –> output = {0,1,3,6,10,15,21 …}). There are two major versions of this algorithm: INCLUSIVE and EXCLUSIVE SCAN. Today we will talk about the first one and how to implement it on the GPU using the Hillis-Steele algorithm. Let us see how it works:
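Before touching the GPU, it helps to see the operation itself. This is a small CPU sketch of an inclusive scan (the function name is mine, just for illustration):

```cpp
#include <vector>

// Sequential inclusive scan: out[i] = in[0] + in[1] + ... + in[i].
std::vector<int> inclusive_scan_cpu(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];     // running sum of everything seen so far
        out[i] = running;     // INCLUSIVE: the current element is included
    }
    return out;
}
```

So `inclusive_scan_cpu({0,1,2,3,4,5,6})` produces `{0,1,3,6,10,15,21}`, exactly the example above.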
First, take a look at the number of steps. It is a log2(N) dependence, so we need 3 steps to calculate the SCAN of 8 elements. Now, how it works is easy. We begin with a variable "space" = 1 and multiply it by 2 during each iteration. This is a very useful variable, because if we have N threads, each thread can do the correct operation: if (threadIdx < space), just rewrite the last result for the next step; if (threadIdx >= space), add the element positioned at [threadIdx - space] to the current one. Then continue to the next step.
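The steps above can be simulated on the CPU before worrying about threads at all (a sketch; each pass of the outer loop stands in for one GPU step, and the temporary copy plays the role of synchronized reads):

```cpp
#include <vector>

// CPU simulation of the Hillis-Steele steps: in each step, "thread" tid
// with tid >= space adds the element at [tid - space]; threads with
// tid < space just carry their value forward. "space" doubles each step,
// so log2(N) steps are enough.
std::vector<int> hillis_steele_cpu(std::vector<int> data) {
    for (int space = 1; space < (int)data.size(); space *= 2) {
        std::vector<int> next(data);   // every "thread" reads the old values
        for (int tid = 0; tid < (int)data.size(); ++tid) {
            if (tid >= space) next[tid] = data[tid] + data[tid - space];
        }
        data.swap(next);               // the step is finished; publish results
    }
    return data;
}
```

For 8 elements the outer loop runs with space = 1, 2, 4, i.e. exactly the 3 steps claimed above.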
This is really easy, but note that we have to synchronize reading from and writing to memory. This is a problem in many algorithms (FFT, bitonic sort, SCAN …). As you know, we can only synchronize within a block, so we can calculate a SCAN of up to 1024 elements without any problems:
__global__ void Kernel(int *input, int *output)
{
    int tid = threadIdx.x;

    // USE SHARED MEMORY - COME ON, WE ARE EXPERIENCED PROGRAMMERS
    __shared__ int Cache[1024];
    Cache[tid] = input[tid];
    __syncthreads();

    int space = 1;

    // BEGIN: 10 = log2(1024) steps
    for (int i = 0; i < 10; i++) {
        int temp = Cache[tid];
        int neighbor = 0;
        if ((tid - space) >= 0) {
            neighbor = Cache[tid - space];
        }
        __syncthreads();  // AFTER LOADING

        if (tid < space) {
            // DO NOTHING
        } else {
            Cache[tid] = temp + neighbor;
        }
        space = space * 2;
        __syncthreads();
    }

    // REWRITE RESULTS TO MAIN MEMORY
    output[tid] = Cache[tid];
}
And of course, because accessing main memory takes too much time, we use shared memory to implement the problem more effectively :) There is still, however, one big problem: GPU algorithms are only effective when they have huge amounts of data to process, and 1024 elements is simply not enough; for such small inputs it is better to leave the calculation to the CPU. On the other hand, I like to use the GPU :) So I decided to implement this algorithm to handle those huge amounts of data, unlike in my previous post, where we were limited to 65536 elements :) The trick is in using a 2D grid, i.e. running blocks in both X and Y dimensions:
dim3 THREADS_PER_BLOCK(1024, 1, 1);
dim3 BLOCKS_PER_GRID;
if (N > 65536) {
    BLOCKS_PER_GRID = dim3(64, N / 65536, 1);
} else {
    BLOCKS_PER_GRID = dim3(N / 1024, 1, 1);
}
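A quick sanity check of this sizing (a host-side sketch with a hypothetical helper name, assuming N is a power of two and at least 1024): with 1024 threads per block, a (64, N/65536) grid gives 64 * (N/65536) * 1024 = N threads, one per element.

```cpp
#include <utility>

// Mirrors the grid sizing above: returns (blocks in X, blocks in Y)
// for 1024 threads per block. Assumes N is a power of two >= 1024.
std::pair<int, int> blocks_per_grid(int N) {
    if (N > 65536) return {64, N / 65536};  // 2D grid for large inputs
    return {N / 1024, 1};                   // 1D row of blocks is enough
}
```

For example, N = 2^20 gives a (64, 16) grid: 64 * 16 blocks * 1024 threads = 2^20 threads.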
This is just a simple modification to our host code; the real question is how to synchronize memory reads and writes across blocks. I have been curious about this for a long time, but I have not found anything better than simply using 2 (that is enough) or 3 memory arrays and calling the kernel multiple times with different arguments. This means something like "put the synchronization part on the host" … I believe it is a good method, and I have improved it since last time (FFT on long arrays), because I no longer need to copy the whole array back to the input after the work is finished. Instead, during each step we simply swap input and output (step 0: read from input, write to output; step 1: read from output, write to input; …):
__global__ void Kernel(int *input, int *output, int *results,
                       int space, int step, int steps, bool Direction)
{
    int tix = threadIdx.x + blockDim.x * blockIdx.x;
    int tiy = threadIdx.y + blockDim.y * blockIdx.y;
    int tid = tix + tiy * gridDim.x * blockDim.x;

    // space     = GAP BETWEEN NEIGHBORS
    // Direction = COPY FROM INPUT TO OUTPUT OR FROM OUTPUT TO INPUT
    // step      = CURRENT PROCESSING STEP
    // steps     = TOTAL AMOUNT OF STEPS NEEDED TO CALCULATE SCAN
    int res = 0;
    if (Direction) {
        // OUTPUT -> INPUT
        if (tid < space) {
            res = output[tid];  // ONLY REWRITE TO CORRECT MEMORY ADDRESS
            input[tid] = res;
        } else {
            res = output[tid] + output[tid - space];
            input[tid] = res;
        }
    } else {
        // INPUT -> OUTPUT
        if (tid < space) {
            res = input[tid];   // ONLY REWRITE TO CORRECT MEMORY ADDRESS
            output[tid] = res;
        } else {
            res = input[tid] + input[tid - space];
            output[tid] = res;
        }
    }

    // THE FINAL STEP: WRITE RESULTS INTO THE CORRECT LOCATION ON THE GPU
    if (step == (steps - 1)) {
        results[tid] = res;
    }
}
We also need a minor modification to the host code, as I said already:
// LAUNCH KERNELS IN A LOOP
int space = 1;
int steps = static_cast<int>(log2(static_cast<float>(N)));
for (int step = 0; step < steps; step++) {
    bool direction = (step % 2) != 0;
    Kernel<<<BLOCKS_PER_GRID, THREADS_PER_BLOCK>>>(dev_input, dev_output, dev_results,
                                                   space, step, steps, direction);
    space = space * 2;
}
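As a sanity check, the whole ping-pong scheme can be simulated on the CPU (a sketch with a hypothetical function name, assuming N is a power of two; each pass of the outer loop stands in for one kernel launch, which is what provides the grid-wide synchronization):

```cpp
#include <cmath>
#include <vector>

// CPU sketch of the ping-pong scheme: one "kernel launch" per step,
// alternating which buffer is read and which is written, so nothing has
// to be copied back between steps. Assumes N is a power of two.
std::vector<int> scan_ping_pong(std::vector<int> input) {
    int N = (int)input.size();
    std::vector<int> output(N);
    int steps = (int)std::log2((float)N);
    int space = 1;
    for (int step = 0; step < steps; ++step) {
        bool direction = (step % 2) != 0;  // false: input -> output, true: output -> input
        const std::vector<int>& src = direction ? output : input;
        std::vector<int>&       dst = direction ? input  : output;
        for (int tid = 0; tid < N; ++tid)
            dst[tid] = (tid < space) ? src[tid] : src[tid] + src[tid - space];
        space *= 2;
    }
    // The result sits in whichever buffer the last step wrote to.
    return (steps % 2 != 0) ? output : input;
}
```

For N = 8 this runs 3 steps (input -> output, output -> input, input -> output) and the result ends up in the output buffer.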
The most important thing when implementing an algorithm is to always draw a picture of how it works and think about it for a while :) That is really all. Have a good day, and feel free to email me at vojta.ters@beechwood.eu … alternatively, you can join the Intro to Parallel Computing course at Udacity, or post some comments :)
EDIT: 18.4.2016
The old code is a mess and can easily be replaced with a modern version (see the last link, which implements the SCAN on large arrays). The major modifications include:
- No additional variables during block/thread initialization
- Automatic calculation of the number of steps
- Kernel calls with parameters rather than memcpy(4 * int)
- CPU version of the same algorithm added for comparison
- More user-friendly for-loops, results display and program exit
- malloc has been replaced with the "new" operator
- Project 1024 SCAN Download VS 2012
- Project LongArray SCAN Download VS 2012
- Project LongArrayX SCAN Download VS2013 CUDA 7.0