Hi everybody, today, I am going to show you how to efficiently implement bitonic sort with CUDA. The reason why I am doing this is just because when I started learning CUDA, implementing bitonic sort was my homework and at that time,it was really difficult for me to implement it. The version I have made was several times slower than bubble sort on the CPU :D … So I have decided to make a new version, lets see how can we proceed :)

At first,you of course have to know,what a bitonic sequence is. Its a sequence with just 1 minimum and 1 maximum.So for example 2,7,10,15,13,9,8,1 is a nice bitonic sequence.Also note that each 2 random numbers create a bitonic sequence of length = 2. (1 min + 1max). The key of sorting an arbitrary sequence with bitonic sort lies in creating bitonic sequences of larger size in each step. So basically we start with Bt2 and continue to Bt4,Bt8,Bt16 and so on. A very nice property of bitonic sequence is that it can be “quite” easily implemented in parallel. To understand how it works a flowchart is exactly what we need:

The input is the same as the example on this site. However I have made an update to its flow to make it clear how it works. At first, the grey squares indicates, that its an input to a thread,so for example in stage 0 & substage 0, threads 0-7 will be operating od data with indexes: (0,2,4,6,8,10,12,14). These indexes are the left indexes. Each thread will also consume 1 value with a distance of Nb (stands for neighbor) to the right. Now that each thread has 2 values to compare, it has to decide wheather its creating an incresing sequence (I) or decreasing sequence (D). An example: a thread operates on values 7,13 and has to create an increasing sequence. Since 7 is on the left and 13 on the right,its already an increasing sequence,so the thread will just skip this step. Furthermore,if you look carefully,we have a Bt2 sequence at the very beginning. After passing stage 0,we obtain a Bt4 and after passing stage 1,we obtain Bt8 … so of course after passing stage2,we have a bitonic sequence of length = 16,which is exactly what we need :)

Now you may be wondering how can we tell each thread wheather it has to create a decreasing or increasing sequence and how can we tell the thread to use the correct input index :) … Well lets take a look at it in general:

This table represents the I/D property of each thread during each Stage. Note that we are talking about Stage and not Substage,which will be the subject to the next table. You can lookup the flowchart, if your not sure wheather its really that simple. The mapping function looks like this:

for i from 0 to 7 do if ((i mod (2^Stage*2))<(2^Stage)) then print("I") end if; if ((i mod (2^Stage*2))>=(2^Stage)) then print("D") end if; end do;

The codewriting is from Maple 16, since its a bit more diffucult in CUDA … You can see, that we have solved the I/D problem with just two if statements :) Now back to the indexing. Each thread will be processing items based on Substage and not Stage. Again a table represents the global view:

This is a bit more diffucult. We got Substages, but in fact the thing we need is only Nb. The Table itself represents Stage 2 in the flowchart at the beginning. Given Nb = 1,we obtain indexing suitable for Stage 0. To make it clear Nb corresponds exactly to exp2(Stage) or 2^Stage. At each Substage, we divide the Nb by 2 so we obtain different Nb = 4,2,1 for each Substage. By the way I hope you have already realized that the number of Substages increases as the Stage numbering increases :) … Again,to solve the indexing problem, we have a special function:

for i from 0 to 8 do (i mod (Nb)) + (iquo(i,Nb)*(2*Nb)) end do;

- This on the other side can be easier written in CUDA …. the iquo() function represents how many times can we deduct a number without using the remainder. So for example iquo(7,2) =3. In CUDA its just i/Nb …

Now that we know everything,we can finally put some code in it ;) … To make this implementation fast,I have decided to use shared memory,which results in a more complicated code in general (But its still OK :) ). We will be launching 2 kernels. The first one will use the shared memory while the second one will not. The reason for this lies in the flowchart. If you look back, we can say,that a well specified number of threads works on the same data during some Stages and Substages. In the example,we have 8 running threads,but all of them operates on indexes 0-15. This is why we can preload the input to shared memory,use __syncthreads() and make the computation faster :) I will not represent the CUDA-Host sauce here, so the first Kernel is as folllows:

//Shared memory kernel - responsible for creating bitonic sequences of length 512. __global__ void BitonicKernel(int *input, int *additional){ //Init int glTid = threadIdx.x + blockDim.x*blockIdx.x*2; //Global Tid,note the *2 at the end. (As the index needs to be corrected due to "missing" threads from 256-511,768-1023 and so on) int tid = threadIdx.x; //Local Tid int bx = blockDim.x; __shared__ int cache[512]; //Load and sync cache[tid] = input[glTid]; cache[tid+bx] = input[glTid+bx]; __syncthreads(); for(int Stage = 0; Stage<8;Stage++){ int Nb = (int)(exp2((double)Stage)); for(int Substage = 0;Substage<=Stage;Substage++){ //Map threads //int index = (tid % Nb) + (tid/Nb)*(Nb*2); int index = (int)fmod((float)tid,(float)Nb) + (tid/Nb)*(Nb*2); int exp2St = (int)exp2((double)Stage); if(fmod((float)tid,(float)(exp2St*2)) < exp2St){ //Increasing int left = cache[index]; int right = cache[index+Nb]; if(left<right){ //This is ok }else if(left > right){ //Swap cache[index] = right; cache[index+Nb] = left; } } if(fmod((float)tid,(float)(exp2St*2)) >= exp2St){ //Decreasing int left = cache[index]; int right = cache[index+Nb]; if(left > right){ //This is ok }else if(left < right){ //swap cache[index] = right; cache[index+Nb] = left; } } //At the end of each substep Nb = Nb/2; __syncthreads(); } } //Results back to global memory input[glTid] = cache[tid]; input[glTid+bx] = cache[tid+bx]; }

As you were expecting, we are loading input into the shared memory at the beginning and syncing the load, note that each thread has to load 2 values. 2 for loops follow. They of course represent Stages and Substages. In the inner loops you can see everything we were talking about before. I/D statements and the index function. I have added some comments to make it clear what the code does. At the end of each substep, we are dividing Nb by 2 and syncing the threads again. Once the loops are finished, we write our results back to the main memory. Now after launching this kernel, we have a Bt512 sequence :) … Or better Sequences – if N was 1024, we would get two Bt512 sequences and so on.

But usually N is larger than 512 and we will be forced to launch another Kernel :

__global__ void BitonicKernelLarge(int *input,int *additional){ //Init int tid = threadIdx.x + blockDim.x*blockIdx.x; int Nb = additional[0]; int Stage = additional[1]; int Substage = additional[2]; int index = (int)fmod((float)tid,(float)Nb) + (tid/Nb)*(Nb*2); int exp2St = (int)exp2((double)Stage); //Do the comparsion if(fmod((float)tid,(float)(exp2St*2)) < exp2St){ //Increasing int left = input[index]; int right = input[index+Nb]; if(left<right){ //This is ok }else if(left > right){ //Swap input[index] = right; input[index+Nb] = left; } } if(fmod((float)tid,(float)(exp2St*2)) >= exp2St){ //Decreasing int left = input[index]; int right = input[index+Nb]; if(left > right){ //This is ok }else if(left < right){ //swap input[index] = right; input[index+Nb] = left; } } }

This one is not that complex. We dont have any loops, nor we have the shared memory. Each thread will just read 3 values regarding the Nb,Stage and Substage information and make some comparsions on the input data based on these variables. The code itself is however not that simple on the Host side because we have done something I call the ,,Put the sync part on the host” … Lets take a look at the host code then:

//Launch the first kernel - shared memory kernel BitonicKernel<<<BLOCKS_PER_GRID,THREADS_PER_BLOCK>>>(dev_input,dev_additional); //Launch Other kernels for completing the process for(int Stage = 8;Stage<17;Stage++){ host_additional[0] = (int)(exp2f((float)Stage)); //Nb for(int Substage = 0;Substage <= Stage;Substage++){ host_additional[1] = Stage; //Stage host_additional[2] = Substage; //Substage cudaMemcpy(dev_additional,host_additional,7*sizeof(int),cudaMemcpyHostToDevice); BitonicKernelLarge<<<BLOCKS_PER_GRID,THREADS_PER_BLOCK>>>(dev_input,dev_additional); host_additional[0] = host_additional[0]/2; } }

You can see we are calling the first kernel at the beginning. But then, we have 2 loops, that were previously in the first kernel. They just cant stay in the GPU code duo to sync problems, so we have transfered them to the host. At each step, we update the additional information the threads will need, then we copy the variables to the GPU memory and launch the kernel. This results in lots of kernel launches, but I believe its a good way to achieve a very good performance. You can note that the host loops start at Stage = 8, which means we have saved lots of data transfer and kernel calls with the first kernel :) The last Stage sorts the entire bitonic sequence.

- And as usual the source files: VS2012.