This is not a problem to implement on any CPU, we only need 2 cycles to loop dimensions of the input data and another 2 cycles to calculate the 2D convolution. As we see, the output data are independent on any there, therefore this proccess can be effectively implemented using CUDA. At first, here is our basic Kernel limited only by the throughput of the GPU memory:

__global__ void Gaussian(unsigned char *input,unsigned char *output,int *addittional) { int x = threadIdx.x + blockIdx.x * blockDim.x; int y = threadIdx.y + blockIdx.y * blockDim.y; int z = threadIdx.z + blockIdx.z * blockDim.z; int offset = x + y * blockDim.x * gridDim.x; int bloff = threadIdx.y * blockDim.x + threadIdx.x; int tx = threadIdx.x; int ty = threadIdx.y; int bdx = blockDim.x; int bdy = blockDim.y; int width = addittional[0]; int height = addittional[1]; float conv_res = 0; //MAIN CONVOLUTION PART //////////////////// NEW ////////////////////////// if (offset < (4*width * height)){ int ii = 0; int jj = 0; for(int i = -DIM/2;i<(DIM/2);i++){ for(int j = -DIM/2;j<(DIM/2);j++){ int temp = 0; //DEFAULT VALUE WHEN OUT-OF-RANGE int new_x = x+j; int new_y = i+y; if((new_y<=height) && (new_y >= 0) && (new_x >=0 ) && (new_x <= width)){ //OUT-OF-RANGE? int new_offset = new_y*width + new_x; temp = input[4*new_offset+z]; } conv_res +=temp*matrix[ii][jj]; jj++; } jj = 0; ii++ ; } //RENEW USED VARIABLES AND SAVE tx = threadIdx.x; ty = threadIdx.y; output[4*offset+z] = (unsigned char) round(conv_res); } //////////////////// NEW ////////////////////////// }

Now, as you can see, we get an input image represented by char values (0-255) and other info about the image, which we may need: Image width and Image height. At first step, we cretae a lot of variables, that we may use after (A good reader will notice, that we are not using all of them) … Well then we are checking, if a thread can compute a new pixel value, if so, the thread calculates a 2D convolution within the 2 loops (note that DIM is odd :1, 3 ,5 ,7 , ….). Now back to our mission, you may wonder why there is something like DIM / 2 … thats becase if we have a pixel at [x][y] posittion, we need to calculate the convolution from point [x-DIM/2][y-DIM/2]. However there is a possibility, that these indices might be out of range of our image, so we need to check, that these new coordinates match our image dimensions (Thats the long ,,if”). Because we may otherwise encounter this situation: First pixel at posittion [0][0], begin convolution from: [0-something][0-something] … as you can see, we get a negative index, which means that the tread will try to read from a not specified location in the memory therefore resulting in an application error. Furthermore there is a question about what to do with the convolution process on borders of our input data as there is not enough elements. To solve this, we simply declare ,,0″ where we cant calculate the convolution. This has an impact on the borders itself, as they appear darker.

Now you may ask, how to impement this more effectively. We have two options: Use Texture Memory instead of Global memory or use Shared memory. Well, it may be interesting to compare these two methods, but i like to use shared memory. Therefore i decided to store the pixel values in the shared memory. The problem is how to transfer a small part of the image into the shared memory …. I finally decided to create a bigger field of pixel values (chars), than we need. I have also firmly set the number of threads per block to 16×16 = 256 threads. This is not exactly the best option, but it works pretty good. Now i set up the temporary shared field with these dimensions: 4*THREADS x 4*THREADS (Threads = 16). I will fill it with zeros and then I am going to load pixel values in it. This is however a problem, because I have only 256 threads and I want to fill a field of 4096 elements. Therefore each thread needs to load 16 values. After that, we can calculate a convolution up to this dimension: THREADS + THREADS/2 +1 = 25. If we need a bigger dimension, we have to set another constants such as 32*32 Threads per block and __shared__ temp[ (4-6) * THREADS][(4-6) * Threads]; But take care, shared memory is a of a small size (48Kb on devices with Compute Capability 2.0 or later) if we use [6*32][6*32] = 36.8Kb which is still ok and the Kernel can handle it, but [7*32][7*32]=50Kb! therefore the kernel will be shutted down immediately after launch and we dont get any result or warning, that the application failed.

global void Gaussian(){ __shared__ unsigned char temp[4*THREADS][4*THREADS]; int x = threadIdx.x + blockIdx.x * blockDim.x; int y = threadIdx.y + blockIdx.y * blockDim.y; int z = threadIdx.z + blockIdx.z * blockDim.z; int offset = x + y * blockDim.x * gridDim.x; int bloff = threadIdx.y * blockDim.x + threadIdx.x; int tx = threadIdx.x; int ty = threadIdx.y; int bdx = blockDim.x; int bdy = blockDim.y; int width = addittional[0]; int height = addittional[1]; float conv_res = 0; //FILL TEMP WITH ZEROS for(int i =0;i<4;i++){ for(int j=0;j<4;j++){ temp[ty][tx] =(unsigned char) 0; tx += bdx; } tx = threadIdx.x; ty += bdy; } tx = threadIdx.x; ty = threadIdx.y; __syncthreads(); //Load pixels . . . if (offset < (width * height)){ tx += THREADS + THREADS/2 - DIM/2; ty += THREADS + THREADS/2 - DIM/2; for(int i = 0;i<DIM;i++){ for(int j = 0;j<DIM;j++){ conv_res += matrix[i][j] * temp[ty+i][tx+j]; } } //RENEW USED VARIABLES tx = threadIdx.x; ty = threadIdx.y; } }

- Now as everything is clear you may wonder whats the convolution matrix and hot to get it :) … In fact, its simple Gaussian Curve in 2D:

The creation of the convolution matrix is simple and it should be made on the side of CPU and then transfered to constant memory of the device :)

**Sourcecode for Visual Studio 2010:**- CPU Version
- GPU Version
- GPU Version Optimized