Today, I decided to show you how to implement a Gaussian blur and further optimize it to run more efficiently using NVIDIA CUDA.

First, you might wonder how a blur really works. It's very simple, though we need to know what a convolution is, or more precisely a 2D convolution. We will show it on an example: let's take two matrices of the same dimension. A convolution is a function that, for each element of the input data, calculates a new value like this (hopefully self-explanatory):

 


[Figures Conv1 and Conv2: a worked example of the 2D convolution]
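To make the definition concrete, here is a small CPU sketch of the same convolution. It is only an illustrative sketch: the function and parameter names are mine, and out-of-range neighbours are treated as zeros, as discussed later in the text.

```c
// Plain CPU 2D convolution: every output pixel is the sum of its DIM x DIM
// neighbourhood, weighted by the convolution matrix (mask). Neighbours that
// fall outside the image are treated as zeros.
void convolveCPU(const unsigned char *in, unsigned char *out,
                 int width, int height,
                 const float *mask, int dim /* odd: 1, 3, 5, ... */)
{
    int half = dim / 2;
    for (int y = 0; y < height; ++y)            // two loops over the image dimensions
        for (int x = 0; x < width; ++x)
        {
            float sum = 0.0f;
            for (int j = -half; j <= half; ++j) // two loops for the convolution itself
                for (int i = -half; i <= half; ++i)
                {
                    int px = x + i, py = y + j;
                    if (px >= 0 && px < width && py >= 0 && py < height)
                        sum += in[py * width + px] * mask[(j + half) * dim + (i + half)];
                }
            // For a normalized (Gaussian) mask the sum stays within 0-255.
            out[y * width + x] = (unsigned char)sum;
        }
}
```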

 

This is not a problem to implement on any CPU: we only need two loops over the dimensions of the input data and another two loops to calculate the 2D convolution itself. As you can see, each output value is independent of the others, therefore this process can be implemented very effectively using CUDA. To start with, here is our basic kernel, limited only by the throughput of the GPU's global memory:
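(A sketch of such a kernel follows; identifiers like blurKernel and convMask are illustrative, and DIM stands for the convolution matrix dimension discussed below.)

```cuda
#define DIM 5                                  // convolution matrix dimension, must be odd

__constant__ float convMask[DIM * DIM];        // convolution matrix in constant memory

__global__ void blurKernel(const unsigned char *input, unsigned char *output,
                           int imageWidth, int imageHeight)
{
    // Each thread computes one output pixel.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < imageWidth && y < imageHeight)
    {
        float sum = 0.0f;
        for (int j = 0; j < DIM; ++j)
            for (int i = 0; i < DIM; ++i)
            {
                int px = x - DIM / 2 + i;      // neighbour coordinates
                int py = y - DIM / 2 + j;
                // The long "if": skip neighbours outside the image, which
                // effectively treats them as zeros.
                if (px >= 0 && px < imageWidth && py >= 0 && py < imageHeight)
                    sum += input[py * imageWidth + px] * convMask[j * DIM + i];
            }
        output[y * imageWidth + x] = (unsigned char)sum;
    }
}
```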

Now, as you can see, we get an input image represented by char values (0-255) and other info about the image which we may need: the image width and the image height. In the first step we create a number of variables that we may use later (a good reader will notice that we are not using all of them). Then we check whether a thread should compute a new pixel value; if so, the thread calculates a 2D convolution within the two loops (note that DIM is odd: 1, 3, 5, 7, …). Now back to our mission: you may wonder why there is something like DIM / 2. That's because if we have a pixel at position [x][y], we need to calculate the convolution from point [x-DIM/2][y-DIM/2]. However, these indices might fall out of range of our image, so we need to check that the new coordinates lie within the image dimensions (that's the long "if"). Otherwise we might encounter this situation: first pixel at position [0][0], convolution starting from [0-something][0-something]. As you can see, we get a negative index, which means the thread would try to read from an unspecified memory location, resulting in an application error. Furthermore, there is the question of what to do with the convolution on the borders of the input data, where there are not enough elements. To solve this, we simply use 0 wherever we can't calculate the convolution. This has an impact on the borders themselves, as they appear darker.

 

Now you may ask how to implement this more effectively. We have two options: use texture memory instead of global memory, or use shared memory. It might be interesting to compare these two methods, but I like to use shared memory, so I decided to store the pixel values there. The problem is how to transfer a small part of the image into the shared memory. I finally decided to create a larger field of pixel values (chars) than we actually need. I have also fixed the number of threads per block at 16×16 = 256 threads. This is not exactly the best option, but it works pretty well. Now I set up the temporary shared field with these dimensions: 4*THREADS x 4*THREADS (THREADS = 16). I fill it with zeros and then load the pixel values into it. This is, however, a problem, because I have only 256 threads and I want to fill a field of 4096 elements, so each thread needs to load 16 values. After that, we can calculate a convolution up to this dimension: THREADS + THREADS/2 + 1 = 25. If we need a bigger dimension, we have to set other constants, such as 32×32 threads per block and __shared__ temp[(4-6) * THREADS][(4-6) * THREADS]. But take care: shared memory is small (48 KB on devices with Compute Capability 2.0 or later). If we use [6*32][6*32], that is roughly 36 KB, which is still fine and the kernel can handle it, but [7*32][7*32] is roughly 50 KB! In that case the kernel is shut down immediately after launch and we don't get any result or any warning that the application failed.
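Here is a sketch of how such a tiled kernel could look (again with illustrative names; the 4*THREADS tile, the zero fill and the 16 loads per thread follow the scheme described above):

```cuda
#define THREADS 16                        // threads per block in each dimension
#define TILE    (4 * THREADS)             // 64 x 64 shared tile of pixel values
#define DIM     5                         // convolution matrix dimension (odd)

__constant__ float convMask[DIM * DIM];

__global__ void blurKernelShared(const unsigned char *input, unsigned char *output,
                                 int imageWidth, int imageHeight)
{
    __shared__ unsigned char temp[TILE][TILE];

    // Top-left corner of the tile in image coordinates: the block's 16x16
    // region sits in the middle of the 64x64 tile, surrounded by a halo.
    int tileX = blockIdx.x * THREADS - (TILE - THREADS) / 2;
    int tileY = blockIdx.y * THREADS - (TILE - THREADS) / 2;

    // Each of the 256 threads fills a 4x4 patch of the tile (16 values),
    // storing zeros where the patch falls outside the image.
    for (int j = 0; j < TILE / THREADS; ++j)
        for (int i = 0; i < TILE / THREADS; ++i)
        {
            int sx = threadIdx.x * (TILE / THREADS) + i;   // position in the tile
            int sy = threadIdx.y * (TILE / THREADS) + j;
            int px = tileX + sx;                           // position in the image
            int py = tileY + sy;
            temp[sy][sx] = (px >= 0 && px < imageWidth && py >= 0 && py < imageHeight)
                               ? input[py * imageWidth + px] : 0;
        }
    __syncthreads();

    // This thread's pixel, in image coordinates and inside the tile.
    int x  = blockIdx.x * THREADS + threadIdx.x;
    int y  = blockIdx.y * THREADS + threadIdx.y;
    int cx = threadIdx.x + (TILE - THREADS) / 2;
    int cy = threadIdx.y + (TILE - THREADS) / 2;

    if (x < imageWidth && y < imageHeight)
    {
        float sum = 0.0f;
        for (int j = 0; j < DIM; ++j)
            for (int i = 0; i < DIM; ++i)
                sum += temp[cy - DIM / 2 + j][cx - DIM / 2 + i] * convMask[j * DIM + i];
        output[y * imageWidth + x] = (unsigned char)sum;
    }
}
```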

Now that everything is clear, you may wonder what the convolution matrix is and how to get it :) … In fact, it's simply a Gaussian curve in 2D:

[Figure: the 2D Gaussian curve]

 

The creation of the convolution matrix is simple; it should be done on the CPU side and then transferred to the constant memory of the device :)
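For completeness, here is a minimal host-side sketch of that step, assuming the constant-memory symbol convMask used in the kernel sketches above; the matrix is normalized so its weights sum to 1, which keeps the overall image brightness unchanged.

```cuda
#include <math.h>

#define DIM 5                                   // convolution matrix dimension (odd)

__constant__ float convMask[DIM * DIM];         // the symbol the kernel reads from

// Build a DIM x DIM Gaussian matrix on the CPU and copy it to device constant memory.
void uploadGaussianMask(float sigma)
{
    float mask[DIM * DIM];
    float sum = 0.0f;
    int half = DIM / 2;

    for (int y = -half; y <= half; ++y)
        for (int x = -half; x <= half; ++x)
        {
            // 2D Gaussian: G(x, y) = exp(-(x^2 + y^2) / (2 * sigma^2))
            float value = expf(-(float)(x * x + y * y) / (2.0f * sigma * sigma));
            mask[(y + half) * DIM + (x + half)] = value;
            sum += value;
        }

    // Normalize so the weights sum to 1.
    for (int i = 0; i < DIM * DIM; ++i)
        mask[i] /= sum;

    cudaMemcpyToSymbol(convMask, mask, sizeof(mask));
}
```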