Maybe you have heard of them already, the libraries for CUDA: cuBLAS, cuFFT, Thrust and others as well … These libraries mainly implement functions and algorithms that have been optimized directly by NVIDIA employees for the highest performance. cuBLAS stands for NVIDIA CUDA Basic Linear Algebra Subroutines. cuFFT is a library for performing the Fast Fourier Transform (FFT), and Thrust is a general-purpose library where you can find, for example, an optimized parallel reduction and other famous algorithms. (If you would like to learn more about Thrust, note that it is the library mainly used in the book CUDA Application Design and Development.)

Anyway, we are going to use cuFFT today. As we have seen in many code examples, the classic routine consists of allocating GPU memory, filling the host data, transferring the data to the GPU, launching a kernel and then copying the results back from the GPU. We will be using cuFFT in a similar manner. The only difference is that your data are stored in a struct, "cufftComplex", where you can specify the .x (real) and .y (imaginary) values for your calculation. There is also no kernel launch. All you have to do is create a plan, which can have 1, 2 or 3 dimensions, and then execute the calculation using "cufftExecC2C()", where C2C stands for Complex to Complex Fourier Transform. In most applications, however, you will use only R2C (Real to Complex). A good thing is that you can specify the direction of the calculation: cufftExecC2C(plan, dev_data, dev_data, CUFFT_FORWARD); Here, CUFFT_FORWARD means a classic forward transformation. You can change it to CUFFT_INVERSE to get your signal values back. Note that cuFFT does not normalize its results, so after a forward transform followed by an inverse one you have to divide every value by the number of samples.
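The routine described above can be sketched roughly as follows. This is only a minimal sketch and not the original listing: the function name roundtrip_error, the test signal and the length of 1024 samples are assumptions made for illustration. It runs a forward and then an inverse C2C transform in place and checks how closely the rescaled result matches the input.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: forward + inverse C2C transform of a sine wave.
// Returns the largest absolute error after rescaling, or -1 on failure.
float roundtrip_error(int n)
{
    const float pi = 3.14159265358979f;
    size_t bytes = sizeof(cufftComplex) * n;
    cufftComplex *host = (cufftComplex *)malloc(bytes);
    cufftComplex *back = (cufftComplex *)malloc(bytes);
    cufftComplex *dev_data = NULL;
    cufftHandle plan = 0;
    float max_err = -1.0f;

    // Fill the host data: .x is the real part, .y the imaginary part.
    for (int i = 0; i < n; ++i) {
        host[i].x = sinf(2.0f * pi * 8.0f * i / n);
        host[i].y = 0.0f;
    }

    if (cudaMalloc((void **)&dev_data, bytes) == cudaSuccess &&
        cufftPlan1d(&plan, n, CUFFT_C2C, 1) == CUFFT_SUCCESS) {
        cudaMemcpy(dev_data, host, bytes, cudaMemcpyHostToDevice);

        // No kernel launch: the plan executes the transform in place.
        cufftExecC2C(plan, dev_data, dev_data, CUFFT_FORWARD);
        cufftExecC2C(plan, dev_data, dev_data, CUFFT_INVERSE);

        cudaMemcpy(back, dev_data, bytes, cudaMemcpyDeviceToHost);

        // cuFFT is unnormalized, so divide by n before comparing.
        max_err = 0.0f;
        for (int i = 0; i < n; ++i) {
            float err = fabsf(back[i].x / n - host[i].x);
            if (err > max_err) max_err = err;
        }
        cufftDestroy(plan);
    }

    cudaFree(dev_data);
    free(host);
    free(back);
    return max_err;
}

int main(void)
{
    printf("max round-trip error: %g\n", roundtrip_error(1024));
    return 0;
}
```

Passing dev_data as both input and output makes the transform work in place, which saves one device buffer. If you want to keep the input, allocate a second buffer and pass it as the third argument.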

As you can see, it's quite simple. The only thing that needs a bit of explanation is creating the plan (for a 1D transform, a call of the form cufftPlan1d(&plan, DIM, CUFFT_C2C, 1)). You can read that line this way: cuFFT simply needs to know the length of the input samples (DIM) and then creates an appropriate plan for the input data. It also examines your hardware and selects the best and fastest method (there can be a slight difference when using, for example, shorter input arrays). The last "1" at the end tells cuFFT that you have only 1 input array. If you have 2 arrays of length DIM, stored one after the other in memory, you need to specify 2. The other lines should be familiar to you by now, as they implement the memcpy operations, allocation, measuring performance and, last of all, freeing the used memory. If you would like to update this version, note that the created events also need to be destroyed by calling "cudaEventDestroy(start)" … and the same for stop, of course.
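To make the last plan argument and the event cleanup more concrete, here is a small sketch under assumed values. The function name time_batched_fft, the length of 4096 and the zero-filled input are hypothetical. It creates a plan for two arrays stored back to back, times the transform with CUDA events, and destroys everything it created.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: times one batched forward C2C transform.
// Returns the elapsed time in milliseconds, or -1 on failure.
float time_batched_fft(int n, int batch)
{
    cufftComplex *dev_data = NULL;
    cufftHandle plan;
    cudaEvent_t start, stop;
    float ms = -1.0f;
    // 'batch' arrays of length n, stored one after the other.
    size_t bytes = sizeof(cufftComplex) * n * batch;

    if (cudaMalloc((void **)&dev_data, bytes) != cudaSuccess)
        return -1.0f;
    cudaMemset(dev_data, 0, bytes);  // dummy input, all zeros

    // The last argument is the number of arrays to transform.
    if (cufftPlan1d(&plan, n, CUFFT_C2C, batch) != CUFFT_SUCCESS) {
        cudaFree(dev_data);
        return -1.0f;
    }

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cufftResult r = cufftExecC2C(plan, dev_data, dev_data, CUFFT_FORWARD);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the transform has finished

    if (r == CUFFT_SUCCESS)
        cudaEventElapsedTime(&ms, start, stop);

    // Clean up: events, plan and device memory.
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(plan);
    cudaFree(dev_data);
    return ms;
}

int main(void)
{
    printf("2 x 4096-point FFT took %f ms\n", time_batched_fft(4096, 2));
    return 0;
}
```

The first call usually includes some one-time setup cost, so for a fair comparison with your own FFT it is probably better to run the transform a few times and average the later runs.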

  • If you are using Visual Studio 2010, you need to tell the linker to include cufft.lib: Project -> Properties -> Configuration Properties -> Linker -> Input -> Additional Dependencies -> add "cufft.lib"


  • In my case, I was a bit … well, how to say it … depressed, as my version of the FFT was about 7 times slower :D … (Anyway, I knew it was not optimized) … Which, in other words, means that I will need to update my algorithm ;)