If you are looking for a guide to implementing FFT in CUDA/OpenCL for your custom use with Radix 2, Radix 4, Radix 8 (and other powers of 2), you have come to the right place. This topic has been discussed several times (not only) on my site, because it's just an awesome algorithm :) I will not go into detail and will jump straight into the code, which was created for testing purposes as a MATLAB MEX file (DLL extension). More can be found here:

- https://www.beechwood.eu/fft-implementation-r2-dit-r4-dif-r8-dif/
- https://www.beechwood.eu/creating-matlab-mex-file/

First of all, to program an FFT we need a bit-reversal (radix 2) / digit-reversal (radix 4, 8, 16, …) algorithm. Without it, the output of the FFT is out of order; by digit-reversing the input indexes we obtain the standard FFT output, with the DC component at the left of the array and the high frequencies in the middle (this can be centered with the fftshift algorithm). The digit-reversal algorithm is pretty easy to implement: first we take our input index and convert it into the selected base (2, 4, 8, …). Then we reverse the digits and convert the result back into base 10:
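The conversion-based approach can be sketched in plain C++ as a host-side illustration (the function name and signature here are mine, not taken from the downloadable code):

```cpp
#include <vector>

// Digit-reverse an index by converting it to base `radix`, reversing the
// digit order, and converting back to base 10. `n` is the transform length
// (a power of `radix`), which fixes the digit count -- leading zeros matter.
unsigned digit_reverse(unsigned index, unsigned radix, unsigned n)
{
    // Collect the base-`radix` digits, least significant first.
    std::vector<unsigned> digits;
    for (unsigned m = n; m > 1; m /= radix) {
        digits.push_back(index % radix);
        index /= radix;
    }
    // Re-assembling with digits[0] (the old least significant digit) as the
    // new most significant digit performs the reversal.
    unsigned reversed = 0;
    for (unsigned d : digits)
        reversed = reversed * radix + d;
    return reversed;
}
```

For radix 2 this degenerates to ordinary bit reversal, e.g. index 1 of an 8-point transform (001) maps to 4 (100).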

Although this method works perfectly and suits all kinds of radixes, it is important to note that it is pretty slow compared with bit operators, which will not be discussed here; it is shown for demonstration purposes only. There are basically two options for storing the digits in the selected base. We can use an integer for each digit, or (to save some memory and registers) we can use the most basic data unit, a char, which suits this application just as well. Another option would be to store all digits in a single integer, knowing that each decimal place then represents something different (e.g. the digit string 23 read in base 4 is 2*4^1 + 3*4^0 = 11 in base 10). This second option involves more arithmetic operations. Anyway, as I said, to properly program the FFT one will most likely use the bit operators.
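The single-integer storage option can be illustrated with a small hypothetical helper (name and signature are mine): the decimal digits of the packed value are read as base-radix digits.

```cpp
// Decode a value whose base-`radix` digits are stored as the decimal digits
// of a single int (e.g. packed = 23 holds the digit string 2,3).
// Decoding costs an extra div/mod pair per digit versus a char array.
unsigned decode_packed(unsigned packed, unsigned radix)
{
    unsigned value = 0, place = 1;
    // Peel off decimal digits (least significant first) and weight each one
    // by the matching power of `radix`.
    while (packed > 0) {
        value += (packed % 10) * place;
        packed /= 10;
        place *= radix;
    }
    return value;
}
```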

A more straightforward method uses an index function for the bit reversal. The idea is quite simple: given a data index, one has to find a function (or algorithm) that computes the reversed index of the data. In other words, in the table above, the function maps the leftmost column to the rightmost column. In fact, I use such functions very frequently in my CUDA code in general, mainly because they are easy to understand. Figuring out the function, however, sometimes takes hours of visualizing the data indexes, not to mention that the radix is also a variable of our function. I found plain MS Excel to be a very good assistant, along with MATLAB for testing the functions. The Excel sheet visualizing various radixes can be downloaded HERE. I suggest you take a look at it, especially at sheet 2 (R4 256).

In short, the linear function that maps the input/output indexes for digit reversal is defined as follows:

- Calculate the number of digits required (2 for a base-4 16-point transform, 3 for a base-4 64-point transform, …).
- Find the "Group Distance", which is equal to N / R.
- Loop across the digits and find the corresponding values of the base-radix powers (X*4^0 + Y*4^1 + Z*4^2 + …). Do not forget to MOD() these values by the base radix (the reason will be obvious after looking into the Excel sheet).
- Join up all the results to get the output index.

Example (input indexes 13, 14, 15 and 16; Radix 4, 256-point):

As you can see, the powers of 4 (1, 4, 16 and 64) are always multiplied by a number ranging from 0 to (Radix - 1). Once the digit at the highest power reaches (Radix - 1), the value at the next lower power is incremented, and so on until we exhaust all the input indexes. The MATLAB implementation of this algorithm/function, with comments, can be downloaded HERE.
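The steps above can be written as a purely arithmetic function, with no digit array at all, just one div/mod pair per digit (this is my own sketch of the mapping, not the downloadable MATLAB code):

```cpp
// Arithmetic digit-reversal index function. `groupdist` starts at N / R
// (the "Group Distance") and shrinks by R each pass, while the digit
// extracted from `index` via MOD by the radix is weighted by it.
unsigned digitrev_index(unsigned index, unsigned radix, unsigned n)
{
    unsigned out = 0;
    for (unsigned groupdist = n / radix; groupdist >= 1; groupdist /= radix) {
        out += (index % radix) * groupdist;  // weight digit by current power
        index /= radix;                      // move on to the next digit
    }
    return out;
}
```

With Radix 4 and N = 256, standard digit reversal maps the example inputs 13, 14, 15, 16 to 112, 176, 240, 4 (values computed from the rule above, not copied from the original example table).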

Another quite useful gadget in the CUDA code is the set of complex math operators, which have to be defined for better readability. I have implemented the FFT butterflies as __device__ functions:

```cpp
// CI(x) returns i*x and CM(re, im) builds the complex constant re + i*im
// (helper operators defined elsewhere in the code); the X_k lines below are
// the rows of the corresponding N-point DFT matrix.
__device__ R2Complex Radix2Butterfly(cufftComplex A, cufftComplex B,
                                     cufftComplex T1, cufftComplex T2)
{
    R2Complex Rout;
    Rout.A = (A + B) * T1;
    Rout.B = (A - B) * T2;
    return Rout;
}

__device__ R4Complex Radix4Butterfly(cufftComplex A, cufftComplex B,
                                     cufftComplex C, cufftComplex D,
                                     cufftComplex T1, cufftComplex T2,
                                     cufftComplex T3, cufftComplex T4)
{
    R4Complex Rout;
    Rout.A = (A + B + C + D) * T1;
    Rout.C = (A - B + C - D) * T3;
    Rout.B = (A - CI(B) - C + CI(D)) * T2;
    Rout.D = (A + CI(B) - C - CI(D)) * T4;
    return Rout;
}

__device__ R8Complex Radix8Butterfly(cufftComplex A, cufftComplex B, cufftComplex C, cufftComplex D,
                                     cufftComplex E, cufftComplex F, cufftComplex G, cufftComplex H,
                                     cufftComplex T1, cufftComplex T2, cufftComplex T3, cufftComplex T4,
                                     cufftComplex T5, cufftComplex T6, cufftComplex T7, cufftComplex T8)
{
    R8Complex Rout;
    Rout.X1 = (A + B + C + D + E + F + G + H) * T1;
    Rout.X5 = (A - B + C - D + E - F + G - H) * T5;
    Rout.X3 = (A - CI(B) - C + CI(D) + E - CI(F) - G + CI(H)) * T3;
    Rout.X7 = (A + CI(B) - C - CI(D) + E + CI(F) - G - CI(H)) * T7;
    Rout.X2 = (A + CM( CUDART_SQRT_HALF_F, -CUDART_SQRT_HALF_F) * B - CI(C)
                 + CM(-CUDART_SQRT_HALF_F, -CUDART_SQRT_HALF_F) * D - E
                 + CM(-CUDART_SQRT_HALF_F,  CUDART_SQRT_HALF_F) * F + CI(G)
                 + CM( CUDART_SQRT_HALF_F,  CUDART_SQRT_HALF_F) * H) * T2;
    Rout.X4 = (A + CM(-CUDART_SQRT_HALF_F, -CUDART_SQRT_HALF_F) * B + CI(C)
                 + CM( CUDART_SQRT_HALF_F, -CUDART_SQRT_HALF_F) * D - E
                 + CM( CUDART_SQRT_HALF_F,  CUDART_SQRT_HALF_F) * F - CI(G)
                 + CM(-CUDART_SQRT_HALF_F,  CUDART_SQRT_HALF_F) * H) * T4;
    Rout.X6 = (A + CM(-CUDART_SQRT_HALF_F,  CUDART_SQRT_HALF_F) * B - CI(C)
                 + CM( CUDART_SQRT_HALF_F,  CUDART_SQRT_HALF_F) * D - E
                 + CM( CUDART_SQRT_HALF_F, -CUDART_SQRT_HALF_F) * F + CI(G)
                 + CM(-CUDART_SQRT_HALF_F, -CUDART_SQRT_HALF_F) * H) * T6;
    Rout.X8 = (A + CM( CUDART_SQRT_HALF_F,  CUDART_SQRT_HALF_F) * B + CI(C)
                 + CM(-CUDART_SQRT_HALF_F,  CUDART_SQRT_HALF_F) * D - E
                 + CM(-CUDART_SQRT_HALF_F, -CUDART_SQRT_HALF_F) * F - CI(G)
                 + CM( CUDART_SQRT_HALF_F, -CUDART_SQRT_HALF_F) * H) * T8;
    return Rout;
}
```
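As a quick host-side sanity check, the radix-2 butterfly with unit twiddles reproduces the 2-point DFT (sum and difference). The sketch below uses std::complex in place of cufftComplex; the struct and function names are illustrative, not the ones from the kernel code:

```cpp
#include <complex>

using cx = std::complex<float>;
struct R2 { cx a, b; };

// Host-side mirror of Radix2Butterfly: sum and difference of the two
// inputs, each scaled by its twiddle factor.
R2 radix2_butterfly(cx A, cx B, cx T1, cx T2)
{
    return { (A + B) * T1, (A - B) * T2 };
}
```

With A = 3, B = 1 and T1 = T2 = 1 the outputs are 4 and 2, i.e. the 2-point DFT of {3, 1}.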

As you can see, the R2 and R4 radixes are very simple, while R8 looks a bit more complex. The reason will be obvious after looking at the IO matrix in the first link; in fact, the code is identical to the 2×2, 4×4 and 8×8 tables. The IO matrix can also be generated for other radixes:

```matlab
% CREATE M - IO Butterfly Matrix
% Can be parametrized with Radix variable
M = nan(Radix, Radix);
for i = 1:Radix
    for j = 0:Radix-1
        Ex = mod(j*(i-1), Radix);
        M(i, j+1) = exp(-1j*2*pi*Ex/Radix);
    end
end
```

The implementation of the FFT itself is straightforward: we create a host-based main computing loop launching our radix kernels. The only awkward part is the twiddle factors, which have to be computed somehow. Either they are computed during each stage for each butterfly separately (the current implementation), or they are precomputed and the threads must then be able to locate the correct twiddle using a custom function. I have also structured the data from the different radixes for code readability. In real scenarios, the butterflies should reside directly in the kernels, not outside as device functions. Also worth noticing is the increasing demand for registers: the higher the radix, the more registers are needed to calculate the butterfly, so higher radixes end up launching fewer threads at once. The number of operations saved by computing higher radixes can be found, for example, here: http://www.cmlab.csie.ntu.edu.tw/cml/dsp/training/coding/transform/fft.html . It is important to note that the FFT's speed is highly dependent on memory bandwidth.
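A minimal sketch of the precomputation option, assuming a simple host-side table of W_N^k = exp(-2*pi*i*k/N) (the actual table layout and indexing in the downloadable TwiddleOpt version may differ):

```cpp
#include <complex>
#include <vector>
#include <cmath>

// Precompute all N twiddle factors W_N^k = exp(-2*pi*i*k/N) on the host.
// A kernel would index this table (after copying it to device or constant
// memory) instead of calling sin/cos per butterfly.
std::vector<std::complex<float>> make_twiddles(unsigned n)
{
    std::vector<std::complex<float>> w(n);
    for (unsigned k = 0; k < n; ++k) {
        float angle = -2.0f * 3.14159265358979f * static_cast<float>(k) / n;
        w[k] = { std::cos(angle), std::sin(angle) };
    }
    return w;
}
```

The trade-off is one table lookup (memory traffic) against two transcendental function calls per twiddle, which is why the two variants are worth benchmarking separately.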

- The whole code is available for download (VS2013 64Bit CUDA 7 MATLAB R2016a).
- Call: GPURadix248(Data);

- Version with Twiddle factors precalculation: (VS2013 64Bit CUDA 7 MATLAB R2016a).
- Call: GPURadix248(Data,TwiddleOpt); TwiddleOpt = [0,1]

- Version with Shared memory* usage: (VS2013 64Bit CUDA 7 MATLAB R2016a).
- Call: GPURadix248(Data,TwiddleOpt,SMemOpt); SMemOpt = [0,1].

- Twiddle indexes visualization: ExcelTwiddles

*Note that the DIF (Decimation In Frequency) algorithm uses shared memory in the last stages, while DIT (Decimation In Time) uses it in the first stages. This is easier to see in the signal flowcharts. Shared memory is hard-coded for specific thread counts: 512 threads for Radix 2, 256 threads for Radix 4 and 64 threads for Radix 8. These numbers should suit most GPU architectures when optimized.

Additional improvements can be made by simplifying the code and removing redundant commands. To further speed up the algorithm, the use of the native fast CUDA math functions is recommended. Optionally, one can take advantage of the special CUDA memory types (constant/texture memory for storing the twiddle factors). Lastly, it is important to note that the code can also be optimized for a specific architecture. If you have specific questions, please leave them in the comments. Next time I update this post, I will focus on calculating FFTs of other sizes as well.