Originally, I wanted to write this post about half a year earlier, but :D …

Well, never mind. The problem was: I was implementing an algorithm in CUDA, and I was using the modulo (%) operator in each kernel with different parameters. I think it was either FFT or bitonic sort, but it doesn't matter anyway. Up to some step the algorithm was doing exactly what I wanted, but from that point on everything went wrong. I remember investigating the problem for a few days until I finally found what was causing it. After I discovered what went wrong, I searched Google for similar problems, expecting that many other people would have run into the same issue. Unfortunately, I didn't find any post regarding the modulo 65536 issue, so I decided to write this one.

Now onto the problem: if you are using "**%**" in your CUDA application, make sure the divisor is no larger than 65536 (at least for power-of-two divisors). So, for example, **X % 131072** is wrong and gives a different result than you want. The same applies to larger values; it is only safe to use smaller divisors, e.g. **X % 32** or so. To prove this, I have written a very simple application that uses both **% 65536** and **% 131072** in the main CUDA kernel, and then of course checks the outputs for errors. So let's see what it looks like:

Main Kernel:

```cuda
//Simple modulo kernel
__global__ void Kernel(int *in, int *out, int *out2)
{
    //Remap 2D grid to 1D index
    int tix = threadIdx.x + blockIdx.x * blockDim.x;
    int tiy = threadIdx.y + blockIdx.y * blockDim.y;
    int tid = tix + tiy * gridDim.x * blockDim.x;

    out[tid] = in[tid] % 65536;
    out2[tid] = in[tid] % 131072;
}
```

Host Code:

```cuda
//Allocate memory
cudaMalloc((void**)&dev_in, N*sizeof(int));
cudaMalloc((void**)&dev_out, N*sizeof(int));
cudaMalloc((void**)&dev_out2, N*sizeof(int));
cudaHostAlloc((void**)&host, N*sizeof(int), cudaHostAllocDefault);

//Fill input data
for (size_t i = 0; i < N; i++){
    host[i] = i;
}

//Copy to GPU
cudaMemcpy(dev_in, host, N*sizeof(int), cudaMemcpyHostToDevice);

//Launch Kernel
dim3 THREADS_PER_BLOCK(1024, 1, 1);   //1024 threads in X dim only
dim3 BLOCKS_PER_GRID(64, N/65536);    //64 blocks in X-dimension + N/65536 blocks in Y-dimension
Kernel<<<BLOCKS_PER_GRID, THREADS_PER_BLOCK, 0, 0>>>(dev_in, dev_out, dev_out2);

//Copy first output from GPU
cudaMemcpy(host, dev_out, N*sizeof(int), cudaMemcpyDeviceToHost);

//Check the results of the first output
int err0 = 0;
for (size_t i = 0; i < N; i++) {
    if (host[i] != (int)(fmodf((float)i, (float)65536))){
        err0++;
    }
}

//Copy second output
cudaMemcpy(host, dev_out2, N*sizeof(int), cudaMemcpyDeviceToHost);

//Check the results of the second output (against 131072, not 65536)
int err1 = 0;
for (size_t i = 0; i < N; i++) {
    if (host[i] != (int)(fmodf((float)i, (float)131072))){
        err1++;
    }
}

//Show error info
cout << "Errors in CUDA when using % 65536: " << err0 << endl;
cout << "Errors in CUDA when using % 131072: " << err1 << endl;
```

As you can see, it's a very simple application (even without checking for CUDA errors). I believe the code is self-explanatory for both parts. We initialize the data, send it to the GPU, and process it with our kernel, which produces two outputs. Then we copy the first GPU output back to the host and verify it against the single-precision equivalent of % (fmodf). We do the same with the second output. Finally, we let the user know how many errors there are in each output.

You can download the VS2013 project with a release build (CUDA 6.5) here – no, there aren't any viruses in it; I'm still not able to write one :D …

My own output is as follows:

How to solve this problem: there are two options. The first is to use the single/double-precision floating-point function on the device (fmodf/fmod). This solution, however, leads to performance issues, since these functions rely on the SFUs (Special Function Units) on the GPU, and each multiprocessor has only a very limited number of them (usually 32 nowadays, but this depends on the architecture). A better solution is to replace the modulo operation "**i % n**" with "**i & (n - 1)**" when n is a power of 2.