Originally, I wanted to write this post about half a year ago, but :D …

Well, never mind. The problem was this: I was implementing an algorithm in CUDA, and in each kernel I was using the modulo (%) operator with different parameters. I think it was either an FFT or a bitonic sort, but it doesn't matter anyway. Up to some step, the algorithm was doing exactly what I wanted, but from that point on, everything went wrong. I remember investigating the problem for a few days until I finally found what was causing it. Once I had discovered what went wrong, I searched Google for similar problems, expecting that many other people would have run into the same issue. Unfortunately, I couldn't find any post about the modulo 65536 issue, so I decided to write this one.

Now on to the problem: if you are using “%” in your CUDA application, make sure the divisor is no larger than 65536 (at least for powers of two). Just as an example: (X % 131072) is wrong and you get a different result than you want. The same applies to larger values; it is only safe to use smaller ones, e.g. (X % 32 or so). To prove this, I wrote a very simple application that uses both % 65536 and % 131072 in the main CUDA kernel, and then of course checks the output for errors. So let's see what it looks like:

Main Kernel:
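The original kernel listing was not preserved here, so below is a minimal sketch of what the post describes: one kernel producing two outputs, one via % 65536 and one via % 131072. All identifiers (moduloKernel, out65536, out131072, etc.) are my own assumptions, not the original code.

```cuda
// Hypothetical reconstruction of the kernel described in the text.
// Each thread computes both modulo results for its element.
__global__ void moduloKernel(const unsigned int *in,
                             unsigned int *out65536,
                             unsigned int *out131072,
                             int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        out65536[i]  = in[i] % 65536;   // the case that behaves as expected
        out131072[i] = in[i] % 131072;  // the case the post reports as failing
    }
}
```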

 Host Code:
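The host listing is likewise missing, so here is a hedged sketch of the flow the next paragraph describes: initialize data, copy to the GPU, run the kernel, copy each output back, and count mismatches against a CPU reference. Names, sizes, and launch configuration are assumptions; error checking is omitted, as in the original.

```cuda
#include <cstdio>
#include <cstdlib>

int main()
{
    const int N = 1 << 20;                      // assumed problem size
    const size_t bytes = N * sizeof(unsigned int);

    unsigned int *hIn  = (unsigned int *)malloc(bytes);
    unsigned int *hOut = (unsigned int *)malloc(bytes);
    for (int i = 0; i < N; i++) hIn[i] = (unsigned int)i;

    unsigned int *dIn, *dOut65536, *dOut131072;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut65536, bytes);
    cudaMalloc(&dOut131072, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    moduloKernel<<<(N + 255) / 256, 256>>>(dIn, dOut65536, dOut131072, N);

    // Copy the first output back and count mismatches against a CPU reference.
    int errors65536 = 0, errors131072 = 0;
    cudaMemcpy(hOut, dOut65536, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        if (hOut[i] != hIn[i] % 65536) errors65536++;

    // Same check for the second output.
    cudaMemcpy(hOut, dOut131072, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        if (hOut[i] != hIn[i] % 131072) errors131072++;

    printf("Errors (%% 65536):  %d\n", errors65536);
    printf("Errors (%% 131072): %d\n", errors131072);

    cudaFree(dIn); cudaFree(dOut65536); cudaFree(dOut131072);
    free(hIn); free(hOut);
    return 0;
}
```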

As you can see, it's a very simple application (even without checking for cudaErrors). I believe the code is self-explanatory for both parts. We initialize the data, send it to the GPU, and process it with our kernel, which produces two outputs. Then we copy the first GPU output back to the host and check it for errors against the single-precision host version of %. We do the same with the second output. Finally, we let the user know how many errors there are in each output.

You can download the VS2013 project with a release build (CUDA 6.5) here. No, there aren't any viruses in it; I'm still not able to write one :D …

My own output is as follows: 



How to solve this problem: there are two options. The first is to use the single/double-precision function on the device (fmodf/fmod). This solution, however, leads to performance issues, since this function needs to use the SFUs (Special Function Units) on the GPU, and each multiprocessor has only a very limited number of these (usually 32 nowadays, but this depends on the architecture). A better solution is to replace the modulo operator “i % n” with “i & (n - 1)” whenever n is a power of 2.