Debugging CUDA is not an easy task. Even simple kernel calls can cause a lot of trouble and headache, especially when working with shared memory on the device. Hastily written indexing functions are the most common source of problems, along with things that seem obvious at first sight. Detecting these faults usually takes a lot of time, because even if you go over the code again and again, you cannot spot the typo: the code simply looks logical and correct in your mind, and yet it causes the problem. My own solutions used to be either writing a completely new kernel from scratch or checking the results against MATLAB, and both methods usually take a lot of time.

Fortunately, Visual Studio together with the CUDA binaries can help us detect these faults right away. First of all, it is necessary to enable GPU debug information and line information for the debugger, as seen in the image below:
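For readers building outside Visual Studio, these two project options correspond to the nvcc flags `-G` (generate device debug information) and `-lineinfo` (embed source line information); a command-line build might look like this (file and output names are illustrative):

```shell
REM Illustrative equivalent of the Visual Studio settings:
REM -G embeds device debug information, -lineinfo embeds source line numbers,
REM which is what lets cuda-memcheck point at an exact line later on.
nvcc -G -lineinfo kernel.cu -o Application.exe
```

Note that `-G` disables most device-code optimizations, so these flags belong in debug builds only.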


The next step is to build your application and copy “cuda-memcheck.exe” from your CUDA GPU Computing Toolkit (e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin) to your application directory. Then launch cmd (preferably as admin) and type “cuda-memcheck.exe ./Application.exe“. Your application will be launched and any memory failures will be easily detected:


As we can see from the image above, the error comes from reading a piece of memory that is “out of bounds“. Because we have added line information to our application, cuda-memcheck is able to trace the error with single-line precision! Unless that line contains the whole kernel, the correction is now easy to make thanks to cuda-memcheck. To simulate this error, I have written a very simple, “almost a script” piece of code with an offset that guarantees the memory access failure:

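The original listing is shown only as a screenshot, but a minimal sketch of the same class of bug might look like the following (kernel name, sizes, and the offset value are illustrative, not the author's exact code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The offset pushes the last few threads past the end of the input
// allocation, so cuda-memcheck reports an out-of-bounds global read
// on the line with the indexing expression.
__global__ void copyWithOffset(const float *in, float *out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i + offset]; // reads past the end for the last `offset` threads
}

int main()
{
    const int n = 256, offset = 4;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float)); // no extra room for the offset
    cudaMalloc(&d_out, n * sizeof(float));

    copyWithOffset<<<1, n>>>(d_in, d_out, n, offset);
    cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Run normally, the kernel may even appear to succeed; under cuda-memcheck the invalid reads are reported immediately, and with line information enabled the report points straight at the `in[i + offset]` line.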
Simply run “cuda-memcheck.exe --help” for more options.