Today, I would like to show you how to implement a C# application that can access the computational power of your GPU using CUDA. As you may know, CUDA is programmed in an extension of C/C++, so we will need to find a way to make these two languages cooperate. For those who are more familiar with Java, I would say that C# and Java are very similar languages: both of them manage their memory and both of them are object-oriented, whereas most C/C++ code consists mainly of pointers (or at least I am still using them) …

I have been wondering what kind of demonstration CUDA code to use, and I have finally decided on my favorite, the dot product :D … (The dot product is the sum of the element-wise products of two vectors: A[0]*B[0] + A[1]*B[1] + … + A[N-1]*B[N-1].) For simplicity, I have filled the two vectors with "1" and "2", so the total result can be calculated as N*2 … very simple, isn't it? … Now back to our mission: I have added some code that measures the time needed to do the calculation on both the GPU and the CPU. This is how my application looks in the Microsoft Visual Studio 2012 GUI designer.

 

[Screenshot: the application form in the Visual Studio designer]

I have added 3 buttons. The first one, "Info", is just a message box saying something similar to what is written here. There is a text box where the user has to enter a number N (the length of the two vectors). The last 4 text boxes are self-explanatory thanks to their labels. The most interesting part comes with the "CPU" and "GPU" buttons. Those, however, don't need any explanation, so let's see what the CPU button does:
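As a rough illustration of the idea, here is a minimal C++ sketch of a CPU dot product with timing. It is not the C# handler from the application, and the names dot_cpu, run_cpu and TimedResult are made up for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result type: the dot product plus the elapsed time.
struct TimedResult {
    float value;
    double milliseconds;
};

// Sequential dot product: A[0]*B[0] + A[1]*B[1] + ... + A[N-1]*B[N-1].
float dot_cpu(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Fill the vectors with 1 and 2 as described above, then time the loop.
TimedResult run_cpu(std::size_t n) {
    std::vector<float> a(n, 1.0f), b(n, 2.0f);
    const auto start = std::chrono::steady_clock::now();
    const float value = dot_cpu(a, b);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return {value, elapsed.count()};
}
```

Calling run_cpu(1024) should give a value of 2048, matching the N*2 shortcut. For very large N a float accumulator eventually stops being exact, so the comparison with N*2 is only a sanity check.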

 

As you can see, it is very simple, and most of the code is just necessary boilerplate that measures the time and shows the results. Now let's see what the GPU button does:

Well, this is more interesting, but at least you can see that there is an "IntPtr" named Pointer that receives the return value of a function called "GPUKern". First, I would like to say that IntPtr is a general-purpose C# type that holds a native pointer and is used for cooperation with other languages (C/C++). I can tell you that GPUKern is an external function made available to the application through a DLL (Dynamic Link Library), and that this function returns a pointer to an array of 2 floats (so in C terms it is: float* Pointer;). Now you see there is some Marshal (fortunately not an Admiral, General or anything else :D) … Its Copy function copies data between the managed memory of the C# application and the unmanaged memory of the C/C++ code. You have to specify the source and the destination along with the starting index and the number of elements to copy (in our case 0 and 2). Now the results are available in our native C# code.

There is, however, one additional problem. You have to tell the application that there is an external function GPUKern and that it needs to load a DLL in order to use that function. This can be accomplished with just two lines of code: a DllImport attribute followed by a static extern declaration of the function.

  • Please note that in order for the application to work correctly, the .dll must be in the same folder as the executable (or probably in a system folder).

Now that we know how the application looks on the C# side, we will have a look at how to create the DLL. In Visual Studio, you have to create a new empty C/C++ console project and select in the wizard that you want to create a DLL. The next step is to go to the build customizations of the project and select CUDA 5.5. Additionally, you need to go to Project Properties -> Configuration Properties -> Linker -> Input and add "cudart.lib". Now you can create an empty file with the .cu extension.

You can work with this file as you would in any other C/C++ project, but you don't need any "main" function. Instead, you type all of your includes and write extern "C" { } at the beginning, but let's see how it looks:


You should be familiar with most of the code. There is an error handling function that is used in our GPUKern function. Furthermore, there is the kernel code beginning with "__global__ void". A few notes about this code: I have decided to keep things very simple, so each block is a 1D block of 512 threads (please note that the kernel grid is limited in size, on many devices to 65535 blocks per dimension, so this simple layout will not handle larger numbers well). As you know, the dot product contains a summation, which does not map efficiently onto the GPU either. But we can make it at least a bit more efficient by using shared memory and summing the products within each block using an operation called "parallel reduction". Then we quit and let the CPU finish our work by adding up the partial sums of the blocks.
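To make the block-wise parallel reduction easier to picture, here is a small CPU-only simulation in C++. It is only a sketch and not the kernel from the DLL: the names reduce_block, dot_blocked and kThreadsPerBlock are invented for this example, and an ordinary vector stands in for the shared memory of a block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kThreadsPerBlock = 512;  // block size mentioned in the text

// Simulates one block: every "thread" stores one product in the cache, then
// the cache is folded in half repeatedly until cache[0] holds the block sum.
float reduce_block(const std::vector<float>& a, const std::vector<float>& b,
                   std::size_t block) {
    std::vector<float> cache(kThreadsPerBlock, 0.0f);  // stand-in for shared memory
    for (std::size_t t = 0; t < kThreadsPerBlock; ++t) {
        const std::size_t i = block * kThreadsPerBlock + t;
        if (i < a.size()) {
            cache[t] = a[i] * b[i];  // threads past the end contribute 0
        }
    }
    for (std::size_t stride = kThreadsPerBlock / 2; stride > 0; stride /= 2) {
        // On a GPU these additions run in parallel, with a barrier
        // (__syncthreads()) between two consecutive strides.
        for (std::size_t t = 0; t < stride; ++t) {
            cache[t] += cache[t + stride];
        }
    }
    return cache[0];
}

// One partial sum per block; the CPU finishes by adding the partial sums.
float dot_blocked(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t blocks =
        (a.size() + kThreadsPerBlock - 1) / kThreadsPerBlock;
    float total = 0.0f;
    for (std::size_t blk = 0; blk < blocks; ++blk) {
        total += reduce_block(a, b, blk);
    }
    return total;
}
```

With 512 threads the folding loop needs only 9 steps (256, 128, …, 1) instead of 511 sequential additions, which is where the gain inside a block comes from.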

  • Also note the "__declspec(dllexport)" before our GPUKern function. This statement exports the function from the DLL so that it can be called from other applications.
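The export mechanics can be sketched in plain C++ without any CUDA code. The function GPUKernSketch and the DLL_EXPORT macro below are hypothetical, and the meaning of the two returned floats (a result and a time) is an assumption; the real GPUKern may differ.

```cpp
#include <cassert>

// __declspec(dllexport) is specific to Windows compilers; the macro only
// keeps this sketch compilable on other platforms as well.
#if defined(_WIN32)
#define DLL_EXPORT __declspec(dllexport)
#else
#define DLL_EXPORT __attribute__((visibility("default")))
#endif

// extern "C" switches off C++ name mangling, so the function can be found
// in the DLL under its plain name.
extern "C" {

// Hypothetical stand-in for GPUKern: returns a pointer to 2 floats.
DLL_EXPORT float* GPUKernSketch(int n) {
    assert(n >= 0);
    static float results[2];  // static storage stays valid after the return
    results[0] = static_cast<float>(n) * 2.0f;  // placeholder for the GPU result
    results[1] = 0.0f;                          // placeholder for the measured time
    return results;
}

}  // extern "C"
```

Returning a pointer to static storage keeps the sketch short, but every call overwrites the previous result, so the caller should copy the two values out right away.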

Other parts of the code are the very basics of CUDA programming: allocating memory on the GPU, copying the data, launching the kernel, copying the data back and finally freeing the memory. Now comes the problem of this demonstration code. As the dot product is a very quick operation, and since we are using only a limited number of threads, the CPU is faster in any case :D … GPU codes usually become more efficient for larger and more complex operations, so we should have expected a very bad result in the comparison between the CPU and the GPU :)