Today, I would like to show you how to implement a C# application that can access the computation power of your GPU using CUDA. As you may know, CUDA is an extension of the C/C++ language, so we will need a way to make these two languages cooperate. For those who are more familiar with Java, I would say that C# and Java are very similar languages: both manage their own memory and both are object-oriented, whereas most C/C++ code relies heavily on pointers (or at least I am still using them) …
I have been wondering what kind of CUDA code to use as a demonstration, and I have finally decided on my favorite, the dot product :D … (The dot product is the sum of the element-wise products of two vectors: A[0]*B[0] + A[1]*B[1] + … + A[N-1]*B[N-1].) For simplicity, I have filled the two vectors with "1" and "2", so the total result can be calculated as N*2 … very simple, isn't it? … Now back to our mission: I have added some code that measures the time needed to do the calculation on both the GPU and the CPU. This is how my application looks in the Microsoft Visual Studio 2012 GUI designer.
I have added 3 buttons; the first one, "info", is just a message box saying something similar to what is written here. There is a text box where the user enters a number N (the length of the two vectors). The last 4 text boxes are self-explanatory thanks to their labels. The most interesting part comes with the "CPU" and "GPU" buttons. They don't need much explanation either, so let's see what the CPU button does:
```csharp
private void CPUbutton_Click(object sender, EventArgs e)
{
    //INITIALIZE
    string temp = this.textBox1.Text;
    int length = int.Parse(temp);
    int res = 0;

    //BEGIN TIME MEASURING
    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();

    //CALCULATE
    for (int i = 0; i < length; i++)
    {
        res += (1) * (2);
    }

    //END TIME MEASURING
    stopWatch.Stop();
    TimeSpan ts = stopWatch.Elapsed;

    //WRITE THE CALCULATION RESULT AND TIME
    string elapsedTime = ts.TotalMilliseconds + "ms";
    this.textBox5.Text = String.Format("{0:00.0}", res);
    this.textBox6.Text = elapsedTime;
}
```
As you can see, it is very simple, and most of the code is just the necessary boilerplate for measuring time and showing results. Now let's see what the GPU button does:
```csharp
private void GPUbutton_Click(object sender, EventArgs e)
{
    string temp = this.textBox1.Text;
    int length = int.Parse(temp);
    float[] res = new float[2];
    IntPtr pointer;

    //CALL GPU TO DO THE HARD WORK
    pointer = GPUKern(length);
    Marshal.Copy(pointer, res, 0, 2);
    this.textBox3.Text = String.Format("{0:00.0}", res[0]);

    //TIME
    this.textBox4.Text = res[1] + "ms";
}
```
Well, this is more interesting. You can see there is an IntPtr named pointer receiving the result of some function GPUKern. First of all, IntPtr is a platform-specific integer type that C# uses to represent raw pointers and handles when cooperating with other languages (C/C++). The function GPUKern is external to the application, made available through a DLL (Dynamic Link Library), and it returns a pointer to an array of 2 floats (so in C terms: float*). Then there is Marshal (fortunately not Admiral, General or anything else :D ) … Marshal.Copy copies data between the unmanaged memory of the C/C++ side and the managed memory of the C# application. You specify the source pointer and the destination array, along with the start index and the number of elements to copy (in our case 0 and 2). After that, the results are available in our native C# code.
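On the native side, what Marshal.Copy reads is nothing more than a heap-allocated array of two floats. A minimal C sketch of the shape of such a returned block (the function name is hypothetical, not the article's DLL):

```c
#include <stdlib.h>

/* Returns a pointer to two floats: [0] = calculation result,
 * [1] = elapsed time in ms. The caller (here, the C# side via
 * Marshal.Copy) reads both values; note the block leaks unless
 * the DLL also exports a matching free function. */
float *make_result(float value, float elapsed_ms) {
    float *res = (float *)malloc(2 * sizeof(float));
    if (res == NULL) return NULL;
    res[0] = value;
    res[1] = elapsed_ms;
    return res;
}
```

This is exactly the layout GPUKern below produces: two floats, result first, time second.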
There is, however, one additional problem. You have to tell the application that there is an external function GPUKern and that it needs to load a DLL library in order to use it. This can be accomplished with these two lines of code:
```csharp
using System.Runtime.InteropServices;

namespace SimpleCSharpWithCuda
{
    public partial class Form1 : Form
    {
        [DllImport("GPUCSharpDLL.dll", CharSet = CharSet.Unicode)]
        public static extern IntPtr GPUKern(int num);
        ...
    }
}
```
- Please note that for the application to work correctly, the .dll must be in the same folder as the executable (or in a directory on the system's DLL search path).
Now that we know how the application looks on the C# side, let's look at creating the DLL library. In Visual Studio, create a new empty C/C++ console project and select the DLL option in the wizard. The next step is to go to the build customizations of the project and select CUDA 5.5. Additionally, you need to go to Project Properties -> Configuration Properties -> Linker -> Input and add "cudart.lib". Now you can create an empty file with the .cu extension.
You can work with this file as you would in any other C/C++ project, but you don't need any "main" function. Instead, you write all of your includes at the top and wrap the code in an extern "C" { } block. Let's see how it looks:
```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

extern "C" {

static void HandleError(cudaError_t err, const char *file, int line)
{
    if (err != cudaSuccess) {
        int aa = 0;
        printf("%s in %s at line %d\n", cudaGetErrorString(err), file, line);
        scanf("%d", &aa);   //pause so the message stays visible
        exit(EXIT_FAILURE);
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

//DOT PRODUCT KERNEL
__global__ void Kernel(float *input1, float *input2, float *output, int num)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int tx = threadIdx.x;
    float temp = 0;
    __shared__ float Cache[512];

    //GRID-STRIDE LOOP OVER THE INPUT
    while (tid < num) {
        temp += input1[tid] * input2[tid];
        tid += blockDim.x * gridDim.x;
    }
    Cache[tx] = temp;
    __syncthreads();

    //BEGIN PARALLEL REDUCTION
    int i = blockDim.x / 2;
    while (i != 0) {
        if (tx < i) {
            Cache[tx] += Cache[tx + i];
        }
        __syncthreads();
        i /= 2;
    }

    //WRITE RESULT
    if (threadIdx.x == 0) {
        output[blockIdx.x] = Cache[0];
    }
}

__declspec(dllexport) float* GPUKern(int num)
{
    //INITIALIZE
    float *dev_input1;
    float *dev_input2;
    float *dev_output;
    cudaEvent_t start, stop;
    float elapsedTime = 0;
    float *input1 = (float*)malloc(sizeof(float) * num);
    float *input2 = (float*)malloc(sizeof(float) * num);
    float *res = (float*)malloc(2 * sizeof(float));

    int blocks = num / 512;   //NUMBER OF RUNNING BLOCKS
    if (blocks > 128) {
        blocks = 128;         //128 * 512 = 65536 = max threads in flight
    }
    if (blocks < 1) {
        blocks = 1;           //AVOID A ZERO-BLOCK LAUNCH FOR num < 512
    }
    float *host_res = (float*)malloc(sizeof(float) * blocks);
    res[0] = 0; res[1] = 0;

    //FEED INPUT
    for (int i = 0; i < num; i++) {
        input1[i] = 1.0f;
        input2[i] = 2.0f;
    }

    //ALLOCATE MEMORY FOR INPUT
    HANDLE_ERROR(cudaMalloc(&dev_input1, sizeof(float) * num));
    HANDLE_ERROR(cudaMalloc(&dev_input2, sizeof(float) * num));
    HANDLE_ERROR(cudaMalloc(&dev_output, sizeof(float) * blocks));
    HANDLE_ERROR(cudaEventCreate(&start));
    HANDLE_ERROR(cudaEventCreate(&stop));
    HANDLE_ERROR(cudaEventRecord(start, 0));

    //COPY INTO DEVICE
    HANDLE_ERROR(cudaMemcpy(dev_input1, input1, sizeof(float) * num, cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_input2, input2, sizeof(float) * num, cudaMemcpyHostToDevice));

    //DIMENSIONS
    dim3 BLOCKS_PER_GRID(blocks);
    dim3 THREADS_PER_BLOCK(512, 1, 1);
    int runfor = num - (num % 512);
    int rest = num % 512;

    //LAUNCH KERNEL
    Kernel<<<BLOCKS_PER_GRID, THREADS_PER_BLOCK>>>(dev_input1, dev_input2, dev_output, runfor);

    //COPY BACK RESULTS
    HANDLE_ERROR(cudaMemcpy(host_res, dev_output, sizeof(float) * blocks, cudaMemcpyDeviceToHost));

    //FINISH ON THE CPU SIDE
    for (int i = 0; i < blocks; i++) {
        res[0] += host_res[i];
    }
    //SO THE INPUT CAN BE ANY NUMBER AND NOT JUST A MULTIPLE OF 512:
    //the tail starts at runfor, the first index the kernel did not touch
    for (int i = 0; i < rest; i++) {
        res[0] += input1[i + runfor] * input2[i + runfor];
    }

    //GET CALCULATION TIME
    HANDLE_ERROR(cudaEventRecord(stop, 0));
    HANDLE_ERROR(cudaEventSynchronize(stop));
    HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, start, stop));
    res[1] = elapsedTime;

    //FREE GPU MEMORY
    HANDLE_ERROR(cudaFree(dev_input1));
    HANDLE_ERROR(cudaFree(dev_input2));
    HANDLE_ERROR(cudaFree(dev_output));
    HANDLE_ERROR(cudaEventDestroy(start));
    HANDLE_ERROR(cudaEventDestroy(stop));

    //FREE HOST MEMORY (res itself is returned to the caller)
    free(input1);
    free(input2);
    free(host_res);

    return res;
}

} // extern "C"
```
You should be familiar with most of this code. There is an error-handling function that is used throughout GPUKern. Furthermore, there is the kernel code beginning with __global__ void. A few notes about this code: I have decided to keep things very simple, so each running block is a 1D block of 512 threads, and the grid is capped at 128 blocks (128 * 512 = 65536 threads in flight, so the code will be rather inefficient for larger inputs). As you know, the dot product ends in a sum, and reductions are not a natural fit for the GPU either. But we can make it at least somewhat efficient by using shared memory and summing the partial products within each block using an operation called parallel reduction. The per-block results are then handed back and the CPU finishes the work.
- Also note the __declspec(dllexport) before the GPUKern function. This declaration exports the function from the DLL so that other applications can find and call it.
The other parts of the code are the very basics of CUDA programming: allocating memory on the GPU, copying data in, launching the kernel, copying the results back, and finally freeing the memory. Now comes the problem with this demonstration code. Since the dot product is a very cheap operation and we are using only up to 65536 threads, the CPU is faster in every case :D … GPU code usually pays off for larger and more complex workloads, so a rather poor showing for the GPU in this CPU-vs-GPU comparison was to be expected :)