Sometimes I believe I can speedup my algorithm by manually implementing specific kernels such as a vector sum, vector multiply or matrix multiply. I usually spent a lot of time coding and an impressive amount of time debugging. But is it really necessary? I will take a short look at the matrix matrix multiply,which was discussed several times on my blog. Its a computing intensive task with high memory usage. The naive implementation is extra easy to write. Its bottleneck however lies in the limited throughput of the GPU memory. Therefore a method called “Tiling” can be used along with shared memory on the GPU to reduce the memory load and reduce the computing time. This implementation is however no longer as easy and some at least basic knowledge of the GPU and its hardware is required. I have been testing recently whether its worth optimizing my own kernel,or just use the nVidia’s CuBLAS library. The result is somewhat fascinating.

### Custom Global & Shared Impl.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
__global__ void MatrixMatrixMultiplyGlobal( float *MatrixIN1, float *MatrixIN2, float *MatrixOut,int Nx,int Ny) { int tix = threadIdx.x + blockIdx.x*blockDim.x; int tiy = threadIdx.y + blockIdx.y*blockDim.y; if(tix< Nx && tiy < Ny){ float Res = 0.0f; for(int i = 0;i<Nx;i++){ int tid1 = tiy * Nx + i; int tid2 = tix + Nx*i; Res += MatrixIN1[tid1] * MatrixIN2[tid2]; } MatrixOut[tix + tiy * Nx] = Res; } } |

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 |
__global__ void MatrixMatrixMultiplyShared( float *MatrixIN1, float *MatrixIN2, float *MatrixOUT, int Nx, int Ny ) { int tix = threadIdx.x + blockDim.x * blockIdx.x; int tiy = threadIdx.y + blockDim.y * blockIdx.y; int Tile = blockDim.x; /* --- RECOMMENDED TILESIZE: 16 x 16 THREADS | 32 x 32 THREADS --- AMOUNT OF SHARED MEMORY IS A PARAMETER TO KERNEL CALL DURING RUNTIME THE DEFAULT ALLOCATION IS FOR ONE ARRAY. CREATION OF MULTIPLE DYNAMIC ARRAYS MUST BE SOLVED MANUALLY */ extern __shared__ float matrix_Data[]; float *matrix_a = &matrix_Data[0]; float *matrix_b = &matrix_Data[Tile*Tile]; //FILL WITH ZEROS AND SYNC matrix_a[Tile*threadIdx.y + threadIdx.x] = 0.0f; matrix_b[Tile*threadIdx.y + threadIdx.x] = 0.0f; __syncthreads(); float FinalSum = 0; for( int i = 0; i < ( Nx / Tile ) + 1; i++ ) { //LOAD PARTS INTO SHARED MEMORY AND SYNC int Load1X = ( i*Tile + threadIdx.x ); int Load2Y = ( i*Tile + threadIdx.y ); // LOAD DATA FROM MEMORY if( Load1X < Nx && tiy<Ny ) { matrix_a[Tile*threadIdx.y + threadIdx.x] = MatrixIN1[Load1X + tiy*Nx]; } else { matrix_a[Tile*threadIdx.y + threadIdx.x] = 0.0f; } if( tix < Nx && Load2Y<Ny ) { matrix_b[Tile*threadIdx.y + threadIdx.x] = MatrixIN2[tix + Load2Y*Nx]; } else { matrix_b[Tile*threadIdx.y + threadIdx.x] = 0.0f; } // SYNC AFTER LOAD __syncthreads(); // SHARED MEMORY LOOP for( int j = 0; j < Tile; j++ ) { FinalSum += matrix_a[threadIdx.y*Tile + j] * matrix_b[j*Tile + threadIdx.x]; } // SYNC AFTER COMPUTING __syncthreads(); } if( tix < Nx && tiy < Ny ) { // SAVE DATA TO OUTPUT MATRIX MatrixOUT[tiy*Nx + tix] = FinalSum; } } |

Although one will likely encounter a situation,where MM has to be manually implemented, my advice is to use the power of nVidia’s libraries everywhere they can be used. The reason is given below (Tested on nVidia Quadro M2000 on CUDA 8.0 64-bit. Measured results are average of 1000 runs). The average speedup of CuBLAS over Shared memory implementation is roughly **10 times **(for non – trivial matrices). The difference could be possibly reduced by using intrinsic functions and optimizing via fast math,loop unrolling and eliminating divergent branches. Therefore, I do believe that beating the Blas library is not an easy task.

The only interesting part is my own implementation of the **Vector sum**. Its a bit faster, than the CuBLAS version, but the difference probably comes from missing data transfer of the result back to CPU, since I do not usually transfer the result. The sum of a 1024-element vector takes **8.1**μs while on CuBLAS** 115**μs (MSI GTX 960). I Still wonder what causes this difference :)