 As we know, there is an effort to make everything faster, even for example matrix multiplication, which is my favorite job. Do you know how proud can you become by multiplying 2 matrixes? :D … Well try and you will definitely fall in love :) … Actually, dont thnik of me as a maniac, but the truth is, that I love making things going faster. There is no doubt, that multiplying for example two matrixes of width= 5 and height = 5 makes no sense, as this multiplication is calculated in microseconds on any CPU. However, what if we want to multiply larger matrixes … for example 1024 rows and 1024 colums? Thats exactly a place for speedup. On a CPU, you would run 2 for cycles so you will be able to observe all elements of all three matrixes. But for each point, you will have to use another for cycle for calculating the dot product of the row from matrix A and column from matrix B …. as you see, you will need 1024 * 1024 * 1024 iterations.

Note that each needs to multiply and add … This is going to be extremly slow for any CPU … Fortunately, we dont have to use the 2 outer cycles if we let the GPU do the work for us!  Basically, one would run a block of 16 x 16 threads in x and y direction. By doing so, we will need ,,width/16″ blocks in grid-X-dimension and ,,height/16″ blocks in grid-Y-dimension. Note that we also need to allocate and transfer input arrays of mattix A and B into the GPU.

I dont know how about you, but I like to work with single precision (float) … ao assume, that we have widtd = height = DIM = 1024. We will need to allocate space for 3 Matrixes of 1024*1024 elements, where each element needs 4 bytes. This takes  just about 13Mb of GPU Memory, but when we increase the DIM variable to 8192, one will need 800Mb of memory which can sometimes be a limiting factor for any calculation …  Fortunately, we cant afford to run 67 Million threads on the GPU device :)) …   Lets see a basic implementation of the CUDA Kernel :

The code is self-explanatory: At first, we calculate the x and y index of each thread in the running grid. Then we are going to calculate the offset, which will specify the element posittion in 1D array of floats in matrix A, B and C. We can also check, that a thread is able to calculate a value for the result matrix. (In case  we can run more blocks than we need, these will  be otherwise accessing memory locations, that have not been allocated !). And now, instead of running 3 for cycles, we are gonna launch just one, which will be calculating the dot product for an offset element in matrix C. Note that we are using 2 float registers inside the cycle. temp1 which is the current processing element from matrix A and temp2 which is the current processing element from matrix B. We add their multiplication result to a ,,sum” register … then we just write the result back into the global memory. This algoritm is much faster than the CPU version, but has one big problem: Global memory accesses. Each thread makes 3*DIM acessess to the global memory, which has of course limited bandwidth. The question now is: Can we limit the number of accessess? Sure we can, but at first, look at this image: As we see, we can divide the dot product of entire row and column into parts (Phase 0 –  3). If we do that,we can loop across each phase and calculate a partial dot product. This means that we will need to use 2 for cycles. One inner to loop acrosss a phase and the second to loop across all the phases. We just need to decide  what block dimension to choose. As we will be using shared memory,we have a limited amount of storage space available. On older graphics cards, there is about 16Kbytes of Shared memory. If we choose 16*16 block dimension. Each block will need 1024Bytes for one shared phase. But note, that we need to use 2 shared locations to improve performace (one for Matrix A and one for matrix B), making the total a 2048 Bytes. This is quite enought, because each Streaming Multiprocessor inside the GPU can run a limited number of blocks at a time. If we use 2048 Bytes per block, no more than 8 blocks can be running on an SM (with 16Kb __shared__ memory). This i s however a good choice. On my GTX 560Ti, I have measured the calculation time to be about 200ms faster (from 870 ms to 670 ms). So now onto the Optimized Kernel:

I have added some usefull coments,as you can see we have 2 loops (Across all phases and across a single phase ).  The most interesting part is however loading values into the shared memory. Each thread observes one element in the shared memory, so its quite clear, that we will be using threadIdx.x and threadIdx.y to access the shared memory. To calculate a corresponding offset is a bit harder to explain. However we will for example cover loading into matrix_a:  DIM*row means the beginning of a row in 1D space. Adding additional ,, i*TILE” offset sets us at the beginning of a phase (with i index). And as we said, each thread reads one value: for row,we use threadIdx.x as its the horizontal direction. Therefore: (DIM*row) + (TILE*i + threadIdx.x). Well and at last: the host code. Which in my case is quite big,because i was testing the performance between CPU,GPU and Optimized GPU … Anyway all GPU version needs to allocate space on the GPU,the copy input matrixes,launch kernel and copy back the result matrix. If you would like to benchmark your CPU,do not use bigger DIM than 1024 (Well 2048 is acceptable if you dont have anythink to do and you have a couple of coffe next to your keyboard …).