On a CPU, you would run 2 nested for loops to visit all elements of the result matrix. But for each element, you have to use another for loop to calculate the dot product of a row from matrix A and a column from matrix B … as you can see, for 1024 x 1024 matrices you will need 1024 * 1024 * 1024 iterations.

Note that each iteration needs a multiply and an add … This is going to be extremely slow on any CPU … Fortunately, we don't need the 2 outer loops if we let the GPU do the work for us! Basically, we run blocks of 16 x 16 threads in the x and y directions. By doing so, we will need "width/16" blocks in the grid x dimension and "height/16" blocks in the grid y dimension. Note that we also need to allocate GPU memory for the input matrices A and B and transfer their data over.

I don't know about you, but I like to work with single precision (float) … so assume that we have width = height = DIM = 1024. We will need to allocate space for 3 matrices of 1024*1024 elements, where each element needs 4 bytes. This takes just about 12 MB of GPU memory, but when we increase the DIM variable to 8192, we will need 768 MB, which can sometimes be a limiting factor for the calculation … Fortunately, we can afford to run 67 million threads on the GPU device :)) … Let's see a basic implementation of the CUDA kernel:

__global__ void Kernel(float *a, float *b, float *c)
{
    int xid = threadIdx.x + blockDim.x * blockIdx.x;
    int yid = threadIdx.y + blockDim.y * blockIdx.y;
    int offset = xid + yid * blockDim.x * gridDim.x;
    if (offset < DIM * DIM) {
        float sum = 0;
        for (int i = 0; i < DIM; i++) {
            float temp1 = a[yid * DIM + i];
            float temp2 = b[DIM * i];   // BUG: misses the column offset (+ xid)
            sum += temp1 * temp2;
        }
        c[offset] = sum;
    }
}

// UPDATE 16.11.2016: THIS IS THE CORRECT VERSION
__global__ void Kernel(float *a, float *b, float *c, int Nx, int Ny)
{
    int tix = threadIdx.x + blockDim.x * blockIdx.x;
    int tiy = threadIdx.y + blockDim.y * blockIdx.y;
    if (tix < Nx && tiy < Ny) {
        int tid = tix + tiy * Nx;
        float sum = 0;
        for (int i = 0; i < Nx; i++) {
            float temp1 = a[tiy * Nx + i];   // element i of row tiy of A
            float temp2 = b[Nx * i + tix];   // element i of column tix of B
            sum += temp1 * temp2;
        }
        c[tid] = sum;
    }
}

The code is self-explanatory: at first, we calculate the x and y index of each thread in the running grid. Then we calculate the offset, which specifies the element's position in the 1D float arrays of matrices A, B and C. We also check that the thread actually has an element of the result matrix to calculate (in case we run more blocks than we need, the extra threads would otherwise be accessing memory locations that have not been allocated!). And now, instead of running 3 nested for loops, we launch just one, which calculates the dot product for the offset element of matrix C. Note that we are using 2 float registers inside the loop: temp1, the currently processed element from matrix A, and temp2, the currently processed element from matrix B. We add their product to a "sum" register … then we just write the result back into global memory. This algorithm is much faster than the CPU version, but has one big problem: global memory accesses. Each thread makes about 3*DIM accesses to global memory, which of course has limited bandwidth. The question now is: can we limit the number of accesses? Sure we can, but first, look at this image:

As we can see, we can divide the dot product of an entire row and column into parts (phases 0 – 3). If we do that, we can loop over the phases and calculate a partial dot product in each one. This means we will need 2 for loops: an inner one to loop across a phase and an outer one to loop across all the phases. We just need to decide what block dimension to choose. As we will be using shared memory, we have a limited amount of storage space available. On older graphics cards, there is about 16 KB of shared memory per multiprocessor. If we choose a 16*16 block dimension, each block will need 1024 bytes for one shared phase. But note that we need 2 shared tiles to improve performance (one for matrix A and one for matrix B), making the total 2048 bytes. This is quite enough, because each streaming multiprocessor inside the GPU can only run a limited number of blocks at a time anyway: with 2048 bytes per block, no more than 8 blocks can be resident on an SM (with 16 KB of __shared__ memory). This is, however, a good choice. On my GTX 560 Ti, I measured the calculation to be about 200 ms faster (from 870 ms down to 670 ms). So now on to the optimized kernel:

__global__ void OptimisedKernel(float *a, float *b, float *c)
{
    // INDEXES
    int xid = threadIdx.x + blockDim.x * blockIdx.x;
    int yid = threadIdx.y + blockDim.y * blockIdx.y;
    int offset = xid + yid * blockDim.x * gridDim.x;
    // ROW AND COLUMN
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    // DECLARE THE USAGE OF SHARED MEMORY
    __shared__ float matrix_a[TILE][TILE];
    __shared__ float matrix_b[TILE][TILE];
    // CHECK OFFSET
    if (offset < DIM * DIM) {
        float sum = 0;
        // PHASE LOOP
        for (int i = 0; i < (DIM / TILE); i++) {
            // LOAD PARTS INTO SHARED MEMORY
            matrix_a[threadIdx.y][threadIdx.x] = a[(DIM * row) + (i * TILE + threadIdx.x)];
            matrix_b[threadIdx.y][threadIdx.x] = b[(i * TILE + threadIdx.y) * DIM + col];
            __syncthreads();
            // LOOP ACROSS SHARED MEMORY
            for (int j = 0; j < TILE; j++) {
                sum += matrix_a[threadIdx.y][j] * matrix_b[j][threadIdx.x];
            }
            __syncthreads();
        }
        c[offset] = sum;
    }
}

I have added some useful comments; as you can see, we have 2 loops (across all phases and across a single phase). The most interesting part, however, is loading values into shared memory. Each thread handles one element of the shared tile, so it is quite clear that we will use threadIdx.x and threadIdx.y to index the shared memory. The corresponding global offset is a bit harder to explain, so let's cover loading into matrix_a as an example: DIM*row is the beginning of the row in the 1D array. Adding the additional "i*TILE" offset puts us at the beginning of the phase with index i. And as we said, each thread reads one value: within the row we use threadIdx.x, as that is the horizontal direction. Therefore: (DIM*row) + (i*TILE + threadIdx.x). Well, and at last: the host code. Mine is quite big, because I was testing the performance of the CPU, GPU and optimized GPU versions … Anyway, every GPU version needs to allocate space on the GPU, then copy the input matrices, launch the kernel and copy back the result matrix. If you would like to benchmark your CPU, do not use a bigger DIM than 1024 (well, 2048 is acceptable if you don't have anything else to do and you have a couple of coffees next to your keyboard …).
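My full benchmarking code is too long to paste here, but the GPU path boils down to the steps just listed. A minimal sketch (error checking omitted; it assumes the corrected Kernel signature shown earlier and a DIM that is a multiple of 16):

```cuda
// Minimal host-side flow: allocate, copy in, launch, copy out.
size_t bytes = DIM * DIM * sizeof(float);

float *h_a = (float*)malloc(bytes);
float *h_b = (float*)malloc(bytes);
float *h_c = (float*)malloc(bytes);
// ... fill h_a and h_b with input data ...

// allocate space on the GPU
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);

// copy the input matrices to the device
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

// launch: 16 x 16 threads per block, DIM/16 blocks per grid dimension
dim3 threads(16, 16);
dim3 blocks(DIM / 16, DIM / 16);
Kernel<<<blocks, threads>>>(d_a, d_b, d_c, DIM, DIM);

// copy the result matrix back to the host
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
free(h_a); free(h_b); free(h_c);
```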