QSwift CUDA Library with a QR Eigenvalue Solver
Today I would like to talk a bit about computing eigenvalues with CUDA, but before we begin, let me share the short story that made me take on this problem in the first place.
It all started with my diploma thesis. I found a few interesting subjects for it, but "fortunately/unfortunately" I left the fast computation of eigenvalues with CUDA behind and proceeded with a different topic about image processing. The reason was that I was new to eigenvalue computing and would have had to start from scratch, which is usually not a good idea when you plan to actually finish school. Still, the problem kept drifting in my mind for quite a while. I do not know exactly when, but in the end I decided to try to compute eigenvalues as fast as I could with my GPU (besides, it was just hanging around, occasionally computing a few pixels to display...). And here we are today.
I will start with a short math intro, but will keep it brief. The eigenvalues λ of a square full-rank matrix A are the roots of the characteristic polynomial det(A - λI) = 0, and for a small matrix they can be computed directly this way:
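As a concrete illustration, here is a small NumPy sketch (NumPy used here purely as a reference tool, not part of the library) that computes the eigenvalues of a 2x2 matrix straight from its characteristic polynomial and checks them against LAPACK:

```python
import numpy as np

# For a 2x2 matrix, det(A - lambda*I) = lambda^2 - tr(A)*lambda + det(A),
# so the eigenvalues are the roots of that quadratic.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
tr, det = np.trace(A), np.linalg.det(A)
disc = np.sqrt(tr**2 - 4.0 * det)
lam = np.sort(np.array([(tr - disc) / 2.0, (tr + disc) / 2.0]))
print(lam)                             # [2. 5.]
print(np.sort(np.linalg.eigvals(A)))  # same values from LAPACK
```

This directness is exactly what is lost for larger matrices, where the determinant is no longer cheap.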
This simple example is really straightforward because the determinant of the matrix is extremely easy to compute. Things, however, get really complicated with increasing rank of the matrix, so the eigenvalues of larger matrices must be computed in a different way. The answer is the so-called QR algorithm. Despite its simplicity, the QR algorithm can be quite time consuming, but because there is currently no better general solution for eigenvalue computation, I went ahead and implemented it. As I said, it is extremely easy. First of all, we have the Gram-Schmidt decomposition of a matrix: given a matrix A, the decomposition produces two matrices Q and R so that Q*R = A. I will get to the Gram-Schmidt decomposition later. Now the really funny part is that instead of multiplying the matrices in the specified order, we reverse them and form a new matrix A = R*Q. This process is iterated until all the eigenvalues show up on the matrix's diagonal. Again, damn easy to understand. The good thing is that matrix multiplication can be done in CUDA extremely fast. I have implemented my own CUDA kernel with shared memory to calculate the matrix-matrix multiply. It is not as fast as the cuBLAS implementation, but its speed is quite impressive anyway. I would say that matrix multiplication is the basic building block of the QR algorithm.
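The decompose-and-reverse iteration can be sketched in a few lines of NumPy (a reference sketch leaning on NumPy's own QR factorization, not the library's CUDA kernels; 20 iterations is the same ballpark figure used later in this post):

```python
import numpy as np

# Basic QR iteration: decompose A = Q*R, then form A = R*Q and repeat.
# R*Q = Q^T * A * Q is similar to A, so the eigenvalues are preserved
# while the matrix converges toward (upper) triangular form.
def qr_eigenvalues(A, num_iter=20):
    A = A.astype(float).copy()
    for _ in range(num_iter):
        Q, R = np.linalg.qr(A)
        A = R @ Q                  # multiply in reversed order
    return np.diag(A)              # eigenvalues show up on the diagonal

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])         # symmetric test matrix
print(np.sort(qr_eigenvalues(A)))  # matches np.linalg.eigvalsh(A)
```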
As you can guess, the problem comes with the decomposition of the matrix, which also has to be done iteratively. During decomposition, we subtract from each column its projections onto the previously orthogonalized columns. The first column therefore stays unchanged; from the second column we subtract its projection onto the first one; from the third column we subtract its projections onto the first and second columns; and so on up to the last column, from which the projections onto all previous columns are subtracted. This process is in fact orthogonalization. Let's take a look at the flowchart:
There are 2 bad things:
- Vector normalization, which includes a sum (reduction) operation. These kinds of operations are not very popular in CUDA, as they do not fully unveil the computing power of the GPU.
- More importantly, the decomposition has to be partially serialized. There are in fact 2 hidden for loops: one that goes across the columns of the matrix and a second one that computes the projections. If we concentrate on the first loop, we can see that once U0 is computed, its projection can be subtracted from all the other columns at once in parallel. Unfortunately, we then have to wait for this operation to finish before we can continue subtracting U1. What's more, these subtractions become less and less efficient due to the decreasing number of threads working on them: given a matrix of 1024x1024 elements, the first projection is subtracted with 1023*1024 threads, while the last one uses only 1024 threads. The slowdown is clearly visible. As a result, the only speedup comes from eliminating the first loop with CUDA.
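Both hidden loops are easy to see in a serial NumPy sketch of the decomposition (classical Gram-Schmidt; a reference sketch, not the library's CUDA code, and not numerically robust for ill-conditioned matrices):

```python
import numpy as np

# The outer loop walks the columns; the inner loop subtracts the
# projections onto the previously orthogonalized columns. The norm
# on the diagonal of R is the costly reduction mentioned above.
def gram_schmidt_qr(A):
    n = A.shape[1]
    Q = np.zeros_like(A, dtype=float)
    R = np.zeros((n, n))
    for k in range(n):
        u = A[:, k].astype(float).copy()
        for j in range(k):                 # subtract previous projections
            R[j, k] = Q[:, j] @ A[:, k]
            u -= R[j, k] * Q[:, j]
        R[k, k] = np.linalg.norm(u)        # sum reduction + sqrt
        Q[:, k] = u / R[k, k]
    return Q, R

A = np.random.randn(6, 6)
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A))               # True: Q*R reproduces A
```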
Also important is the total number of iterations of the QR algorithm. Each iteration includes 2 matrix-matrix multiplications and a large number of kernel launches. The maximum number of iterations must be specified; a reasonable amount is about 20. Another bad thing is the eventual need to check for convergence of the algorithm. Let's say the eigenvalues are perfectly clear after the first iteration (which is unlikely to happen): how do I know that the eigenvalues are no longer changing with the iteration count? The obvious thing to do is to check against some threshold, which I call the "precision". This value is the threshold for the sum of changes of all the eigenvalues during the last iteration. If the sum of the eigenvalue changes is smaller than the precision, I stop iterating. Again, the bad thing is that this check includes a sum operation as well.
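The "precision" stopping criterion can be sketched like this (again a serial NumPy reference, with the function name and defaults chosen for illustration only):

```python
import numpy as np

# QR iteration that stops early once the summed change of the
# diagonal (the emerging eigenvalues) drops below a threshold.
def qr_eigenvalues_prec(A, num_iter=100, precision=1e-3):
    A = A.astype(float).copy()
    for it in range(num_iter):
        Q, R = np.linalg.qr(A)
        A_next = R @ Q
        change = np.sum(np.abs(np.diag(A_next) - np.diag(A)))  # reduction
        A = A_next
        if change < precision:         # converged, stop iterating
            break
    return np.diag(A), it + 1

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
eig, used = qr_eigenvalues_prec(A)
print(np.sort(eig), used)              # converges well before 100 iterations
```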
Therefore, the overall speedup of the computation is not really overwhelming. In fact, my CUDA implementation is slower than the Matlab version, although I know I could get much closer or even compute faster with further code optimization.
Fortunately, Gram-Schmidt is not the only decomposition that can be used for eigenvalue computation. So-called tridiagonalization can be used as well. I have personally implemented the Householder transformation, which is in fact very promising for future optimizations. Another option is tridiagonalization using Givens rotations.
The Householder transformation is so far my favorite transform. Although the CUDA implementation is not super fast either, for almost the same reasons, it does have some special properties. First of all, fewer operations are required for the computation. The transformation iterates through all the columns as well, but each time eliminates both a row and a column, so the size of the active submatrix decreases during the computation. Wikipedia already contains all the information required for an implementation. It is important to note that 2 matrix multiplications are also required for each iteration.
The funny thing about my CUDA implementation is that it is one of the very few implementations where I launch just one thread during part of the computation. In a highly parallel environment this situation is quite unusual, but the reason becomes obvious after looking at the computation of the scalar parameters Ai, A21 and Ri. A CUDA user has 2 options: either copy the data back to the host and compute these parameters there, or, as I did, compute them on the GPU.
Even the Householder computation can be optimized; for example, the kernels that take care of choosing the right piece of the matrix could probably be removed by calculating matrix offsets. The code would, however, suffer in readability, and finding eventual bugs would become harder.
The last thing to note is the creation of the Householder matrix Px. Given a vector Vi, the operation Vi*Vi' is not, as I first thought, the (row*column) inner product, but the (column*row) outer product. The difference between the two orders of multiplication is huge: the first yields a scalar, the second a full matrix.
N = 250; Y = randn(N);
% REFERENCE MATLAB ALGORITHM
for i = 1:N-1
    % SELECT A PIECE OF THE MATRIX
    AP = Y(i:end,i:end);
    [Nyi,~] = size(AP);
    a21 = AP(2,1);
    ai = -sign(a21)*sqrt(sum( (AP(2:end,1)).^2 ));
    ri = sqrt(0.5 * (ai^2 - a21*ai));
    Vi = zeros(Nyi,1);
    Vi(1) = 0;
    Vi(2) = (a21 - ai)/(2*ri);
    for j = 3:Nyi
        Vi(j) = AP(j,1)/(2*ri);
    end
    Px = eye(Nyi) - 2*(Vi*Vi');
    APX = Px*AP*Px;
    % SAVE DATA TO THE ORIGINAL MATRIX
    Y(i:end,i:end) = APX;
end
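For readers without Matlab, the same reference algorithm can be transcribed into NumPy. This is a sketch mirroring the Matlab loop above; it assumes a symmetric input (so the result is tridiagonal rather than Hessenberg) and stops two columns early, since the trailing 2x2 block needs no elimination:

```python
import numpy as np

# Householder tridiagonalization: each iteration builds a reflector
# Px = I - 2*Vi*Vi' (an outer product!) and applies the similarity
# transform Px*AP*Px to the active submatrix.
def householder_tridiag(Y):
    Y = Y.astype(float).copy()
    N = Y.shape[0]
    for i in range(N - 2):
        AP = Y[i:, i:]
        a21 = AP[1, 0]
        ai = -np.sign(a21) * np.sqrt(np.sum(AP[1:, 0]**2))
        ri = np.sqrt(0.5 * (ai**2 - a21 * ai))
        Vi = np.zeros(AP.shape[0])
        Vi[1] = (a21 - ai) / (2 * ri)
        Vi[2:] = AP[2:, 0] / (2 * ri)
        Px = np.eye(AP.shape[0]) - 2.0 * np.outer(Vi, Vi)
        Y[i:, i:] = Px @ AP @ Px       # two matrix multiplications
    return Y

A = np.random.randn(6, 6)
A = A + A.T                            # make the input symmetric
T = householder_tridiag(A)             # tridiagonal, same eigenvalues as A
```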
The Givens rotation is another algorithm that can be used for tridiagonalization. It rotates a vector around a specific axis; the rotation is only 2-dimensional, although higher-dimensional modifications exist. The first step is to calculate the Givens rotation matrix G, which is almost an identity matrix E. The next step is to multiply it with the input matrix from the left: A = G*A. A Givens rotation affects only 2 rows, therefore many rotations are required to zero a specific column of the matrix. What's more, one has to start from the bottom left of the matrix. The result is, of course, identical to the Householder transform.
Although the matrix-matrix multiply can be highly optimized here to touch only 2 rows, the truth is that the serialization cannot really be overcome in this case. I have not even bothered implementing this transform in CUDA, as it is not well suited to the CUDA parallel environment. The Matlab implementation is given below:
N = 250; % SIZE OF THE MATRIX
A = randn(N); % GENERATE RANDOM MATRIX
% LOOP ACROSS COLUMNS
for i = 1:N-1
    disp(['Processing: ',num2str(i),' / ',num2str(N-1)]);
    % LOOP ACROSS ROWS! MUST START FROM BOTTOM UP
    for j = N-1:-1:i
        CG = eye(N);
        % FORM GIVENS ROTATION MATRIX - AFFECTS ONLY 2 ROWS
        ac = A(j,i);
        bc = A(j+1,i);
        [c,s,r] = GivensRotCoef(ac,bc);
        % These are the default coefs
        %r = sqrt(ac^2 + bc^2);
        %c = ac/r;
        %s = -bc/r;
        % CG IS THE GIVENS ROTATION MATRIX
        CG(j,j) = c;
        CG(j,j+1) = -s;
        CG(j+1,j) = s;
        CG(j+1,j+1) = c;
        A = CG*A;
    end
end
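The same elimination loop transcribed into NumPy, with the helper GivensRotCoef inlined using the "default coefs" from the comments above (a reference sketch only). Each rotation mixes only rows j and j+1 and zeroes the entry A[j+1, i] below it:

```python
import numpy as np

# One-sided Givens elimination, column by column, bottom up.
def givens_eliminate(A):
    A = A.astype(float).copy()
    N = A.shape[0]
    for i in range(N - 1):                   # loop across columns
        for j in range(N - 2, i - 1, -1):    # loop across rows, bottom up
            a, b = A[j, i], A[j + 1, i]
            r = np.hypot(a, b)
            if r == 0.0:
                continue                     # nothing to rotate away
            c, s = a / r, -b / r             # the "default coefs"
            rj, rj1 = A[j, :].copy(), A[j + 1, :].copy()
            A[j, :]     = c * rj - s * rj1   # row j of G*A
            A[j + 1, :] = s * rj + c * rj1   # row j+1 of G*A, zeroed in col i
    return A

R = givens_eliminate(np.random.randn(6, 6))
print(np.allclose(np.tril(R, -1), 0))        # True: lower triangle zeroed
```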
QSwift CUDA Library
Because creating an example file for each of the methods above would be time-consuming, I decided to finally write a whole library in C++/CUDA with a Matlab wrapper. As I said before, it still lacks some optimizations, but more important for me was that it finally provides valid results. The built-in wrapper is designed for Matlab R2016a (.lib, .h and .dll files are available as well for C++ applications). The library comes only in a 64-bit version and is powered by the CUDA 8.0 runtime. To use it, make sure that the Microsoft redistributables for Visual Studio 2015 are installed. The only supported format is single precision (float). Also note that only nVidia CUDA-enabled graphics cards with the Kepler / Maxwell / Pascal architectures are supported (Compute Capability 3.0+). If you are planning to use the lib in C / C++, bear in mind that the library uses row-major order. The list of implemented methods follows:
As was previously discussed, the matrix-matrix multiplication is the core of the QR algorithm. It is a very compute-intensive operation and a good candidate for measuring performance against Matlab. I have modified the CUDA kernel code from one of my previous posts to allow for non-square matrix multiplication, though the library itself currently exposes square multiplications only; this feature has yet to be added. The MM multiplication uses CUDA shared memory with 32x32 threads per block.
C = QSwiftMatlab('IData1',single(A),'IData2',single(B),'Type','MatrixMatrixMul');
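The tiling idea behind the shared-memory kernel can be illustrated in NumPy (an illustration of the blocking scheme only, not the CUDA code): the output is computed in 32x32 blocks, and each block accumulates partial products of matching 32x32 tiles of the inputs, which are what the kernel stages in shared memory.

```python
import numpy as np

# Blocked matrix multiply; NumPy slicing clamps at the edges, so
# non-multiple-of-32 sizes work too.
def tiled_matmul(A, B, tile=32):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):    # accumulate over the K tiles
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.randn(96, 96), np.random.randn(96, 96)
print(np.allclose(tiled_matmul(A, B), A @ B))  # True
```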
One can also compute the Gaussian elimination of a matrix with CUDA. The implementation zeroes the lower-left triangle of the matrix, which does not have to be square or full rank. The elimination runs in the Y dimension, which allows the user to append the right-hand side of a system of equations as an extra column and easily find the solution. I have chosen this implementation because the method is also used in the matrix inversion algorithm. As with all methods currently implemented, the library supports single precision only.
C = QSwiftMatlab('IData1',single(A),'Type','GaussElim');
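A serial NumPy sketch of the idea: forward elimination zeroes the lower-left triangle of the matrix with the right-hand side appended as an extra column, and back substitution then reads off the solution (no pivoting, so nonzero pivots are assumed; a reference sketch, not the library's kernel):

```python
import numpy as np

def gauss_solve(A, b):
    n = A.shape[0]
    # append b as an extra column of the system
    M = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])
    for i in range(n):                     # zero the entries below pivot i
        for r in range(i + 1, n):
            M[r, :] -= (M[r, i] / M[i, i]) * M[i, :]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):         # back substitution
        x[i] = (M[i, -1] - M[i, i+1:n] @ x[i+1:n]) / M[i, i]
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
print(gauss_solve(A, b))                   # [0.8 1.4]
```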
As was said before, Gaussian elimination is used for matrix inversion. During the first stage of the algorithm, an identity matrix is appended to the right of the input matrix. Gaussian elimination in the forward and backward directions follows. The last stage is normalization, which makes sure that the initial matrix A becomes the identity E and the appended matrix becomes A^-1.
QRef = QSwiftMatlab('IData1',single(A),'Type','MatrixInverse');
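The three stages map directly onto a short NumPy sketch (again without pivoting, so nonzero pivots are assumed; a reference sketch of the scheme, not the CUDA implementation):

```python
import numpy as np

# Gauss-Jordan inversion: append the identity, eliminate forward and
# backward, then normalize so the left half becomes the identity and
# the right half becomes inv(A).
def gauss_jordan_inverse(A):
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.eye(n)])
    for i in range(n):                     # forward elimination
        for r in range(i + 1, n):
            M[r, :] -= (M[r, i] / M[i, i]) * M[i, :]
    for i in range(n - 1, -1, -1):         # backward elimination
        for r in range(i - 1, -1, -1):
            M[r, :] -= (M[r, i] / M[i, i]) * M[i, :]
    for i in range(n):                     # normalization
        M[i, :] /= M[i, i]
    return M[:, n:]

A = np.array([[4.0, 7.0], [2.0, 6.0]])
print(gauss_jordan_inverse(A))             # matches np.linalg.inv(A)
```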
As expected, QSwift is also capable of performing the Gram-Schmidt decomposition. As previously stated, the implementation is not ultra fast, but again, there is plenty of room for improvement. The output consists of two matrices [Q,R] so that the input matrix A = Q*R. Only square matrices are supported by default.
[Q,R] = QSwiftMatlab('IData1',single(A),'Type','GramSchmidt');
This algorithm was discussed earlier in the post; it is the most promising method for matrix tridiagonalization and, as Wikipedia suggests, it can serve as the first step of the QR algorithm. This was not tested, though, and the built-in eigenvalue solver still uses the Gram-Schmidt decomposition only. The output is the tridiagonal matrix. Again, only square matrices are supported.
Tria = QSwiftMatlab('IData1',single(A),'Type','HauseHolder');
The implemented solver is capable of returning real matrices only; complex numbers are currently not supported. The convergence criterion of the QR algorithm can be specified, but the total number of iterations is still required. If the convergence criterion is met in a lower number of iterations, the result is returned without additional processing. The QR solver uses the Gram-Schmidt decomposition and is generally slower than the Matlab implementation.
EiCuda = QSwiftMatlab('IData1',single(A),'Precision',1e-3,'NumIter',NumIter,'Type','Eigenvalues');
Jacobi Iteration Method
This is one of the methods used for solving systems of linear equations. It is not as efficient as successive over-relaxation, but it does its job. In fact, setting the omega factor to 1 in SOR yields the Jacobi iteration method. An optional initial guess can be passed into the method; the closer the guess is to the real result, the faster the method converges.
Qref = QSwiftMatlab('Type','Jacobi','IData1',A,'IData2',B,'NumIter',NumIter,'Precision',1e-3);
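A serial NumPy sketch of the Jacobi iteration (a reference sketch with illustrative defaults, not the library's code): every component is updated from the previous iterate only, which is what makes the method attractive for a GPU, and it converges for diagonally dominant systems, for example.

```python
import numpy as np

def jacobi(A, b, num_iter=100, precision=1e-6, x0=None):
    D = np.diag(A)                        # diagonal part
    R = A - np.diag(D)                    # off-diagonal remainder
    x = np.zeros_like(b, dtype=float) if x0 is None else x0.astype(float)
    for _ in range(num_iter):
        x_new = (b - R @ x) / D           # all components in parallel
        if np.sum(np.abs(x_new - x)) < precision:
            return x_new                  # converged early
        x = x_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # diagonally dominant
b = np.array([1.0, 2.0])
print(jacobi(A, b))                       # close to np.linalg.solve(A, b)
```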
Successive Over-Relaxation Iteration Method
This is my favorite numerical method for solving a system of linear equations, and the fastest one as well. An omega factor can help to speed up the computation. Its value lies between 1 and 2; the recommended value is about 1.3, though it depends on the particular system. An optional initial guess can be passed to the routine; when nothing is passed, the initial guess is set to a vector of zeros.
QRef = QSwiftMatlab('Type','Sor','IData1',A,'IData2',B,'NumIter',NumIter,'Omega',Omega,'Precision',Precision);
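For reference, here is a textbook serial SOR sketch: a Gauss-Seidel-style sweep blended with the previous iterate through the relaxation factor omega (the classical formulation; the library's exact update scheme on the GPU may differ):

```python
import numpy as np

def sor(A, b, omega=1.3, num_iter=100, precision=1e-6, x0=None):
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    for _ in range(num_iter):
        x_old = x.copy()
        for i in range(n):
            # already-updated components below i, old ones above i
            sigma = A[i, :i] @ x[:i] + A[i, i+1:] @ x_old[i+1:]
            x[i] = (1 - omega) * x_old[i] + omega * (b[i] - sigma) / A[i, i]
        if np.sum(np.abs(x - x_old)) < precision:
            break                          # converged early
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(sor(A, b))                           # close to np.linalg.solve(A, b)
```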
Finally, if you have some questions or requests, please let me know or write a comment. I am always listening :)
Includes: Matlab demos and simple documentation of usage.
Changes from 1.1.2:
- Added Jacobi Iteration Solver Method
- Added Successive Over Relaxation Method
- Improved Performance of several kernels
- Greatly improved code scalability
- New memory engine
- Added Vectorized Input
- Bug Fixes
Changes from 1.1.0:
- Bug fixes
- CuBLAS Acceleration added