ello again! Although I usually don’t have much time writing, I do occasionally see an interesting post or question regarding something I like. As I was browsing facebook, a question came to the Matlab forum I visit from time to time.It was not exactly a question to be honest, but a request for aid regarding computing of the Mandelbrot set, which is one of the famous fractal sets and in fact nothing more, that just evaluating some function in the complex plane over and over again until we decide for each point, whenever it lies in the set or it doesn’t. Basically, one defines a starting value for each point in the plane and then iterates a function of a form Z(n+1) =   Z(n)*Z(n) + C. This is the exact formula for computing the Julia set instead, but both the Mandelbrot and Julia sets are very similar in term of computing. Here, C defines a complex constant, which is added in each iteration to point lying somewhere in the complex plane. As it happens, its a numerical iteration-based solving of the set. One could create an infinite iteration loop, but then he would never see the result. It is therefore always a good idea to specify 2 things:

  • Maximum amount of iterations to be processed.
  • Iteration threshold – a function value which when crossed specifies that the point is  / is not a part of the set.

The iteration is usually stopped when any of the above is asserted. Either we have reached maximum amount of iterations or the function value has crossed the Threshold. The complex plane is usually defined from range -2 to 2 on both the real and imaginary axis. This is because values greater than 2 or -2 taken to the power of 2 always tend to go to infinity in this case and there is nothing special on them. To visualize the set, it is therefore necessary to evaluate the function in each point in the area of interest. In our basic case between -2 and 2 on real and -2 and 2 on the imaginary axis. Lets say we have a resolution of 1000 points in both real (x) and imaginary planes (y),this gives us a total of 1000*1000 = 1e6 points to be evaluated in order to visualize the set. Not precisely though as some points lying on the set edges require quite a large amount of iterations, lets say 100 for example. This gives us 1e6*100 = 1e8 complex multiplications and additions in order to visualize the set. This is quite a huge number even for modern multi-core computers.

Main CUDA Kernels


 * @brief Initializes the Julia And MandelBrot Computations 
__global__ void JuliaMandelInitialize(cufftComplex *A,float xStart,float xStop,float yStart,float yStop,int Nx,int Ny) {
	int tix = threadIdx.x + blockIdx.x * blockDim.x;
	int tiy = threadIdx.y + blockIdx.y * blockDim.y;
	if(tix < Nx && tiy < Ny) {
		int tid = tix + tiy * Nx;
		cufftComplex Temp;
		Temp.x = xStart + tix*(xStop - xStart) / (Nx - 1);
		Temp.y = yStart + tiy*(yStop - yStart) / (Ny - 1);
		A[tid] = Temp;

 * @brief Main Julia Kernel
 * @param A: Julia Function Values
 * @param C: Custom Julia Set Complex Constant
 * @param Iterations: Amount of iterations processed for each point
 * @param Threshold: Threshold for computing
 * @param NumIter: Amount of Iterations per kernel call
 * @param Nx X Dimensioon of the Grid
 * @param Ny Y Dimension of the Grid
__global__ void JuliaKernel(cufftComplex *A,cufftComplex C,int *Iterations, float Threshold, int NumIter,int Nx,int Ny){
	int tix = threadIdx.x + blockIdx.x * blockDim.x;
	int tiy = threadIdx.y + blockIdx.y * blockDim.y;
	if(tix < Nx && tiy < Ny) {
		int tid = tix + tiy * Nx;

		// Load from global memory
		cufftComplex LA = A[tid];
		int IterProcessed = Iterations[tid];

		// Chunk Iteration Processing Loop
		for(unsigned int i = 0;i < NumIter;i++) {
			cufftComplex Res = LA * LA + C;
			cufftReal Absolute = abs(Res);

			if (Absolute < Threshold) {
				LA = Res;

		// Upload to Global Memory
		A[tid] = LA;
		Iterations[tid] = IterProcessed;

 * @brief Computes the Mandebrot set
 * @param C - Mandelbrot Constant
 * @param X - Coordinates
 * @param Y - Coordinates
 * @param Iterations - Iterations processed for a given point
 * @param Threshold - Iteration Threshold
 * @param NumIter - Amount of Chunk Iterations
 * @param Nx - Amount of Points in the X - Dimension
 * @param Ny - Amount of points in the Y - Dimension 
__global__ void MandelBrotKernel(cufftComplex *C,cufftReal *X,cufftReal *Y,int *Iterations,float Threshold,int NumIter,int Nx,int Ny) {
	int tix = threadIdx.x + blockIdx.x * blockDim.x;
	int tiy = threadIdx.y + blockIdx.y * blockDim.y;
	if(tix < Nx && tiy < Ny) {
		int tid = tix + tiy * Nx;

		cufftComplex Cl = C[tid];
		cufftReal Xl = X[tid];
		cufftReal Yl = Y[tid];

		int IterProcessed = Iterations[tid];
		for (unsigned int i = 0 ; i < NumIter;i++) {
			cufftReal Xn = Xl * Xl - Yl * Yl + Cl.x;
			cufftReal Yn = 2 * Xl*Yl + Cl.y;
			cufftReal Res = Xn * Xn + Yn * Yn;
			if(Res < Threshold) {
				Yl = Yn;
				Xl = Xn;

		X[tid] = Xl;Y[tid] = Yl;
		Iterations[tid] = IterProcessed;


Now the problem is, that computing the set and displaying the set in Matlab is quite easy, however because of extra overhead of the Matlab language, one can easily run into a situation, where the evaluation of the set in this resolution will take tenths of minutes if not hours for large resolutions such as 2K or 4K. To illustrate the problem: there are 3 loops! The outermost one, which loop across iterations and 2 inner loops looping across the set’s X and Y – dimensions. writing the ultra inefficient will result in 3 for-loops, which will be sequentially executed on a single CPU core and not all available CPU cores. In fact using for-loops is highly not recommended for Matlab (Although sometimes just has to be used). An alternative solution is to use Matlab’s great vectorization technique as Matlab stand in fact for “Matrix Laboratory”. To do the vectorization, we just compute “all” values at once for a single iteration. To do so,we define the Initial value as a matrix and the constant as a matrix as well, when evaluating the function,we use the element-wise operations on matrices and vectors, which are very likely executed on several CPU cores are also very likely optimized with vector instructions such as SSE or AVX. By doing this, we can remove the 2 loops looping across the X and Y dimensions. The last iteration loop just has to stay as the results of the the next iteration is dependent on the previous result and although millions of points can be executed in parallel, their new values can only be computed as soon as their previous function values are known. This is overall known as data-dependency.

For those,who want to take a quick look at the performance and the demo, I have prepared a small demonstration video available on youtube. Unfortunately, the audio stream is the one I had in background while recording the screen and it has been captured as well. It has nothing to do with the demo. For those who wonder what those songs are: #1 Sudha feat. Zoe Johnston – Leche – Thomas Schwartz #2 Taylor Swift – This Love from album 1989.



Update 2018.01.24:

Because some people found my work inspiring,I decided to add some new functionality. First of all,the image can be panned to any position. There is no longer a need to zoom in & out in order to view a specific part of the plane. Furthermore,user may change the colormap of the image. I have added my favorite colormaps (morgenstemning), but new and custom ones can be added easily. At last, the CUDA mexfile now supports Image filtering with a custom Image filter (Kernel). This has been added as an alternative to slow CPU processing and mainly to enhance the image. Especially when high amount of iterations has been requested. There are however still some improvements that may be added in a later release:

  • Switch from Julia Real / Imag to Absolute value and Angle.
  • Use double-precision processing in CUDA.
  • Try to decrease memory usage in order to process higher resolutions.
  • Fix CUDA based Image filter,which now crops the requested image borders.

The last specified problem can likely be solved easily in Visual Studio. It is only visible when the filtering kernel is approximately the same size as the input Image.