oday,I am going to show you how to use your GPU for some basic school examples of GPS signal processing. Although I am right now only interested in GPU computing, I will have to post some very basic info about GPS.

First of all,you need to know how a GPS works (At least a little). There are some satellites on the sky orbiting the earth at an altitude of about 20 300km and a speed of 3,8Km/s. Each satellite transmits a signal coded with a unique sequence of PRN (pseudorandom noise) derived form the gold code. In short, its a sequence of 1023 pseudorandom numbers, that consist only of -1 & +1. This sequence has a high autocorrelation peak and very good crosscorrelation properties between the other codes. This property allow us to use the same transmission frequency for all 32 satellites and distinguish them via PRN sequence (This is known as Code Division Multiple Access – CDMA). There are currently 37 sequences,however only 32 of them are assigned to satellites. This sequence is also known as ranging code, due to the fact, that we can calculate the delay of the signal propagation based on searching for the autocorrelation peak. Now the GPS signal (after a few modulations) is transmitted on a fixed frequency in L1 (about 1,5GHz) and due to the fact, that the satellite moves,we receive not the signal of the same frequency, but with a different frequency. This new frequency can be higher or lower than the transmitted one. This depends on the mutual movement of the user and the satellite and you can easily search wiki or whatever for “Doppler Effect”.

Now when we want to calculate our position,we need to know the time it takes the signal to reach our receiver. Knowing this, we obtain a spherical surface of our estimated position. By calculating 2 delays from 2 different satellites,we obtain a circle of our estimated posittion. Theoretically,we need 4 satellites in view (4th for the time correction as we assume,that the clock on GPS receiver  is not as accurate as the clock on the GPS satellite ), however this is the real minimum and the real number of needed satellites is around 6.

Our example signal is modeled as follows:



  • x(t) – modeled GPS signal
  • n(t) – white noise
  • c(t) – prn ranging code
  • A – signal amplitude
  • Epsilon – GPS signal delay
  • Fi – starting phase
  • fd  -doppler offset

Our Sampling frequency is 6.600.000Hz and our input data are the complex envelope of the signal. Our task is to find a GPS delay and a GPS Doppler Frequency Offset caused by the satellite movement. The suggested interval for searching the Doppler offset is -6000 Hz to + 6000Hz with a step of 100Hz.

To solve this,we will use the Maximum Likelihood (ML) formula:




  • X[k] is our complex GPS input signal
  • c[k] is our PRN signal
  • Tsa is the sampling period (1/fsa)
  • chart4 is our frequency search step
  • o ~ Doppler Frequency Estimate
  • m ~GPS Delay Estimate

The code is very simple and straightforward  in matlab to implement. Everything seems to be fine,you have double precision and you dont have to care about complex multiply or whatever. There is however something,you will definitely not like. Its the computation time. On my own desktop with a core i5-750 processor running on 3.2GHz,it takes about 35 minutes to calculate where the maximum is located. You can make some optimization directly in matlab,However I am not a great fan of matlab anyway because it indexes from 1 unlike c/c++ where index starts from 0, so I just decided to translate the code to CUDA and use my Geforce GTX 560ti with 384 Stream Processors to solve this task. So lets take a look at the core of the matlab implementation first:

for row = 1:length(GPS.signal)                 %TIME
    for col = 1:(interv/dfd)+1                 %FREQ
        % FOR instead of SUM
        res = 0;
        for k = 1:1:length(GPS.signal)
            res = res + GPS.signal(k)*exp(-j*2*pi*(col*dfd-6100)*1/GPS.fsa*k)*prn(k - row + length(GPS.signal));
        SearchMat(row,col) = res;

As can you see, the code is very easy to undestand. We have 2 outer cycles that sweep across time and frequency domain and 1 additional cycle, that represents a sum operation. The problem with matlab is however that you dont have to care about complex multiply and precision since matlab uses default double precision (As I wrote a few lines aerlier) …. So now onto the equation: We know that GPS.signal is a complex sequence of numbers and that a prn is a real sequence of numbers. So basically what I was thinking about at first is that I will have to implement a basic device operations for Complex multiply and complex addition:

//Device Complex Multiply
__device__ cufftComplex C_Mul(cufftComplex a,cufftComplex b){
	cufftComplex temp;
	temp.x = a.x*b.x - a.y*b.y;
	temp.y = a.y*b.x + b.y*a.x;
	return temp;

//Device Complex Add
__device__ cufftComplex C_Add(cufftComplex a, cufftComplex b){
	cufftComplex c;
	c.x = a.x + b.x;
	c.y = a.y + b.y;
	return c;

And as you can see I am very lazy person, so have I imported the cufftComplex structure in order to easy my code. The complex addition is ultra easy to understand and if you are not sure what the C_mul() function does. Just take a paper and multiply A*B yourself. Where A= a +bi and B = c +di. Do not forget that i*i = -1! :) … Now the second thought that came to my mind is that we can use the euler formula and write:


 So in our example: 


Also note that we will use index -6000 since c++ starts at index ,,0″ as I already mentioned. Now you can ask … alright we know how to calculate everything,but OMG,how can we paralelize such a code? Actually … Its damn easy. And I can say generally, that porting a code to cuda is somewhat easy once you have at least one for cycle in your code host code. Fortunately, we have the luxury of 2 for cycles! Thats pretty awesome, since we can launch a 2D grid on the GPU and get rid of both of them! :D And an addtional recommendation is to launch 1024 x 1 threads per block. Doing so, you can specify the y dimension as the number of blocks in y dimension and you dont have to check against dimenzionality of the thread in y dimension in ther main kernel :)

Ok,so now when everything is pretty clear,we start with the main host code:

        //Copy input
	HANDLE_ERROR(cudaEventRecord(start, 0));
	HANDLE_ERROR(cudaMemcpy(dev_GPS, host_GPS, Len*sizeof(cufftComplex), cudaMemcpyHostToDevice));
	HANDLE_ERROR(cudaMemcpy(dev_prn, host_prn, PrnLen*sizeof(float), cudaMemcpyHostToDevice));
	HANDLE_ERROR(cudaMemcpy(dev_add, host_add, 10 * sizeof(int), cudaMemcpyHostToDevice));

	//Define dimensions
	dim3 THREADS_PER_BLOCK(1024, 1, 1);
	dim3 BLOCKS_PER_GRID(Len / 1024 + 1, FDInterv / FreqDif + 1, 1);

	//Launch Main Kernel
	for (int i = 0; i < (Len / Swiftstep + 1); i++){
		GPS << <BLOCKS_PER_GRID, THREADS_PER_BLOCK >> >(dev_Out, dev_GPS, dev_prn, dev_add,i);
		cout << "calculated: " << i*Swiftstep << " Rows" << endl;

	//Launch Abs Kernel
	ABS << <BLOCKS_PER_GRID,THREADS_PER_BLOCK >> >(dev_Out, dev_res_abs, dev_add);

	//Copy output
	HANDLE_ERROR(cudaMemcpy(host_res_abs, dev_res_abs, Len*numRows*sizeof(float), cudaMemcpyDeviceToHost));
	HANDLE_ERROR(cudaEventRecord(stop, 0));
	HANDLE_ERROR(cudaEventElapsedTime(&elapsedTime, start, stop));

	//Look for correlation maximum on the CPU
	float max = 0.0f;
	int Tind = 0; int Find = 0;
	for (int i = 0; i < numRows; i++){
		for (int j = 0; j < Len; j++){
			int index = i*Len + j;
			if (host_res_abs[index] > max){
				max = host_res_abs[index];
				Tind = j;
				Find = i;


As you already know, each host code consist of some memcopy operations, althought I have added an event to know how fast we have processed the calculation. In our case, we are copying the GPS signal along with the prn sequence and some other params (fsa,signal length …) specified in dev_add to the GPU. Then we define the dimensions of the Grid and launch our kernel (Forgett the for cycle now). You can see, that we are also launching another kernel called ABS. this due to the fact, that we need the absolute value of the complex number. In case you need to take a look at the ABS kernel function:

//Just make absolute value out of complex numbers
__global__ void ABS(cufftComplex *in, float *out,int* add){
	int tix = threadIdx.x + blockIdx.x*blockDim.x;   //row
	int tiy = threadIdx.y + blockIdx.y*blockDim.y;   //col
	int tid = tiy*gridDim.x*blockDim.x + tix;        //Global Id
	int N1 = add[0];                                 // GPS Signal Length

	//Can we process the data
	if (tix < N1){
		int index = N1*tiy + tix;
		float res = sqrtf(in[index].x * in[index].x + in[index].y * in[index].y);

		//Save result to Global Memory
		out[index] = res;


Well thats is … we just calculate the sum of the square roots of the real and imaginary data,calculate an sqrtf and return values (as float) to the GPU memory. After we have computed the ABS value, we can copy data back to the CPU memory and search for the peak on the host. Although you can search for the peak on GPU,its much easier to search for it on the host. This operation is very fast compared to the equation calculation itself,so there will be only a very small delay,of which you dont have to care about.

Now onto the most interesting part: The GPU code itself:

//Main GPS Kernel
__global__ void GPS(cufftComplex *Output,cufftComplex *GPS,float*Prn, int *add,int step)
	int tix = threadIdx.x + blockIdx.x*blockDim.x;   // row
	int tiy = threadIdx.y + blockIdx.y*blockDim.y;   // col
	int tid = tiy*gridDim.x*blockDim.x + tix;        // Global Id
	int N1 = add[0];                                 // GPS Signal Length
	float Tsa = powf(add[1],-1.0f);                  // Tsa
	int FreqDif = add[2];                            // Frequency search difference
	int SwiftStep = add[3];                          // How many calculations per kernel call

	//Can we calculate
	if (tix < N1){
		int index = N1*tiy + tix;
		cufftComplex res;

		if (step == 0){
			res.x = 0.0f;
			res.y = 0.0f;
			//Load from memory
			res = Output[index];

		// What are we calculating?
		int i = step*SwiftStep;
		int imax = i + SwiftStep;
		if (imax > N1){ imax = N1; };

		//Main computing routine
		for (i; i < imax; i++){
			cufftComplex temp,temp1;
			float t1 = -2.0f * M_PI * i * Tsa * ((FreqDif * tiy) - 6000);
			temp.x = cosf(t1) * Prn[N1+i-tix];
			temp.y = sinf(t1) * Prn[N1+i-tix];

			temp1 = C_Mul(GPS[i], temp);
			res = C_Add(res,temp1);

		//Save result to Global Memory
		Output[index] = res;


First of all, I will have to return to the host code and tell you why we were launching this kernel in cycle. As you can guess, the computation keeps the GPU very busy (at about 99% load) and eventought its peaking its maximum GFLOPS, we cant compute everything withing 2 seconds, which is the default kernel-time out on windows. In short, you cant write a code, that will will keep the GPU busy for more than 2 sec (Or whatever that value is). Thus we need to split the execution into multiple parts. Thats why we define a variable called SwiftStep on the host and set it to some small number (20+-). This number defines how much work we will process in 1 kernel launch so that windows will never restart graphics drivers. If you look at the matlab code, note that there is an extra for cycle that represents summation. In fact, what we are doing is that we are just splitting this summation. Lets take a closer look on the code:

 if(step ==0 ) means that this is the first step and therfore, we dont have any part results yet. Thus we set res = 0.
 However if step != 0 means that we are now processing the summation from “some” index and therefore we need to know the previous number, which is stored in the global memory, so we load it and continue with summation.
Then we have to compute the start and end index (i,imax) and check that imax is not bigger than our total number of input samples (GPS.length).
Once we know where to start and where to stop,we can finally proceed with summation in the GPU for cycle that runs SwiftStep times. This constant is defined in the host code and transfered in the dev_add array.
 At first, we compute argument for both cos() and sin() parts of the equation. Alternatively,we can remove minus from cos(). We can do this because cos() is an even function. You can try this on your own. Then we just scale both the real and the imaginary part with prn().
Complex multiply with the GPS signal follows.
 Complex addtion to the result variable follows.
 Save result back to memory.

If this was the last step, ABS kernel launch will follow.

If this wasnt the last step, this result will become the input for the next step.

Mine GPU finishes the work in about  7s …  And I believe its pretty cool :)

I just have to note, that I was experiencing some real problems at first. The t1 was filled with zeros across all threads and I didnt know why. So I started printing each variable and its value and found out,that Tsa was zero everytime. By that time Tsa was defined as 1/fsa, which is correct of course. But this operation maybe too risky in C++/CUDA and resulted in zero (note that the denominator – fsa  – is 6 600 000). I have solved this problem by using the powf(x,-1.0f) function. The results of the calculation was tested again in matlab and proved to be almost the same with just minor differences caused by the fact that we were using single precison (float) during the execution.


Here is the source code for the whole project: VS2013CZSCUDA

  •  VisualStudio 2013 Project
  • Testing script for matlab
  • Whole code for saving and calculating the same in matlab
  • GPS input files called signal_GPS_05.mat & signal_GPS_04.mat
  • Images of the output

I do not own any rights on signal_GPS_04/05 files.

They were a part of a project on FEL.CVUT.CZ.

I will delete those files on request.

EDIT 2.2.2015: Added a release version :) Here

The Release version is doing the same work about 6 times faster ;)