Old CUDA programmers (like me) are used to using the cudaMemcpy() functions to transfer data to and from the GPU. Copying the data over PCI-Express is a necessary part of almost every computation. I have written countless functions and algorithms that include some sort of cudaMemcpy() call; the only exception might be a benchmarking algorithm that runs on random data. Nevertheless, the nVidia CUDA Runtime offers several other methods for copying data, including cudaMemcpyAsync() or cudaMemcpyToSymbol() {my favorite ;) }. Since CUDA 6 and the nVidia Kepler architecture, however, there is an option to use unified memory.
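To recall what the classic workflow looks like, here is a minimal sketch of the explicit-copy style. The kernel name Scale and the sizes are just made up for illustration:

```cuda
// Classic explicit-copy workflow: allocate on both sides, copy over,
// compute, copy back. (Hypothetical Scale() kernel doubles each element.)
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void Scale(float *Data, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) Data[i] *= 2.0f;
}

int main(void)
{
    const int N = 1024;
    size_t Size = N * sizeof(float);

    float *HostData = (float*)malloc(Size);
    for (int i = 0; i < N; i++) HostData[i] = (float)i;

    float *DeviceData;
    cudaMalloc((void**)&DeviceData, Size);

    // Host -> Device copy over PCI-Express
    cudaMemcpy(DeviceData, HostData, Size, cudaMemcpyHostToDevice);

    Scale<<<(N + 255) / 256, 256>>>(DeviceData, N);

    // Device -> Host copy to read the results back
    cudaMemcpy(HostData, DeviceData, Size, cudaMemcpyDeviceToHost);

    printf("HostData[10] = %f\n", HostData[10]);

    cudaFree(DeviceData);
    free(HostData);
    return 0;
}
```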

The best introduction to unified memory is the fact that the user no longer has to distinguish between the host and the device address space. Simply one pointer to rule them all, one pointer to find them, one pointer to bring them all… and in the darkness bind them :D Now honestly, you can use a single pointer in both device and host functions. The driver automatically takes care of the data copying as well, so nobody has to write it by hand. On one hand this is a great feature, because the code becomes easily readable. On the other hand, people should be aware that there still IS a hidden data copy over PCI-Express going on behind the scenes. It's just not visible.


The functions to use:

  • cudaMallocManaged((void**)&UnifiedPointer, SomeSize, cudaMemAttachGlobal); // SomeSize is the allocation size in bytes (size_t)
  • cudaFree(UnifiedPointer);

Once the memory is allocated this way, the user can launch and combine any number of CPU/GPU functions that modify the array's data. I am looking forward to using this feature as soon as I get a new laptop with a greater compute capability than CC 2.1 (my good old Fermi O:) ).
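A sketch of how that mixing of CPU and GPU code might look with a managed allocation. The Scale kernel is again a made-up example; note the cudaDeviceSynchronize() before the host reads the data back, since the hidden copy is not done until the kernel is:

```cuda
// Unified-memory sketch (CUDA 6+, CC 3.0+): one pointer works in both
// host and device code, no explicit cudaMemcpy() anywhere.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void Scale(float *Data, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) Data[i] *= 2.0f;
}

int main(void)
{
    const int N = 1024;
    float *UnifiedPointer;
    cudaMallocManaged((void**)&UnifiedPointer, N * sizeof(float),
                      cudaMemAttachGlobal);

    // The host writes through the very same pointer...
    for (int i = 0; i < N; i++) UnifiedPointer[i] = (float)i;

    // ...and the kernel reads and writes it too; the driver migrates
    // the data over PCI-Express behind the scenes.
    Scale<<<(N + 255) / 256, 256>>>(UnifiedPointer, N);

    // Wait for the GPU before touching the data from the CPU again.
    cudaDeviceSynchronize();
    printf("UnifiedPointer[10] = %f\n", UnifiedPointer[10]);

    cudaFree(UnifiedPointer);
    return 0;
}
```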

  • *Unified addressing is available only on 64-bit platforms.
  • ** Use cudaGetDeviceProperties() and query Prop.unifiedAddressing to check for support of this feature (Prop.managedMemory reports unified memory support).
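The capability check above can be sketched like this, querying device 0 for both flags:

```cuda
// Query device 0 for unified addressing (UVA) and managed (unified)
// memory support via cudaGetDeviceProperties().
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp Prop;
    cudaGetDeviceProperties(&Prop, 0);   // device 0

    printf("Unified addressing: %d\n", Prop.unifiedAddressing);
    printf("Managed memory:     %d\n", Prop.managedMemory);
    return 0;
}
```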

That's it for today, cheers :)