The TotalView CUDA Debugging Model
The address space of the Linux CPU process and the address spaces of the CUDA threads are placed into the same share group. Breakpoints are created and evaluated within the share group, and apply to all of the image files (executable, shared libraries, and CUDA ELF images) in the share group.
As a result, a single breakpoint can apply to both CPU and GPU code. You can therefore set a breakpoint on a source line in the host code, and TotalView plants it at the same location in the CUDA images once the CUDA kernel starts.
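The share-group behavior described above can be pictured with a small toy model. This is only an illustrative sketch, not TotalView's implementation: the class names (Image, ShareGroup), the file name main.cu, the image names, and the line numbers are all hypothetical. It shows a breakpoint recorded once at the share-group level, evaluated against every image already present, and planted into a CUDA image when that image is loaded later.

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    """Hypothetical stand-in for an image file (executable, shared library, or CUDA ELF image)."""
    name: str
    source_lines: set  # (file, line) pairs for which this image contains code


@dataclass
class ShareGroup:
    """Toy share group: breakpoints are stored once and evaluated against every image."""
    images: list = field(default_factory=list)
    breakpoints: list = field(default_factory=list)  # requested (file, line) pairs
    planted: list = field(default_factory=list)      # (image name, file, line) triples

    def _plant(self, image, location):
        # Plant the breakpoint only if this image has code at that source location.
        if location in image.source_lines:
            self.planted.append((image.name, *location))

    def set_breakpoint(self, file, line):
        # Record the breakpoint, then evaluate it against all images loaded so far.
        location = (file, line)
        self.breakpoints.append(location)
        for image in self.images:
            self._plant(image, location)

    def load_image(self, image):
        # A newly loaded image is checked against every existing breakpoint.
        self.images.append(image)
        for location in self.breakpoints:
            self._plant(image, location)


group = ShareGroup()
group.load_image(Image("a.out", {("main.cu", 40)}))        # host executable
group.set_breakpoint("main.cu", 12)                         # kernel line; no CUDA image loaded yet
group.set_breakpoint("main.cu", 40)                         # host line
before = list(group.planted)                                # only the host breakpoint is planted
group.load_image(Image("cuda_elf_0", {("main.cu", 12)}))    # kernel starts, CUDA image appears
after = list(group.planted)                                 # kernel breakpoint is now planted too
```

In this sketch the breakpoint on line 12 stays pending until the CUDA image arrives, which mirrors the idea that a breakpoint set before the kernel starts is planted in the CUDA image once it is available.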
Consider a Linux process consisting of two Linux pthreads and two CUDA threads. (A CUDA thread is a CUDA context loaded onto a GPU device.) Figure 264 illustrates how TotalView would group the Linux and CUDA threads.
 
Figure 264 – TotalView CUDA debugging model
The Linux host CUDA process
A Linux host CUDA process consists of:
- A Linux process address space, containing a Linux executable and a list of Linux shared libraries.
- A collection of Linux threads, where a Linux thread:
  - Is assigned a positive debugger thread ID.
  - Shares the Linux process address space with other Linux threads.
- A collection of CUDA threads, where a CUDA thread:
  - Is assigned a negative debugger thread ID.
  - Has its own address space, separate from the Linux process address space, and separate from the address spaces of other CUDA threads.
  - Has a "GPU focus thread", which is focused on a specific hardware thread (also known as a core or "lane" in CUDA lingo).
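The structure listed above can be sketched as a small data model. This is a hedged illustration only: the class names and the sequential numbering (1, 2, ... and -1, -2, ...) are assumptions made for the example. The text states only that Linux threads get positive debugger thread IDs, that CUDA threads get negative ones, and that each CUDA thread has its own address space. The GPU focus thread is left out of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class AddressSpace:
    """Hypothetical stand-in for an address space."""
    name: str


@dataclass
class DebugThread:
    tid: int                     # positive for Linux threads, negative for CUDA threads
    address_space: AddressSpace

    @property
    def is_cuda(self):
        return self.tid < 0


@dataclass
class HostCudaProcess:
    """Toy model of a Linux host CUDA process."""
    linux_space: AddressSpace = field(
        default_factory=lambda: AddressSpace("linux-process"))
    threads: list = field(default_factory=list)

    def add_linux_thread(self):
        # Assumed numbering: next positive ID; all Linux threads share one address space.
        tid = 1 + sum(1 for t in self.threads if not t.is_cuda)
        thread = DebugThread(tid, self.linux_space)
        self.threads.append(thread)
        return thread

    def add_cuda_thread(self):
        # Assumed numbering: next negative ID; each CUDA thread gets its own address space.
        tid = -(1 + sum(1 for t in self.threads if t.is_cuda))
        thread = DebugThread(tid, AddressSpace(f"cuda-context{tid}"))
        self.threads.append(thread)
        return thread


# The example from the text: two Linux pthreads and two CUDA threads.
process = HostCudaProcess()
linux_a = process.add_linux_thread()
linux_b = process.add_linux_thread()
cuda_a = process.add_cuda_thread()
cuda_b = process.add_cuda_thread()
thread_ids = [t.tid for t in process.threads]
```

Here the two Linux threads refer to the same AddressSpace object, while each CUDA thread holds a distinct one, matching the separation described in the list.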
The TotalView CUDA debugging model described above is reflected in both the TotalView user interface and the command line interface (CLI). In addition, CUDA-specific CLI commands let you inspect CUDA threads, change the focus, and display their status. See the dcuda entry in the TotalView for HPC Reference Guide for more information.