This is the second post on Understanding Modern GPUs where we will review the driver, the data flow to the GPU and what modules are involved. In my previous post we talked about the software front end of the GPU, this post and the following ones are going to be more hardware related.
USER SPACE AND KERNEL SPACE COMMAND FLOW
Pick any API of your choice, let it be OpenGL, DirectX, OpenCL any mix of these or other APIs (yes, you can mix OpenGL and OpenCL), they implement many functionality in user space. For example in Windows Vista/7 the Windows Display Driver Model (http://msdn.microsoft.com/en-us/library/ff570593.aspx) you can find the following diagram very self explanatory:
User space and kernel space display drivers model
Each application using a graphics API loads into their private process memory a set of driver/API functionality which is not shared among other processes. There you can find the command buffer, where all your APIs calls are transformed and stored sequentially before being transfered to the side of the driver residing in kernel model. The per process command buffer is vendor dependent and may contain the actual commands given to the GPU. Also this is the point where shaders are compiled in runtime (in user space) and the driver may inject custom code (therefore patching the shader) to implement specific functionality that doesn’t translate directly to hardware. This part is very specific to the GPU being used.
Being the GPU a resource shared among different processes, there must be a mechanism to ensure not only that the commands are executed in order, but that the GPU can be used by all those processes and that there is no data corruption. This is done by the device driver executing in kernel model. There you can find the scheduler and the final Command Ring Buffer that is used to talk to the GPU. At the other side of the Command Ring Buffer is the Command Processor (CP) which reads from the stream, decodes the command and feeds the Threads/Stream Scheduler (we will talk about this in other post).
Simplified Ring Buffer (Command Buffer)
The kernel driver scheduler reads from each individual (process) command buffer and moves them to the DMA command buffer. In fact, that would be a waste of resources. Current GPUs contain a DMA controller and a MMU. The first one lets the GPU talk directly to host RAM to fetch and write data without CPU intervention. The MMU virtualizes GPU/host memory and offers some memory protection. For example the Fermi architecture contains 2 DMA engines to get a two-way simultaneous transfers. Another possibility is transferring data between GPUs without CPU intervention. It’s important to notice that those transfers are executed in parallel with the main command buffer, thus adding another level of parallelism to exploit.
We send a command to the GPU to fetch data from a memory region using its DMA engines, instead of transferring directly,this way we can create different command buffers, one for each user-space driver and let the GPU fetch them.
Settings a Command Buffer in DirectX: http://msdn.microsoft.com/en-us/library/ff569747(v=VS.85).aspx
Basically through the command buffer you set some states in the GPU, set it to fetch data and issue execution orders. In older days user APIs had a big drawback, you had to specify you primitives by commands directly to the API, for example glBegin/glEnd. Those harmfull calls are now removed from OpenGL ES for example, as they performance killers on modern graphics cards. You can think of the CPU and the GPU as two threads that communicate through the Command Ring Buffer. Its a ring (FIFO) that is filled by the CPU and read by the GPU until its drained. If the ring is empty (write and read pointers are equal) the GPU stalls and waits until has something to do. In this case you are probably CPU bounded. If the CPU fills the entire buffer and has to wait for some free space, you are GPU bounded.
The above figure shows the Host and the Graphics Controller (Command Processor) connected through the Ring Buffer (RB). The RB is initialized with a fixed buffer size and both Write and Read Pointers are set to zero (empty buffer). The driver adds packets into the RB and updates the Write Pointer register inside the device. When the device reads packets, updates the Read Pointer. Updating both pointers incurs in some overhead that can be mitigated by only updating these register when some a block of data has been consumed (by grouping packets in blocks) instead of doing so for each packet. This also needs more logic in both sides to avoid writing when the RB is full (more info here http://developer.amd.com/gpu_assets/R5xx_Acceleration_v1.2.pdf although a bit out-dated).
This command stream adds some other synchronization issues that must be taken care of. Imagine the following: you create a huge data array which is going to be processed, but once the GPU has finish fetching from the main memory region we would like to update it as soon as possible with new data. How does the CPU know that some commands have been processed so that we can update the array? Remember that this is implemented by a pointing the GPU to fetch it from memory but meanwhile, both the GPU and the CPU can work in parallel to this fetch. The solution is in fact very simple. There are some command types embedded into the command stream called fences (I found this patent by VIA about the matter 😉 http://www.patentgenius.com/patent/7755632.html). Those fences are read by the GPU which updates some register so that the CPU knows that we are up to that point in the stream.
What happens to your OpenCL kernel or vertex shader up to this point? The kernel code was compiled to an intermediate language by the user-space driver (PTX for CUDA devices or AMD IL for AMD devices for example). The code is then compiled to the specific hardware by the driver and passed to the GPU. Since some GPUs may have missing functionality or needs different steps to compute some function, the code needs to be targeted to that hardware running on the computer. For example double floating poins operations may need additional passes to obtain the desired accuracy on hardware lacking a dedicated double precision float processor. Some GPU architectures sacrifice IEEE compliance while others lack double-precision altogether.
Some of those concepts we have been talking about can be seen in the OpenCL API (remember I told you there was some kind of direct mapping between this API and how the hardware actually works from a logical view). You create a command buffer using clCreateCommandQueue, you enqueue a read from device memory using clEnqueueReadBuffer, you run a computing kernel using clEnqueueNDRangeKernel and so on.
On the next episode, we will talk about the Command Processor and some setup logic. Stay tunned!