Understanding Modern GPUs (III): Command And Setup Logic

Up to this points we have reviewed the common APIs, the host side of the communication between the CPU and the GPU, and how they interact and synchronize. In this new post we will start by exploring the GPU’s citizen in charge of being its interface to the outside, the Command Processor (CP).


Barts architecture diagram (AMD)


We learnt how the CPU sends command to the CPU, with state changes, instructions and data, but this is codified somehow and must be interpreted. A CP keeps track of states within the GPU, updates host mapped registers and signals interrupts to inform the CPU.

In fact, CPs are (afaik) a micro-processors embedded into the GPU capable of doing most of the tasks traditionally handled by software at the driver. Contains an internal memory, can do complex logic, arithmetic operations and so on. Its capable of managing multiple command buffers, keep track of what is sent down into the GPU or update fences once the command stream has reached them.

Her first task is decoding the commands in order to feed other components. It’s also responsible of reading and writing host memory and managing device memory. Managing states is a complex task and in some cases, in order to maintain integrity, a partial pipeline flush is issued before proceeding. This is the worst case as it can serialize everything. This is a world itself and very vendor specific.


The Command Processor manages some dedicated fixed-function logic, a Vertex Assembler (VA), Tessellator, Geometry Assembler (GA), Rasterizer/Interpolator. These elements are responsible of feeding the processing cores with data, they talk with the Thread Scheduler (named GigaThread by NVIDIA) and issue computing tasks in blocks. Fermi architecture seems to be a bit different as these fixed function logic blocks start to become more and more a bottleneck . NVIDIA has opted to duplicated some of these logic or rearrange them into their Stream Processors which would allow many concurrent operations.

Althought I would like to focus these posts for High Performance Computing, a simple explanation on these setup blocks is interesting. For each vertex in a stream there is a set of associated attributes (like normal, binormals…) that need to be fetched (besides the vertex position) and assembled into a block before further processing. This is the primary task for the Vertex Assembler. As the attribute list grows, the performance decreases as more and more data needs to be fetched from memory before it can be processed.

The Tesselator is a highly programable logic block which can perform patch tessellation and feed (back) the pipeline with new vertices and primitives. The Geometry Assembler fetches primitive information along with adjacency information and sends down all these information for further processing inside a Geometry Shader. The Rasterizer emits fragments to be processed by Fragment Shaders. Interpolation used to be a monolithic a operation done inside the rasterizer but nowadays most of the work is done directly inside the Fragment Shader. With the fragment barycentric coordinates you can easily interpolate each defined attribute directly inside the shader (via patching the user shader).

All these setup blocks feed the Stream Processors which are the core of the GPUs. We will review how they work in a few days.