From time to time I’m asked if I could explain “how graphics work”, which is obviously a very broad question. Instead of giving the usual answer, that is, how to draw primitives using vertices, matrices, textures and so on, I would like to use my blog to write down my current (limited) knowledge about the low-level side of computer graphics: how the API (OpenGL, DirectX, etc.) talks to the hardware and how data is moved to and from these computing monsters (GPUs). I said the word “computing” on purpose: today’s GPUs are designed to handle massive amounts of vector processing, which makes them suitable for High Performance Computing (HPC) besides drawing your current AAA title.
Btw, I have found another blog explaining similar stuff here: http://fgiesen.wordpress.com/2011/07/01/a-trip-through-the-graphics-pipeline-2011-part-1/
Some of the explanations given here are speculative, or come from things I have read in the past, so they may not be totally correct or may even be completely wrong. Vendors don’t want to expose their architecture, but a lot can be learnt by reading their APIs, documents and third-party investigations (http://www.icare3d.org/GPU/CN08). OpenCL is a good start, as many concepts in the API seem to map directly onto how the hardware implements those operations. You may want to check the Intel, AMD and NVIDIA documents on their sites. I recommend reading the following papers to get more in-depth information about GPUs:
3D GRAPHICS PIPELINE INTRODUCTION
Before anything else, I want to give a quick and dirty introduction to a common 3D graphics pipeline, followed by a short introduction to OpenCL/HPC. If you need more information, the concepts explained in this section should help you direct your questions to Google more easily.
Before the age of today’s highly programmable GPUs there was a set of APIs with fixed pipelines. Those pipelines had stages that are still present in some form in current 3D engines, but it’s now up to the user to implement/port such functionality or ignore it altogether. A common set of stages of a modern graphics pipeline (very simplified) can be found in the following figure:
You start with a stream of vertices and a set of shaders, small programs that run inside the GPU and transform those vertices (for example by applying some space transformations and a projection to obtain their position in screen space). This is quite oversimplified: modern APIs add more stages after the Programmable Vertex Processor which let you emit new vertices or primitives, but all of that falls outside the scope of this writing. From the vertex stream the pipeline builds primitives (points, lines and triangles) and moves them into the rasterization stage, where the user can define a Fragment Shader to run for each emitted fragment. After a z-test and some final decisions, those fragments may end up written into the image buffer. This logic is highly programmable through the use of different shaders.
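To make the vertex and fragment steps a bit more concrete, here is a minimal sketch in plain Python (no GPU involved). Everything in it is an assumption made for illustration: the function names are made up, the projection matrix follows the OpenGL convention, the viewport is 640x480, and the z-test keeps the fragment with the smaller depth. A real pipeline does much more than this.

```python
import math

def perspective(fov_y_deg, aspect, near, far):
    """Build an OpenGL-style perspective projection matrix (row-major)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def vertex_shader(position, projection):
    """Toy vertex stage: view-space xyz in, clip-space xyzw out."""
    x, y, z = position
    return mat_vec(projection, [x, y, z, 1.0])

def to_screen(clip, width, height):
    """Perspective divide, then viewport mapping to pixel coordinates."""
    x, y, z, w = clip
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w
    sx = (ndc_x * 0.5 + 0.5) * width
    sy = (1.0 - (ndc_y * 0.5 + 0.5)) * height  # flip y so row 0 is the top
    return sx, sy, ndc_z

def depth_test_and_write(depth_buffer, image_buffer, x, y, depth, color):
    """Toy z-test: keep the fragment only if it is closer than what is stored."""
    if depth < depth_buffer[y][x]:
        depth_buffer[y][x] = depth
        image_buffer[y][x] = color
        return True
    return False

# Vertex stage: three view-space vertices in front of the camera (negative z).
projection = perspective(90.0, 1.0, 0.1, 100.0)
triangle = [(-1.0, -1.0, -2.0), (1.0, -1.0, -2.0), (0.0, 1.0, -2.0)]
screen_triangle = [to_screen(vertex_shader(v, projection), 640, 480) for v in triangle]

# Fragment stage: two fragments compete for the same pixel of a 2x2 buffer.
depth_buffer = [[1.0] * 2 for _ in range(2)]
image_buffer = [[None] * 2 for _ in range(2)]
depth_test_and_write(depth_buffer, image_buffer, 0, 0, 0.5, "red")   # passes
depth_test_and_write(depth_buffer, image_buffer, 0, 0, 0.9, "blue")  # farther away, rejected
```

Note that each vertex and each fragment is handled by a call that only looks at its own input, which is exactly the property the hardware relies on.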
The point behind all this is how the GPU architecture exploits the fact that each input and output data packet can be isolated from the other packets in the data stream; that is, GPUs exploit data-level parallelism. For us, those data packets are vertices and pixels. If you enforce this isolation in your design you can operate on hundreds of vertices and pixels in parallel.
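A small way to see why this isolation matters: if a per-pixel function depends only on its own input, the work can be split across any number of workers, in any order, and the result is the same as the sequential one. The sketch below uses Python threads purely to illustrate that property; `shade` is a made-up function, and Python threads will not give you GPU-like speedups.

```python
from concurrent.futures import ThreadPoolExecutor

def shade(pixel):
    """Hypothetical per-pixel function: it reads nothing but its own input."""
    r, g, b = pixel
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    return (gray, gray, gray)

# A tiny "image": 256 independent RGB pixels.
pixels = [(i / 255.0, 0.5, 1.0 - i / 255.0) for i in range(256)]

# Sequential version: one pixel after another.
sequential = [shade(p) for p in pixels]

# Parallel version: the pixels are handed out to 8 workers.
# No locks or ordering rules are needed because no pixel depends on another.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(shade, pixels))
```

As soon as one pixel needs the result of another, this simple split stops working, which is why the isolation has to be part of the design.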
This paradigm shift, from sequential processing to parallel SIMD processing, can be exploited to do any kind of repetitive computation besides transforming vertices and calculating pixel values. OpenCL is an API to access the GPU as a computational device which runs small programs (kernels) that operate over some data. The idea behind OpenCL is to run a kernel over each point in a problem domain, assuming that each point is highly isolated from the other points and thus can be executed in parallel (like the vertices and pixels in our graphics pipeline introduction). The API defines an execution model and a memory model that map to the constraints imposed by the GPUs. A running instance of a kernel operating over one point in the domain is called a work-item (like a thread). You must define an N-dimensional domain space of work-items and group sets of work-items into work-groups. The work-items in a work-group can share a common memory region private to that work-group (local memory in OpenCL terms).
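The execution model is easier to picture with a toy emulation. The sketch below is plain Python, not the real OpenCL API: `run_ndrange` and `square_kernel` are hypothetical names, the domain is 1-dimensional, and the work-items run one after another instead of in parallel. The ids mirror what the OpenCL C built-ins `get_global_id(0)`, `get_local_id(0)` and `get_group_id(0)` would return inside a real kernel.

```python
def square_kernel(global_id, local_id, local_mem, data):
    """Toy kernel: each work-item squares one element into its group's local memory."""
    local_mem[local_id] = data[global_id] ** 2

def run_ndrange(kernel, global_size, local_size, data):
    """Sequentially emulate a 1D launch of global_size work-items (hypothetical helper)."""
    assert global_size % local_size == 0, "global size must be a multiple of local size"
    partial_sums = []
    for group_id in range(global_size // local_size):
        # Local memory: visible to the work-items of this work-group only.
        local_mem = [0] * local_size
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            kernel(global_id, local_id, local_mem, data)
        # On a real device the work-items run concurrently, so the kernel would
        # need barrier(CLK_LOCAL_MEM_FENCE) before anyone reads local_mem.
        # Here the inner loop finishing plays the role of that barrier.
        partial_sums.append(sum(local_mem))
    return partial_sums

data = list(range(8))                      # 8 points in the problem domain
partial_sums = run_ndrange(square_kernel, global_size=8, local_size=4, data=data)
total = sum(partial_sums)                  # final reduction done on the "host"
```

With 8 work-items and a work-group size of 4 there are two work-groups; each one produces a partial sum from its own local memory, and the host combines them.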
To maximize throughput you want to keep all the stream processors inside the GPU busy and fully utilize other areas of the die, like the texture units/memory fetchers, because they work in parallel with the ALUs. Furthermore, round-trips to host memory are slow and must be minimized. We will see this in more detail in future posts.
The reader can find a nice introduction to OpenCL here: http://www.amd.com/us/products/technologies/stream-technology/opencl/pages/opencl-intro.aspx
In my next post (which is almost ready) we will talk about more low-level stuff: how the APIs talk to the driver and the hardware, and how a GPU is orchestrated to run in parallel with the CPU.