Android Game Development Tips & Tricks

Kill the bugs

It has been several months since I first thought I should post some basic tips and tricks on Android development, specifically game development. Sadly, I have almost no free time nowadays. But here it is, a new blog entry that I hope to expand into a little series. Everything posted here comes from my own experience developing ShadingZen, an open source 3D engine for Android.

This is all about Java, Dalvik and you. We will talk about Java development and the little tricks you can use to make your game shine (ok, maybe not shine, that depends on your graphics people, but at least go faster and avoid slowdowns). If you follow me on Twitter you may recall me saying that the Dalvik GC is slow… oh well, I was wrong: it is not Dalvik but the mobile environment that is slow. If you don’t believe me, create a JNI example, put a delete inside a loop, then profile it.

The latest versions of Dalvik have made a huge leap towards a good all-around-performance VM. For gaming it was critical to release a non-blocking garbage collector: prior to version 2.3 (Gingerbread), it used to pause the whole application for more than 10 ms. Let’s do some simple math: for a 30 fps animation you need to render one frame in approximately 33 ms. If the garbage collector kicks in, your animation will suffer.

Ok, so the garbage collector works as intended. So why do we still need tricks and extra thought on the whole thing? Well… it is a mobile device: all those hi-res bitmaps need to be created and deleted during the lifetime of your application, and if you do this very often it is going to be slow, no matter what. This leads us to the number one tip: don’t allocate memory if you don’t need to.

1. Measure

A claim that something is slow needs proof. You need to start benchmarking your game loop from day 0 of development and make sure nothing breaks your desired frame rate. A good resource for benchmarking is Caliper.
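A minimal frame-time probe can be as simple as timing each iteration of the loop with System.nanoTime() and tracking the average and worst case. This is just a sketch of the idea; the class name GameLoopProbe is hypothetical, not ShadingZen API:

```java
// Minimal per-frame timing probe: feed it each frame's duration and
// check that the worst frame stays under your budget (e.g. 33 ms for 30 fps).
public class GameLoopProbe {
    private long worstNanos = 0;
    private long totalNanos = 0;
    private int frames = 0;

    // Call once per frame with the frame's duration in nanoseconds.
    public void record(long frameNanos) {
        frames++;
        totalNanos += frameNanos;
        if (frameNanos > worstNanos) worstNanos = frameNanos;
    }

    public double averageMillis() { return frames == 0 ? 0 : totalNanos / 1e6 / frames; }
    public double worstMillis()   { return worstNanos / 1e6; }
}
```

In the game loop you would wrap the frame body with `long t0 = System.nanoTime();` before and `probe.record(System.nanoTime() - t0);` after, and log the numbers periodically.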

2. Avoid Allocating Memory

This tip may sound too simple, but it is true: you don’t have much spare memory, so don’t allocate objects that you don’t really need, or that have a more basic counterpart that would do the job.

3. Don’t release memory

– What?! – Yes, don’t release memory: it is slow, and furthermore, you will probably need to use it again, or a partial copy of it. Avoid having a huge for loop with hundreds of iterations where your code creates and discards objects; that is going to hurt you later, when the GC starts collecting that memory.
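Here is the pattern in its simplest form, as a sketch (the Vector3 and Mover classes are made up for illustration, not ShadingZen types): the first method creates garbage on every iteration, the second reuses a single long-lived scratch object and allocates nothing per frame.

```java
// Hypothetical minimal vector type, just for the example.
final class Vector3 {
    float x, y, z;
    Vector3 set(float x, float y, float z) { this.x = x; this.y = y; this.z = z; return this; }
}

final class Mover {
    // Bad: one garbage Vector3 per iteration, collected later at the worst moment.
    void updateAllocating(float[] xs, float dt) {
        for (int i = 0; i < xs.length; i++) {
            Vector3 v = new Vector3().set(1f, 0f, 0f);
            xs[i] += v.x * dt;
        }
    }

    // Better: one long-lived scratch object, reused on every iteration.
    private final Vector3 scratch = new Vector3();
    void updateReusing(float[] xs, float dt) {
        for (int i = 0; i < xs.length; i++) {
            scratch.set(1f, 0f, 0f);
            xs[i] += scratch.x * dt;
        }
    }
}
```

Both methods compute the same result; only the second keeps the GC quiet during the loop.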

There is a very smart way to condense tip 2 and tip 3 into one big idea: object factories. This will be my next tip, for the second part of this series, and I will post some code for you all (although you can go right now to the ShadingZen repository at GitHub and start taking a look!).
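To give a taste of where that is heading, here is a minimal object pool built around a factory, as a sketch (the Pool class and its Factory interface are hypothetical names, not the ShadingZen implementation): instead of allocating and releasing, objects are recycled through a free list.

```java
import java.util.ArrayDeque;

// Minimal object pool: combines tip 2 (avoid allocating) and tip 3
// (don't release) by recycling instances through a free list.
final class Pool<T> {
    interface Factory<T> { T create(); }

    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Factory<T> factory;

    Pool(Factory<T> factory) { this.factory = factory; }

    // Reuse a cached instance when possible; allocate only when the pool is empty.
    T acquire() { return free.isEmpty() ? factory.create() : free.pop(); }

    // Instead of letting the GC collect the object, hand it back for reuse.
    void release(T obj) { free.push(obj); }

    int cached() { return free.size(); }
}
```

A game would keep one pool per hot object type (particles, bullets, scratch vectors) so the steady-state frame allocates nothing at all.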


– There is a great talk by Google’s advocate Chris Pruett on some basic ideas, take a look: Writing Real-Time Games for Android

– Caliper. A benchmarking tool.


Development of Kids 3D Cube

I’m happy to announce that I have released my newest creation, Kids 3D Cube for Android:

For this game a new 2D/3D engine optimized for multi-core phones was created, targeting OpenGL ES 2.0, so it won’t run on old Android versions (below 3.2).

A few interesting things about the 3D engine: it employs a smart task scheduler to parallelize work among all cores. It also has a lazy resource loader and a few other interesting features that I believe are a good starting point for upcoming titles.

I have to confess that mobile GPUs are challenging given their limited resources, but highly rewarding when your little phone starts showing your glorious shaders on screen! 😉 A tip: minimize bandwidth usage; keep as much data on the GPU side and reuse it as much as possible. This is a big win on low-end phones.

I’m actively working on solving a few issues and adding new content to the game, stay tuned!

You can visit Kids 3D Cube fan page at Facebook here:

Understanding Modern GPUs (III): Command And Setup Logic

Up to this point we have reviewed the common APIs, the host side of the communication between the CPU and the GPU, and how they interact and synchronize. In this new post we will start by exploring the GPU’s citizen in charge of being its interface to the outside world, the Command Processor (CP).


Barts architecture diagram (AMD)


We learnt how the CPU sends commands to the GPU, with state changes, instructions and data, but these are encoded somehow and must be interpreted. A CP keeps track of states within the GPU, updates host-mapped registers and signals interrupts to inform the CPU.

In fact, a CP is (as far as I know) a micro-processor embedded into the GPU, capable of doing most of the tasks traditionally handled by software in the driver. It contains internal memory and can do complex logic, arithmetic operations and so on. It is capable of managing multiple command buffers, keeping track of what is sent down into the GPU, and updating fences once the command stream has reached them.

Its first task is decoding the commands in order to feed other components. It is also responsible for reading and writing host memory and managing device memory. Managing states is a complex task and in some cases, in order to maintain integrity, a partial pipeline flush is issued before proceeding. This is the worst case, as it can serialize everything. This is a world in itself and very vendor-specific.


The Command Processor manages some dedicated fixed-function logic: a Vertex Assembler (VA), a Tessellator, a Geometry Assembler (GA), and a Rasterizer/Interpolator. These elements are responsible for feeding the processing cores with data; they talk with the Thread Scheduler (named GigaThread by NVIDIA) and issue computing tasks in blocks. The Fermi architecture seems to be a bit different, as these fixed-function logic blocks have started to become more and more of a bottleneck. NVIDIA has opted to duplicate some of this logic or rearrange it into their Stream Processors, which allows many concurrent operations.

Although I would like to focus these posts on High Performance Computing, a simple explanation of these setup blocks is interesting. For each vertex in a stream there is a set of associated attributes (like normals, binormals…) that need to be fetched (besides the vertex position) and assembled into a block before further processing. This is the primary task of the Vertex Assembler. As the attribute list grows, performance decreases, since more and more data needs to be fetched from memory before it can be processed.

The Tessellator is a highly programmable logic block which can perform patch tessellation and feed (back) the pipeline with new vertices and primitives. The Geometry Assembler fetches primitive information along with adjacency information and sends all this information down for further processing inside a Geometry Shader. The Rasterizer emits fragments to be processed by Fragment Shaders. Interpolation used to be a monolithic operation done inside the rasterizer, but nowadays most of the work is done directly inside the Fragment Shader: with the fragment’s barycentric coordinates you can easily interpolate each defined attribute directly inside the shader (via patching the user shader).
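The barycentric interpolation mentioned above is just a weighted sum: the fragment’s coordinates (u, v, w), with u + v + w = 1, weight the attribute values at the triangle’s three vertices. A minimal sketch (the class name is made up; a real shader would do this per attribute component):

```java
// Per-fragment attribute interpolation using barycentric coordinates.
// a0, a1, a2 are the attribute values at the triangle's three vertices;
// (u, v, w) are the fragment's barycentric weights, with u + v + w = 1.
final class Barycentric {
    static float interpolate(float a0, float a1, float a2, float u, float v, float w) {
        return u * a0 + v * a1 + w * a2;
    }
}
```

At a vertex, one weight is 1 and the others are 0, so the attribute is reproduced exactly; anywhere inside the triangle the result blends the three vertex values.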

All these setup blocks feed the Stream Processors which are the core of the GPUs. We will review how they work in a few days.

Understanding Modern GPUs (I): Introduction


From time to time I’m asked if I could explain “how graphics work”, which is obviously a very broad question. Instead of giving the usual answer, that is, how to draw primitives using vertices, matrices, textures and so on, I would like to use my blog to write down all my current (limited) knowledge about the low-level side of computer graphics: how the API (OpenGL, DirectX, etc.) talks to the hardware and how data is moved to and from these computing monsters (GPUs). I used the word “computing” on purpose: today’s GPUs are designed to handle massive amounts of vector processing, which makes them suitable for High Performance Computing (HPC) besides drawing your current AAA title.

nVidia Fermi GF100 Graphics Processor

Btw, I have found other blogs explaining similar stuff here:

Some of the explanations given are speculative thoughts or things I have read in the past, which may not be totally correct or may be completely wrong. Vendors don’t want to expose their architectures, but a lot can be learnt by reading their APIs, documents and third-party investigations (OpenCL is a good start, as many concepts throughout the API seem to map directly onto how the hardware implements those operations). You may want to check Intel, AMD and NVIDIA documents on their sites. I recommend reading the following papers to get more in-depth information about GPUs:


Before anything, I want to make a quick and dirty introduction to a common 3D graphics pipeline, and I will later make a short introduction to OpenCL/HPC. If you need more information, the concepts explained in this section may help you direct your questions to Google more easily.

Before the age of today’s highly programmable GPUs there was a set of APIs with fixed pipelines. Those pipelines had some stages that are still somehow present in current 3D engines, but it’s up to the user to implement/port such functionality or ignore it altogether. A common set of stages of a modern graphics pipeline (very simplified) can be found in the following figure:

Simplified Graphics Pipeline

You start with a stream of vertices and a set of shaders, small programs that run inside the GPU and transform those vertices (for example by applying some space transformations and a projection to obtain their position in screen space). This is quite oversimplified; modern APIs add more stages after the Programmable Vertex Processor which let you emit new vertices or primitives, but all this falls outside the scope of this writing. From the vertex stream the pipeline builds primitives (points, lines and triangles), which move into the rasterization stage, where the user can define a Fragment Shader to run for each emitted fragment. Those fragments may, after the z-test and some final decisions, end up written into the image buffer. This logic is highly programmable through the use of different shaders.

The full Shader Model 5.0 pipeline found in DirectX

The point behind all this is how the GPU architecture exploits the fact that the input and output data can be isolated from other packets inside the data stream; that is, GPUs exploit data-level parallelism. For us, those data packets are vertices and pixels. If you enforce this in your design you can operate on hundreds of vertices and pixels in parallel.
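To make the idea concrete, here is the same principle expressed on the CPU, as a sketch (this is plain Java, not GPU code; the class name is made up): because each output element depends only on its own input element, the runtime is free to compute all of them in parallel.

```java
import java.util.Arrays;

// Each "packet" (here a single vertex coordinate) is independent of the
// others, so the transform can legally run over all of them in parallel.
final class DataParallel {
    static double[] scaleAll(double[] xs, double scale) {
        double[] out = new double[xs.length];
        // parallelSetAll computes out[i] = f(i) concurrently; this is only
        // safe because element i never reads or writes any other element.
        Arrays.parallelSetAll(out, i -> xs[i] * scale);
        return out;
    }
}
```

A GPU applies the same contract at a much larger scale: the vertex and fragment shaders are the per-element function, and the hardware runs them over hundreds of elements at once.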


This paradigm shift, from sequential processing to parallel SIMD processing, can be exploited to do any kind of repetitive computation besides transforming vertices and calculating pixel values. OpenCL is an API to access the GPU as a computational device which runs small programs (kernels) that operate over some data. The idea behind OpenCL is to run a kernel over each point in a problem domain, assuming that each point is highly isolated from the other points and thus can be processed in parallel (like the vertices and pixels in our graphics pipeline introduction). The API defines an execution model and a memory model that map to the constraints found in GPUs. A running program operating over a point in a domain is called a work-item (like a thread). You must define an N-dimensional domain space of work-items and group sets of work-items into work-groups. Each work-group can share a common memory region private to that work-group.

OpenCL memory model

To maximize throughput you want to load all the stream processors inside the GPU and fully utilize other areas of the die, like the texture units/memory fetchers, because they work in parallel with the ALUs. Furthermore, round-trips to host memory are slow and must be minimized. We will see this in more detail in future posts.

The reader can find a nice introduction to OpenCL here:

In my next post (which is almost ready) we will talk about more low-level stuff: how the APIs talk to the driver and the hardware, and how a GPU is orchestrated to run in parallel with the CPU.