Android Game Development Tips & Tricks

Kill the bugs

It has been several months since I first thought I should post some basic tips and tricks on Android development, specifically game development. Sadly I have almost no free time nowadays. But here it is, a new blog entry I hope I can expand into a little series. Everything posted here comes from my own experience developing ShadingZen, an open source 3D engine for Android.

This is all about Java, Dalvik and you. We will talk about Java development and the little tricks you can use to make your game shine (OK, maybe not shine, that depends on your graphics people, but to run faster and avoid slowdowns). If you follow me on Twitter you may recall me saying the Dalvik GC is slow… oh well, I was wrong: it is not Dalvik but the mobile environment that is slow. If you don’t believe me, create a JNI example, put a delete inside a loop, then profile.

Latest versions of Dalvik have made a huge leap towards a good all-around-performance VM. For gaming it was critical to release a non-blocking garbage collector; prior to version 2.3 (Gingerbread) it used to pause the whole application for more than 10ms. Let’s do some simple math: for a 30fps animation you need to render one frame in approximately 33ms. If the garbage collector kicks in, your animation will suffer.

OK, so the garbage collector works as intended, so why do we still need to use tricks and put more thought into the whole thing? Well… it is a mobile device: all those hi-res bitmaps need to be created and deleted during the life of your application, and if you do that very often it is going to be slow, no matter what. This leads us to the number one tip: don’t allocate memory if you don’t need to.

1. Measure

A claim that something is slow needs some proof. You need to start benchmarking your game loop from day 0 of development and make sure nothing breaks your desired frame rate. A good resource for benchmarking is Caliper.
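Caliper is great for isolated micro-benchmarks; inside the running game, a minimal frame-time monitor is often enough to catch a broken frame rate early. The sketch below is my own illustration (the `FrameTimer` class and its names are not part of any SDK or of ShadingZen); call `tick()` once per frame from your game loop:

```java
// Minimal frame-time monitor (illustrative sketch, not engine code).
public class FrameTimer {
    private static final long BUDGET_NS = 33_000_000L; // ~30fps frame budget

    private long lastTickNs = 0;
    private long slowFrames = 0;

    /** Call once per frame; returns the previous frame's duration in ns. */
    public long tick() {
        long now = System.nanoTime();
        long frameNs = (lastTickNs == 0) ? 0 : now - lastTickNs;
        lastTickNs = now;
        if (frameNs > BUDGET_NS) {
            slowFrames++; // in a real game, report via android.util.Log
        }
        return frameNs;
    }

    public long slowFrameCount() { return slowFrames; }
}
```

Logging only the frames that blow the budget keeps the monitor itself cheap enough to leave enabled during development.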

2. Avoid Allocating Memory

This tip may look too simple, but it is true: you don’t have much spare memory, so don’t allocate objects that you don’t really need or that have a more basic counterpart that would do the job.

3. Don’t release memory

– What?! – Yes, don’t release memory: it is slow, and furthermore, you will probably need to use it again, or a partial copy of it. Avoid having a huge for loop with hundreds of iterations where your code creates and discards objects; that’s going to hurt you later, when the GC starts collecting that memory.

There is a very smart way to condense tip 2 and tip 3 into just one big idea: object factories. This will be my next tip, for the second part of this blog entry series. I will post some code for you all (although you can go right now to the ShadingZen repository at GitHub and start taking a look!).
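As a teaser of the factory idea, here is a minimal generic pool that combines tips 2 and 3: obtain instead of `new`, recycle instead of letting the GC collect. The interface below is my own illustration, not ShadingZen's actual factory API:

```java
import java.util.ArrayDeque;

// Minimal object pool: reuse instances instead of allocating/collecting them.
public class ObjectPool<T> {
    public interface Factory<T> { T create(); }

    private final ArrayDeque<T> free = new ArrayDeque<T>();
    private final Factory<T> factory;

    public ObjectPool(Factory<T> factory) { this.factory = factory; }

    /** Reuse a pooled instance if one exists; allocate only as a last resort. */
    public T obtain() {
        T obj = free.poll();
        return (obj != null) ? obj : factory.create();
    }

    /** Return an object to the pool instead of handing it to the GC. */
    public void recycle(T obj) {
        free.push(obj);
    }
}
```

Inside the game loop you would call `pool.obtain()` where you previously wrote `new Bullet()`, and `pool.recycle(bullet)` when it leaves the screen, so steady-state gameplay allocates nothing at all.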


– There is a great talk by Google’s advocate Chris Pruett on some basic ideas, take a look at it: Writing Real-Time Games for Android

– Caliper. A benchmarking tool.

Open code and you


Moving some of my projects into the wild, I mean to GitHub, has made me realize what others already knew: much of the software we use every day is open source or has some roots there. Be honest, peeking at someone else’s software is great. In my own experience, I started learning programming as a child reading the source code of the Doom Editing Utilities (and that probably shifted and twisted my understanding of computers for my entire life). This concept of seeing others’ work and learning from it is a hidden fact that affects our lives through, for example, every gadget we use (read Android/iOS). And it is not only for learning: to understand what a piece of software does, its source code is the ultimate documentation. If you are using a third party library, ask for its source code and peek at it when needed. At Video Stream Networks we took a similar approach: we had our domain logic written in C++, and having to tune it for every customer’s workflow made us feel miserable. After we moved our domain logic into Casper, a Domain Specific Language for the media and broadcast world, our customers can not only peek at what our app does but understand and fine-tune it themselves.

Thanks to GitHub and the Open Source Community, the ShadingZen Engine is slowly maturing. The GitHub repo received a pull request with some great changes. The project now has a Maven build, the structure is clearer with all the examples and documentation centralized in just one repository, and the Eclipse requirement has been removed (this last point allowed me to try out IntelliJ IDEA, which so far I find to be a really good Java IDE).

Yes, open source level editors, as well as a robustly designed and extensible engine, were behind the success of DOOM, the very same game that made some of us programmers.

Towards ShadingZen 1.0beta2

Development of ShadingZen is approaching version 1.0 beta 2, and a new minor update has been rolled out, ready to be cloned/forked by you at ShadingZen’s GitHub repository.

The primary goal for this milestone (v1.0 beta 2) is to provide better documentation, ranging from API documentation to useful examples for the wiki.

Secondary goals are to improve performance, mainly in areas where we can use object pools to avoid garbage collection frame rate drops. In fact, RenderTasks have been refactored and now use a global shared pool manager which creates and reuses RenderTask objects. This gives a performance boost but increases memory usage.

ShadingZen is a 2D/3D Engine for Android OpenGL ES 2.0 and is open source under the MIT License.

Android realtime performance tips

Embedded programming has never been easier since the introduction of modern mobile APIs like the Android SDK and iOS SDK. Nevertheless, for realtime applications new potential bottlenecks may arise, as those extra layers add more complexity to your application.

A clear example happens with the Dalvik GC (Garbage Collector): coupled with a realtime 2D/3D engine generating many objects for each frame, it will (for sure) cause frame drops whenever the GC kicks in. This is hard to solve, as Java makes it really easy to create new objects that encapsulate your required functionality but hides from you how and when the memory will be collected. Hey! It creates objects everywhere: for iterators, enums, sorting algorithms… I personally think Dalvik needs some improvement in memory management, but meanwhile we just need to avoid those problems and minimize them as much as possible.
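One concrete example of those hidden allocations: a for-each loop over a `List` allocates an `Iterator` object on every pass, while a plain indexed loop does not. A small sketch (`LoopStyles` is just an illustrative name):

```java
import java.util.List;

public class LoopStyles {
    /**
     * Sums a list without the hidden Iterator allocation that
     * "for (Integer x : xs)" would create on each call.
     */
    public static int sumIndexed(List<Integer> xs) {
        int sum = 0;
        for (int i = 0; i < xs.size(); i++) { // no Iterator object created
            sum += xs.get(i);
        }
        return sum;
    }
}
```

One iterator per frame is nothing; one iterator per entity per frame is garbage the GC will eventually make you pay for.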

Don’t create objects! No, seriously, don’t create new objects in your game loop. Use object pools as much as possible. This is an area the ShadingZen engine is improving, and it is one of the reasons you should always create new actors using the “spawn” method.

Don’t call your own methods: use the object’s properties directly within the object’s code, and reserve methods for accessing functionality from the outside. Also avoid getters and setters; pack functionality into just one method call instead of fetching properties from outside. For example, if you want to make an actor explode, you might be tempted to compute the explosion velocity and the actor’s final destination from outside. Instead, create a “makeExplode” method and compute everything within the object’s code. Dalvik makes calling methods slow.
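A sketch of that idea (`Actor` and `makeExplode` are illustrative names, not ShadingZen's actual API): everything is computed inside the object in a single call, with direct field access instead of getters and setters.

```java
public class Actor {
    float x, y;          // position, accessed directly inside the class
    float vx, vy;        // velocity
    boolean exploding = false;

    // Instead of the caller doing getX()/getY(), computing outside, then
    // setVelocity(...), pack the whole computation into one method call:
    public void makeExplode(float blastX, float blastY, float force) {
        float dx = x - blastX;
        float dy = y - blastY;
        float len = (float) Math.sqrt(dx * dx + dy * dy);
        if (len < 1e-6f) len = 1e-6f;  // avoid division by zero at the blast center
        vx = (dx / len) * force;       // push the actor away from the blast
        vy = (dy / len) * force;
        exploding = true;
    }
}
```

One method call crosses the object boundary once, instead of the four or five getter/setter calls the outside-in version would need.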

If you are using OpenGL ES, avoid changing states: pack drawing calls sharing the same state and run them all at once.
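The usual way to pack calls is to sort the frame's draw calls by a packed state key (for example shader id plus texture id) so that each state is bound once per group. The sketch below uses plain Java with no real GL calls; `DrawCall` and `stateKey` are my own illustrative names:

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class DrawBatcher {
    public static class DrawCall {
        final int stateKey; // e.g. (shaderId << 16) | textureId
        public DrawCall(int stateKey) { this.stateKey = stateKey; }
    }

    /** Sorts calls so equal states are adjacent; returns the state-change count. */
    public static int stateChanges(List<DrawCall> calls) {
        Collections.sort(calls, new Comparator<DrawCall>() {
            public int compare(DrawCall a, DrawCall b) {
                return (a.stateKey < b.stateKey) ? -1
                     : (a.stateKey > b.stateKey) ? 1 : 0;
            }
        });
        int changes = 0;
        int current = -1;
        for (DrawCall c : calls) {
            if (c.stateKey != current) {
                changes++;            // here you would bind shader/texture once
                current = c.stateKey;
            }
            // ...issue glDrawArrays/glDrawElements for this call...
        }
        return changes;
    }
}
```

With four calls alternating between two states, the unsorted order would cause four binds; sorted, only two.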

DDMS is your friend. I know how much you hate its awkward interface, but you need it: profile often!

Check this paper; it contains basic guidelines for avoiding performance bottlenecks in your realtime applications:

Performance is an area where the next version of ShadingZen, 1.0-beta2, is receiving much love.

ShadingZen 3D Engine open sourced!

I’m showing you the code!

I have decided to open source my 2D/3D engine for Android; it is currently available at GitHub under the MIT License [put random reason here].

The goal behind ShadingZen is to offer a simple framework on which you can build mobile games easily, but without leaving performance behind, such as stressing out the multicore CPUs found in modern phones/tablets. I have borrowed some ideas from Cocos2D that I find really useful, like Actions and Transitions.

Go clone it!

I’m using my spare time to create some HOWTOs and examples. I would also like to write down some core concepts of the engine. For more info, keep an eye on future changes at the GitHub wiki here.

Some notes I’m preparing for future documentation sections.

GPU Compute Slides

Some interesting slides on GPU compute aimed at 3D engine programmers: they go beyond rendering, showing how the GPU can be useful in other situations to offload work, and they contain some smart ideas on improving overall algorithm performance. Worth a read.

Development of Kids 3D Cube

I’m happy to announce that I have released my newest creation, Kids 3D Cube for Android:

For this game a new 2D/3D engine optimized for multicore phones was created, targeting OpenGL ES 2.0, so it won’t run on old Android versions (below 3.2).

A few interesting things about the 3D engine: it employs a smart task scheduler to parallelize work among all cores. It also has a lazy resource loader and a few other interesting features that I believe are a good starting point for upcoming titles.

I have to confess that mobile GPUs are challenging given their limited resources, but highly rewarding when your little phone starts showing your glorious shaders on screen! 😉 A tip: minimize bandwidth usage; keep as much as possible on the GPU side and reuse it. This is a big win on low-end phones.

I’m actively working on solving a few issues and adding new content to the game, stay tuned!

You can visit the Kids 3D Cube fan page on Facebook here:

Android MediaRecorder with Camera

Our animated GIF and stop-motion maker application (Gifagram) relies heavily on Android camera support. Two good starting points for camera and video recording on Android are:

The first link is the Android SDK MediaRecorder class and the second one is a lengthy example.

Unfortunately, the SDK information on camera and video is poor, and hidden behavior is largely undocumented.

For instance, MediaRecorder.setProfile(‘HIGH’) sets an invalid video size on the Samsung Galaxy SII (1080 height, instead of the 1088 the hardware requires). The same code throws an exception on the HTC Wildfire, although on other HTC phones everything works great. This is heavily hardware dependent, which can drive you crazy.

Video recording using a Camera object

The following code snippet sets up the camera:

boolean setupVideoRecording(int use_profile){
    // Assumes these fields on the enclosing class:
    //   Camera camera; MediaRecorder recorder; Context context;
    //   int videoWidth, videoHeight; // carry the chosen size to initializeMediaRecorder()
    boolean ret = true;
    try {
        if(recorder != null){
            recorder.release();
            recorder = null;
        }

        if(camera == null){
            camera = Camera.open();
        }

        CamcorderProfile profile;
        if(use_profile == 0){
            profile = CamcorderProfile.get(CamcorderProfile.QUALITY_LOW);
        } else {
            profile = CamcorderProfile.get(CamcorderProfile.QUALITY_HIGH);
        }

        Camera.Parameters params = camera.getParameters();
        List<Camera.Size> sizes = params.getSupportedPreviewSizes();

        Camera.Size optimalSize =
            getOptimalPreviewSize(sizes, profile.videoFrameWidth, profile.videoFrameHeight);
        params.setPreviewSize(optimalSize.width, optimalSize.height);
        camera.setParameters(params);

        setCameraDisplayOrientation((Activity)context, 0, camera);

        recorder = new MediaRecorder();

        // MediaRecorder only accepts setVideoSize() after the output format is
        // set (done in initializeMediaRecorder() below), so just remember which
        // size to use.
        if(use_profile == 2){
            videoWidth = optimalSize.width;        // size the hardware actually supports
            videoHeight = optimalSize.height;
        } else {
            videoWidth = profile.videoFrameWidth;  // size reported by the profile
            videoHeight = profile.videoFrameHeight;
        }
    } catch(Exception e){
        String message = e.getMessage();
        ret = false;
    }

    return ret;
}

We pass a use_profile integer to choose between some hardware-dependent code paths. For almost everything, use_profile == 2 is the right one for you.

And the following prepares the media recorder. The two functions can be called one after the other to completely set up the camera for video recording.

boolean initializeMediaRecorder(){
    try {
        camera.unlock();  // hand the camera over to MediaRecorder
        recorder.setCamera(camera);
        recorder.setVideoSource(MediaRecorder.VideoSource.CAMERA);
        recorder.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4);
        recorder.setVideoEncoder(MediaRecorder.VideoEncoder.H264);
        recorder.setVideoSize(videoWidth, videoHeight); // size chosen above
        recorder.setOutputFile(outputPath);             // path supplied by the caller
        recorder.prepare();
        return true;
    } catch(Exception e){
        String message = e.getMessage();
        return false;
    }
}

This code is quite simple, but I hope it can be the foundation of something more elaborate for someone. Please feel free to leave a comment if you have a tip that you would like to share with everyone.

Understanding Modern GPUs (III): Command And Setup Logic

Up to this point we have reviewed the common APIs, the host side of the communication between the CPU and the GPU, and how they interact and synchronize. In this new post we will explore the GPU citizen in charge of being its interface to the outside world: the Command Processor (CP).


Barts architecture diagram (AMD)


We learnt how the CPU sends commands to the GPU, with state changes, instructions and data, but all this is encoded somehow and must be interpreted. The CP keeps track of states within the GPU, updates host-mapped registers and signals interrupts to inform the CPU.

In fact, CPs are (AFAIK) micro-processors embedded into the GPU, capable of doing most of the tasks traditionally handled in software by the driver. A CP contains internal memory and can do complex logic, arithmetic operations and so on. It is capable of managing multiple command buffers, keeping track of what is sent down into the GPU, and updating fences once the command stream has reached them.

Its first task is decoding the commands in order to feed other components. It is also responsible for reading and writing host memory and managing device memory. Managing states is a complex task, and in some cases, in order to maintain integrity, a partial pipeline flush is issued before proceeding. This is the worst case, as it can serialize everything. This is a world in itself and very vendor specific.


The Command Processor manages some dedicated fixed-function logic: a Vertex Assembler (VA), a Tessellator, a Geometry Assembler (GA), and a Rasterizer/Interpolator. These elements are responsible for feeding the processing cores with data; they talk to the Thread Scheduler (named GigaThread by NVIDIA) and issue computing tasks in blocks. The Fermi architecture seems to be a bit different, as these fixed-function logic blocks have started to become more and more of a bottleneck. NVIDIA has opted to duplicate some of this logic or rearrange it into their Stream Processors, which allows many concurrent operations.

Although I would like to focus these posts on High Performance Computing, a simple explanation of these setup blocks is interesting. For each vertex in a stream there is a set of associated attributes (like normals, binormals…) that need to be fetched (besides the vertex position) and assembled into a block before further processing. This is the primary task of the Vertex Assembler. As the attribute list grows, performance decreases, since more and more data needs to be fetched from memory before it can be processed.

The Tessellator is a highly programmable logic block which can perform patch tessellation and feed the pipeline (back) with new vertices and primitives. The Geometry Assembler fetches primitive information along with adjacency information and sends it all down for further processing inside a Geometry Shader. The Rasterizer emits fragments to be processed by Fragment Shaders. Interpolation used to be a monolithic operation done inside the rasterizer, but nowadays most of the work is done directly inside the Fragment Shader: with the fragment’s barycentric coordinates you can easily interpolate each defined attribute directly inside the shader (via patching the user shader).
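Concretely, given barycentric coordinates l0, l1, l2 (which sum to one), interpolating any per-vertex attribute at a fragment is just a weighted sum. A plain-Java sketch of what the patched shader effectively computes (names are illustrative):

```java
public class Barycentric {
    /**
     * Interpolates a per-vertex attribute (a0, a1, a2 at the triangle's
     * three vertices) at a fragment with barycentric coordinates l0, l1, l2.
     * Assumes l0 + l1 + l2 == 1.
     */
    public static float interpolate(float a0, float a1, float a2,
                                    float l0, float l1, float l2) {
        return a0 * l0 + a1 * l1 + a2 * l2;
    }
}
```

The same weighted sum is applied per component for vectors (normals, texture coordinates, colors), which is why the compiler can patch it into the user's fragment shader mechanically.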

All these setup blocks feed the Stream Processors which are the core of the GPUs. We will review how they work in a few days.

Understanding Modern GPUs (II): Drivers and Command Ring

This is the second post on Understanding Modern GPUs, where we will review the driver, the data flow to the GPU, and which modules are involved. In my previous post we talked about the software front end of the GPU; this post and the following ones are going to be more hardware related.


Pick any API of your choice, be it OpenGL, DirectX, OpenCL or any mix of these and other APIs (yes, you can mix OpenGL and OpenCL): they all implement much of their functionality in user space. For example, for Windows Vista/7 and the Windows Display Driver Model, you can find the following diagram very self-explanatory:

User space and kernel space display drivers model

Each application using a graphics API loads into its private process memory a set of driver/API functionality which is not shared with other processes. There you can find the command buffer, where all your API calls are transformed and stored sequentially before being transferred to the side of the driver residing in kernel mode. The per-process command buffer is vendor dependent and may contain the actual commands given to the GPU. This is also the point where shaders are compiled at runtime (in user space), and the driver may inject custom code (thereby patching the shader) to implement specific functionality that doesn’t translate directly to hardware. This part is very specific to the GPU being used.

With the GPU being a resource shared among different processes, there must be a mechanism to ensure not only that commands are executed in order, but also that the GPU can be used by all those processes without data corruption. This is done by the device driver executing in kernel mode. There you can find the scheduler and the final Command Ring Buffer that is used to talk to the GPU. At the other side of the Command Ring Buffer is the Command Processor (CP), which reads from the stream, decodes the commands and feeds the Thread/Stream Scheduler (we will talk about this in another post).

Simplified Ring Buffer (Command Buffer)

One might expect the kernel driver scheduler to read from each individual (per-process) command buffer and copy the commands into the DMA command buffer; in fact, that would be a waste of resources. Current GPUs contain a DMA controller and an MMU. The first lets the GPU talk directly to host RAM to fetch and write data without CPU intervention; the MMU virtualizes GPU/host memory and offers some memory protection. The Fermi architecture, for example, contains two DMA engines to allow two-way simultaneous transfers. Another possibility is transferring data between GPUs without CPU intervention. It is important to notice that those transfers are executed in parallel with the main command buffer, thus adding another level of parallelism to exploit.

Instead of transferring the commands directly, we send a command to the GPU to fetch data from a memory region using its DMA engines; this way we can create different command buffers, one for each user-space driver, and let the GPU fetch them.


Setting a Command Buffer in DirectX:

Basically, through the command buffer you set some states in the GPU, set it to fetch data and issue execution orders. In the old days user APIs had a big drawback: you had to specify your primitives through commands issued directly to the API, for example glBegin/glEnd. Those harmful calls have now been removed from OpenGL ES, for example, as they are performance killers on modern graphics cards. You can think of the CPU and the GPU as two threads that communicate through the Command Ring Buffer. It is a ring (FIFO) that is filled by the CPU and read by the GPU until it is drained. If the ring is empty (the write and read pointers are equal), the GPU stalls and waits until it has something to do; in this case you are probably CPU bound. If the CPU fills the entire buffer and has to wait for some free space, you are GPU bound.

Ring Buffer and its Control Structure. Source (AMD):

The above figure shows the Host and the Graphics Controller (Command Processor) connected through the Ring Buffer (RB). The RB is initialized with a fixed buffer size, and both the Write and Read Pointers are set to zero (empty buffer). The driver adds packets into the RB and updates the Write Pointer register inside the device. When the device reads packets, it updates the Read Pointer. Updating both pointers incurs some overhead, which can be mitigated by only updating these registers when a block of data has been consumed (by grouping packets into blocks) instead of doing so for each packet. This also requires more logic on both sides to avoid writing when the RB is full (more info here, although a bit out-dated).
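The pointer dance described above can be sketched as a toy single-producer/single-consumer ring, with plain Java threads standing in for the driver (producer) and the CP (consumer). All names here are illustrative; real hardware uses device registers, not Java fields:

```java
// Toy Command Ring Buffer: the "CPU" advances the write pointer,
// the "GPU" advances the read pointer. Holds size-1 packets at most.
public class CommandRing {
    private final int[] packets;
    private volatile int write = 0; // owned by the producer (driver/CPU)
    private volatile int read = 0;  // owned by the consumer (GPU/CP)

    public CommandRing(int size) { packets = new int[size]; }

    /** Returns false when the ring is full: the producer is "GPU bound". */
    public boolean put(int packet) {
        int next = (write + 1) % packets.length;
        if (next == read) return false; // full: CPU must wait for free space
        packets[write] = packet;
        write = next;                   // publish the new write pointer
        return true;
    }

    /** Returns -1 when the ring is empty: the consumer stalls ("CPU bound"). */
    public int take() {
        if (read == write) return -1;   // empty: GPU stalls
        int packet = packets[read];
        read = (read + 1) % packets.length;
        return packet;
    }
}
```

The full/empty tests are exactly the pointer comparisons from the paragraph above: equal pointers mean empty, and a write pointer one slot behind the read pointer means full.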

This command stream adds some other synchronization issues that must be taken care of. Imagine the following: you create a huge data array which is going to be processed, but once the GPU has finished fetching it from main memory, you would like to update it as soon as possible with new data. How does the CPU know that some commands have been processed so that it can update the array? Remember that the transfer works by pointing the GPU at the memory to fetch, while both the GPU and the CPU keep working in parallel. The solution is in fact very simple: there are command types embedded into the command stream called fences (I found this patent by VIA about the matter 😉). Those fences are read by the GPU, which updates a register so that the CPU knows how far along the stream execution has progressed.
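A fence can be modeled in the same toy fashion: commands carry increasing fence values, the "GPU" writes the last fence it passed into a register, and the "CPU" polls that register. Again, the class and method names below are my own illustration:

```java
// Toy fence: the consumer signals progress, the producer polls it.
public class Fence {
    private volatile long completed = 0; // "register" updated by the GPU side

    /** Called by the consumer when a fence command is reached in the stream. */
    public void signal(long fenceValue) {
        completed = fenceValue; // fence values are monotonically increasing
    }

    /** CPU-side check: has everything up to fenceValue been processed? */
    public boolean isDone(long fenceValue) {
        return completed >= fenceValue;
    }
}
```

Once `isDone` returns true for the fence placed after the fetch commands, the CPU knows it is safe to overwrite the array with new data.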

What happens to your OpenCL kernel or vertex shader up to this point? The kernel code was compiled to an intermediate language by the user-space driver (PTX for CUDA devices or AMD IL for AMD devices, for example). The code is then compiled for the specific hardware by the driver and passed to the GPU. Since some GPUs may lack functionality or need different steps to compute some function, the code must be targeted at the hardware actually present in the computer. For example, double-precision floating point operations may need additional passes to obtain the desired accuracy on hardware lacking a dedicated double-precision float processor. Some GPU architectures sacrifice IEEE compliance, while others lack double precision altogether.

Some of the concepts we have been talking about can be seen in the OpenCL API (remember I told you there was some kind of direct mapping between this API and how the hardware actually works, from a logical point of view). You create a command queue using clCreateCommandQueue, you enqueue a read from device memory using clEnqueueReadBuffer, you run a computing kernel using clEnqueueNDRangeKernel, and so on.

In the next episode, we will talk about the Command Processor and some setup logic. Stay tuned!