BTW I read a lot of papers about optimizing OpenGL over the last few days. I see more optimization potential in the following areas (most important first):
1. We thrash render states (glEnable() ... glDisable()) too much. Some render state changes are guaranteed to trigger a pipeline validation on the next draw call, which takes several hundred GPU cycles. I'm especially speaking about GL_DEPTH_TEST and GL_STENCIL_TEST. GL_BLEND may be relatively costly too but I didn't benchmark that one.
Best practice is to organize rendering around the states. Example:
- turn on depth test
- turn on culling
- render all solid objects
- turn on blending
- render all transparent objects
- turn off blending
- turn off culling
- turn off depth test
2. Our calls to glPushAttrib() and glPopAttrib() have to go away. glPopAttrib() always triggers a validation, and it is called every time we draw text.
3. Identifying and eliminating application-side bottlenecks so Scourge can execute the rendering loop more often.
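Points 1 and 2 can be illustrated with a small sketch. It assumes a software "shadow" of the enable flags that replaces glPushAttrib()/glPopAttrib(); the names `StateCache`, `ScopedCap`, `drawLabelsPerLabel` and `drawLabelsBatched` are hypothetical and not taken from the Scourge code. No real GL calls are made: a counter stands in for the glEnable()/glDisable() calls that would reach the driver.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the GL capabilities discussed above.
enum Cap { DEPTH_TEST, STENCIL_TEST, CULL_FACE, BLEND, CAP_COUNT };

// Shadows the enable flags in software so redundant changes are dropped
// before they reach the driver.
class StateCache {
public:
    void set(Cap cap, bool on) {
        assert(cap < CAP_COUNT);
        if (enabled_[cap] == on) return;  // redundant change: nothing to send
        enabled_[cap] = on;
        ++driverCalls_;  // real code would call glEnable()/glDisable() here
    }
    bool get(Cap cap) const { return enabled_[cap]; }
    std::size_t calls() const { return driverCalls_; }

private:
    bool enabled_[CAP_COUNT] = {};  // these four capabilities start disabled in GL
    std::size_t driverCalls_ = 0;
};

// Scoped save/restore of ONE capability, as a narrow replacement for a
// glPushAttrib()/glPopAttrib() pair.
class ScopedCap {
public:
    ScopedCap(StateCache &cache, Cap cap, bool on)
        : cache_(cache), cap_(cap), previous_(cache.get(cap)) {
        cache_.set(cap_, on);
    }
    ~ScopedCap() { cache_.set(cap_, previous_); }

private:
    StateCache &cache_;
    Cap cap_;
    bool previous_;
};

// Text drawing that saves and restores state around every label.
std::size_t drawLabelsPerLabel(StateCache &cache, int count) {
    const std::size_t before = cache.calls();
    for (int i = 0; i < count; ++i) {
        ScopedCap blend(cache, BLEND, true);
        ScopedCap depth(cache, DEPTH_TEST, false);
        // ... emit the glyph quads for label i here ...
    }
    return cache.calls() - before;
}

// Same labels, but the state is set once around the whole batch.
std::size_t drawLabelsBatched(StateCache &cache, int count) {
    const std::size_t before = cache.calls();
    {
        ScopedCap blend(cache, BLEND, true);
        ScopedCap depth(cache, DEPTH_TEST, false);
        for (int i = 0; i < count; ++i) {
            // ... emit the glyph quads for label i here ...
        }
    }
    return cache.calls() - before;
}
```

With depth testing enabled beforehand, the per-label variant issues four state changes per label, while the batched variant issues four in total regardless of the label count. Whether text drawing in Scourge can actually be batched like this depends on how the GUI code is structured.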
A few posts later he outlined a rough plan:
- Potential gains by optimization
- The OpenArena benchmark runs at 18-20 fps on an Intel 915GM using the 2.4 X.Org driver. This game is based on the Quake 3 source and I know about a few special optimizations in its engine.
- On the same chipset and driver (and at significantly lower resolution) Scourge gets 14-15 fps indoors, around 8 fps outdoors without trees, and around 4-5 fps in cities. I expect to gain another 30-50% in frame rate by optimizing on the OpenGL side, and even more in some special cases.
- Rough plan
- Start when the new outdoor engine is in good shape and we have indoors again.
- Streamline the render loop.
- First remove state thrashing, starting with the GUI code. Move expensive state changes as far up in the code hierarchy as possible to reduce the number of calls. Ideally, stuff like GL_DEPTH_TEST and GL_CULL_FACE would only be toggled "globally" in ScourgeView::drawView() between render passes. Might need some refactoring of the renderers so individual passes can be called from outside.
- Expensive state changes: depth test, stencil test, culling, blending. Additionally, blending always halves the fill rate even with nothing to blend, because of one additional buffer read per fragment. Less expensive: texturing and immediate mode (glBegin() ... glEnd()). Essentially free: alpha test, scissoring, glColor(), glTexCoord(), glVertex().
- Speed up depth buffer.
- It might be worth considering an initial depth-only pass so most or all subsequent passes can be drawn without the depth test. This would work best if we had some means to depth-sort map objects in software. Need to investigate that.
- Optimize throughput, part 1
- Need to check whether the CPU is the main bottleneck at that stage. If yes, optimize the application side first. Talk with kotk about it.
- If not CPU bottlenecked, consolidate OpenGL calls to send out data in larger bursts. Find some method to generate big triangle strips (more than 2 triangles) without much fuss.
- Continue fixing implicit typecasts in OpenGL calls, but prefer integer/short versions where float is not necessary. Older GPUs are relatively slow with floats as they have to convert them into a fixed-point internal format.
- Optimize throughput, part 2
- Eradicate typecasts. Avoid math in loops. Simplify algorithms. Put heavier CPU processing directly after OpenGL calls to maximize parallel execution. Do everything to make the render loop execute more often.
- At this stage, Scourge will be CPU bottlenecked so we can take advantage of multicore.
- Note: OpenGL is NOT threadsafe because of its asynchronous nature. There is no way to tell when a command has actually finished on the GPU. Contrary to common opinion, you also can't count on a predictable pipeline state after swapping buffers. The only way to definitely flush the pipeline is glFinish(), which kills performance and does not belong in non-benchmark code. Thus, OpenGL has to remain in the main thread but we can "outsource" some of the other code.
- Good outsourcing candidates are region loading/generation in the background and handling of mouse coordinates so the pointer stays responsive at slow frame rates.
- More eye candy
- Depending on the amount of CPU bottleneck, we can then increase scene complexity without a performance hit. A further small optimization might also be to make better use of multitexturing as it is much faster than blending.
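For the "big triangle strips" item in the plan above, one common approach for grid-shaped geometry (terrain tiles, for example) is to index the whole grid as a single strip and stitch the rows together with repeated indices, which produce zero-area triangles the GPU discards. The sketch below assumes a row-major vertex grid; the function name `buildGridStrip` is hypothetical, and nothing here is taken from the Scourge code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds the index list for ONE triangle strip covering a grid of
// cols x rows vertices stored row-major (vertex index = row * cols + col).
// Consecutive bands are joined by repeating two indices, which yields
// degenerate (zero-area) triangles and keeps the winding order intact.
std::vector<std::uint16_t> buildGridStrip(int cols, int rows) {
    assert(cols >= 2 && rows >= 2);
    assert(cols * rows <= 65536);  // indices must fit into 16 bits

    auto at = [cols](int row, int col) {
        return static_cast<std::uint16_t>(row * cols + col);
    };

    std::vector<std::uint16_t> indices;
    for (int r = 0; r + 1 < rows; ++r) {
        if (r > 0) {
            indices.push_back(at(r, 0));  // repeat first index of this band
        }
        for (int c = 0; c < cols; ++c) {
            indices.push_back(at(r, c));      // vertex on the upper row
            indices.push_back(at(r + 1, c));  // vertex on the lower row
        }
        if (r + 2 < rows) {
            indices.push_back(at(r + 1, cols - 1));  // repeat last index
        }
    }
    return indices;
}
```

The result could be submitted in a single call such as `glDrawElements(GL_TRIANGLE_STRIP, count, GL_UNSIGNED_SHORT, data)` instead of one small strip per row. The 16-bit index type follows the plan's preference for short over wider types where the range allows it.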
Eihrul, former lead developer of Sauerbraten, hung out for a few weeks on #scourge and shared his thoughts:
- The render engine is currently full of hooks into other parts of the code which makes it very hard to reorganize the render sequence.
- A speedup might be achieved by switching from display lists to vertex arrays or VBOs (where available), tagging the objects to draw by the render states they require, and then drawing the scene one tag after the other. This minimizes render state changes.
- For higher-class hardware, vertex/fragment programs could be faster. This isn't hard to do and doesn't require an additional render path, as the driver will simply skip the fixed pipeline if FPs/VPs are enabled, and vice versa.
- Switching to .md5 as the model format of choice will put less load on the vertex stage and conserve graphics memory, as animation is done via simple transforms and no interpolation vertices need to be calculated. Problem: No free .md5 models and modeling tools are available.
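The tagging idea from the list above can be sketched as follows. This is an illustration under assumptions, not the engine's design: the `DrawItem` struct, the bit layout of the tag and the helper names are hypothetical. Each object carries a bitmask of the render states it needs, the draw list is sorted by that mask, and state only has to change where the mask changes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical render-state bits; an object's tag is the OR of what it needs.
enum StateBits : std::uint8_t { DEPTH = 1, CULL = 2, BLEND = 4 };

struct DrawItem {
    std::uint8_t tag;  // required render states
    int objectId;      // placeholder for the geometry to draw
};

// Counts how often the required state differs between consecutive items,
// i.e. how many state switches a renderer walking the list would perform.
std::size_t countSwitches(const std::vector<DrawItem> &items) {
    std::size_t switches = 0;
    for (std::size_t i = 1; i < items.size(); ++i) {
        if (items[i].tag != items[i - 1].tag) ++switches;
    }
    return switches;
}

// Groups items by tag. stable_sort keeps the submission order inside each
// group, which matters if transparent objects were submitted back to front.
void sortByTag(std::vector<DrawItem> &items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem &a, const DrawItem &b) {
                         return a.tag < b.tag;
                     });
}
```

With solid objects tagged `DEPTH | CULL` and transparent ones tagged `DEPTH | CULL | BLEND`, sorting in ascending tag order puts all solid objects before all transparent ones, so blending is switched on only once per frame.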
Lordtoran has verified that the render engine does indeed hook into the rest of the code all over the place. Reorganizing the render loop and making the engine portable or even replaceable requires removing most of these hooks, which is far too much work for a single developer, so the project would have to focus exclusively on this issue for a while. As a result, the code would be cleaner, easy to optimize and (with a little effort) very, very fast. For low-end devices the engine could be replaced with a 2D one that uses prerendered imagery from the 3D version.