Tuesday, April 14, 2015

The OpenGL Impedance Mismatch

As graphics hardware has changed from a fixed-function graphics pipeline to a general-purpose parallel computing architecture, mid-level graphics APIs like OpenGL don't fit the execution model of the actual hardware as well as they used to.

In my previous post, I said that the execution of GL state change is deferred so that the driver can figure out what you're really trying to do and efficiently change all state at once.

This has been true for a while. For example, older fixed-function and partly programmable GPUs might have one set of register state controlling all of the fixed-function raster operations.  Here's the R300 (the GPU in the Radeon 9700); a sketch of what this register packing means for the driver follows the list.

  • The blend function and sources share a single register, but
  • The alpha and RGB blend function/sources are in different registers (meaning a single glBlendFuncSeparate call partially updates both).
  • Alpha-blend enable shares a register with the flag to separate the blender functions. (Why the hardware doesn't just always run separate and let the driver update both sides of the blender is a mystery to me.)
  • Some GL state actually matches the register (e.g. the clear color is its own register).
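To make that concrete, here is a minimal sketch of what a driver might do when glBlendFuncSeparate comes in on hardware with this kind of register packing.  The register names, bit layout, and helper functions are invented for illustration and do not match the real R300 documentation:
#include <stdint.h>

/* Hypothetical packed registers: RGB blend function/sources in one, alpha
   blend function/sources in the other; one of them also carries unrelated
   flags such as the blend enable. */
enum { REG_RGB_BLEND, REG_ALPHA_BLEND, REG_COUNT };

typedef struct { uint32_t shadow[REG_COUNT]; } hw_state;

static void write_reg(hw_state * hw, int reg, uint32_t value)
{
   hw->shadow[reg] = value;   /* a real driver would also emit this to the command stream */
}

static void set_blend_func_separate(hw_state * hw,
                                    uint32_t src_rgb, uint32_t dst_rgb,
                                    uint32_t src_a,   uint32_t dst_a)
{
   /* One GL call means read-modify-writing two registers, because each
      register also holds state that this call does not own. */
   uint32_t rgb   = hw->shadow[REG_RGB_BLEND];
   uint32_t alpha = hw->shadow[REG_ALPHA_BLEND];

   rgb   = (rgb   & ~0x0000FFFFu) | (src_rgb & 0xFF) | ((dst_rgb & 0xFF) << 8);
   alpha = (alpha & ~0x0000FFFFu) | (src_a   & 0xFF) | ((dst_a   & 0xFF) << 8);

   write_reg(hw, REG_RGB_BLEND,   rgb);
   write_reg(hw, REG_ALPHA_BLEND, alpha);
}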
So the match-up between the imaginary ideal GL pipeline and the hardware isn't perfect. But in the end, the fit is actually pretty good:

  • Fixed function tricks like blending and stenciling are enabled by setting registers on the GPU.
  • Uniforms for a given shader live on the chip while the shader is executing.
  • The vertex fetcher is fixed functionality that is set up by register.
There's a lot written about AMD's Graphics Core Next (GCN) architecture, the GPU inside the Radeon 7900 and friends.  Since GCN GPUs are in both the Xbox One and PlayStation 4, and AMD is reasonably forthcoming with chip documentation and shader disassembly, we know a lot about how the hardware really works.  And the fit...is not so snug.

  • Shader constants come from memory (this has been true for a while now) - this is a good fit for UBOs but a bad fit for "loose uniforms" that are tied to the shader object; see the snippet after this list.  On the GPU, the shader object and uniforms are fully separable.
  • Vertex fetch is entirely in the shader - the driver writes a pre-amble for you.  Thus changing the vertex alignment format (but not the base address) is a shader edit!  Ouch.
  • For shaders that write to multiple render targets, OpenGL lets us remap them via glDrawBuffers, but this export mapping is part of the fragment shader, so that's a shader edit too.
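To see the difference, compare how the two kinds of constants reach the shader.  This is a minimal sketch - my_program, my_ubo, and the u_tint uniform are stand-ins assumed for the example:
/* A "loose" uniform belongs to the linked program object - the driver decides
   where it actually ends up: */
glUseProgram(my_program);
glUniform4f(glGetUniformLocation(my_program, "u_tint"), 1.0f, 0.0f, 0.0f, 1.0f);
/* A UBO is just a buffer bound to an indexed binding point - a much closer
   match for "constants come from memory": */
glBindBufferBase(GL_UNIFORM_BUFFER, 0, my_ubo);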
Those shader edits are particularly scary - this is a case where we (the app) think we're doing something orthogonal to the shading pipeline (e.g. just setting up a new VBO) but in practice, we're getting a full shader change.

In fact, the impedance mismatch makes this even worse: if we're going to have any hope of changing state quickly, the driver has to track past combinations of vertex layout, MRT indirection, and the actual linked GLSL program, and cache the "real" shader that backs each combined state.  Each time we change the front-end vertex fetch format or the back-end MRT layout, the driver has to go see if that combination already exists in the cache.
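Here's a minimal sketch of what that bookkeeping might look like inside the driver - every type and function name below is hypothetical:
typedef struct {
   uint64_t   program_id;          /* the linked GLSL program             */
   uint64_t   vertex_format_hash;  /* attrib types/sizes/offsets/strides  */
   uint64_t   mrt_mapping_hash;    /* the glDrawBuffers remap table       */
} variant_key;

hw_shader * get_hw_shader_for_draw(context * c)
{
   variant_key key = make_variant_key(c);
   hw_shader * s = variant_cache_find(&c->variants, &key);
   if(!s)
   {
      /* cache miss: patch the fetch pre-amble and MRT export mapping into
         the program and recompile - this is the expensive path */
      s = build_hw_shader(c, &key);
      variant_cache_add(&c->variants, &key, s);
   }
   return s;
}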

The back-end MRT layout isn't the worst problem, because we are hopefully not changing render targets that frequently.  But the vertex format is a real mess; every call to glVertexAttribPointer potentially invalidates the vertex layout, so the driver can either try to exhaustively check whether the state really changed or regenerate the shader front end; both options stink.

You can see OpenGL trying to track the moving target of the hardware in the extensions: GL_ARB_vertex_array_object was made part of core OpenGL 3.0 and ties up the entire vertex fetch plus base pointer in a single "object" for quick recall.  But we can see that this is now a pretty poor fit; half of the state that the VAO covers (the layout) is really part of the shader, while the other half (the actual address of the VBO plus offset) is separate.*
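For reference, here's what one VAO captures - both halves at once.  This is a sketch; my_vbo and count are assumed to exist:
GLuint vao;
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, my_vbo);
glEnableVertexAttribArray(0);
glEnableVertexAttribArray(1);
/* each of these captures the layout *and* the currently bound buffer into the VAO */
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 32, (char *) 0);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 32, (char *) 12);
glBindVertexArray(0);
/* at draw time, one bind recalls the whole bundle */
glBindVertexArray(vao);
glDrawArrays(GL_TRIANGLES, 0, count);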

A newer extension, GL_ARB_vertex_attrib_binding, separates the vertex format (which is part of the shader in hardware) from the actual data location; it was made part of OpenGL 4.3. I don't know how good a fit this is; the vertex attribute binding leaves the data stride out of the "expensive" format binding.  (My guess is that the intended implementation is to specify the data stride as a constant in a constant buffer somewhere.) In theory, with this extension only glVertexAttribFormat requires an expensive shader patch, and applications can change VBO sources without calling it.
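In use, the split looks roughly like this (a sketch; my_vbo_a and my_vbo_b are assumed to exist):
/* Set the layout once - on GCN-class hardware this is the part that may cost
   a shader patch: */
glVertexAttribFormat(0, 3, GL_FLOAT, GL_FALSE, 0);    /* position at offset 0 */
glVertexAttribFormat(1, 3, GL_FLOAT, GL_FALSE, 12);   /* normal at offset 12  */
glVertexAttribFormat(2, 2, GL_FLOAT, GL_FALSE, 24);   /* UV at offset 24      */
glVertexAttribBinding(0, 0);
glVertexAttribBinding(1, 0);
glVertexAttribBinding(2, 0);
/* Re-point the data as often as you like - note that the stride travels with
   the buffer binding, not with the format: */
glBindVertexBuffer(0, my_vbo_a, 0, 32);
/* ...draw... */
glBindVertexBuffer(0, my_vbo_b, 0, 32);
/* ...draw... */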

If there's an executive summary here, it's that OpenGL as an API has never been a perfect representation of what the hardware is doing, but as the hardware moves toward general purpose compute devices that work on buffers of memory, the pipeline-and-state model fits less and less.

In my next posts I'll take a look at Metal and Mantle - these new APIs let us take the red pill and see how deep the rabbit hole goes.


* I am of the opinion that VAOs were a mistake from day one.  VAOs are mutable to allow them to be 'layered' on top of existing code the way VBOs were, and even if they weren't, the data location of the VBO is mutable at the driver level (because at draw time the VBO may live in VRAM or in system memory, and drawing from it may require a change to the memory map of CPU memory that the GPU holds, or a DMA copy to move the data where the GPU can reach it).  The result is that binding a VAO doesn't let you skip the tons of validation and synchronization needed to actually start drawing once the base pointers have been moved.

OpenGL State Change Is Deferred

This is totally obvious to developers who have been coding high performance OpenGL for years, but it might not be obvious to newer developers starting with OpenGL or OpenGL ES, so...

In pretty much any production OpenGL driver, the real 'work' of OpenGL state change is deferred - that work is executed on the next draw call (e.g. glDrawElements or glDrawArrays).

This is why, when you profile your code, glBindBuffer and glVertexPointer appear to be really fast, and yet glDrawArrays is using a ton of CPU time.

The work of setting up the hardware for GL state is deferred because often the state cannot be set up until multiple calls come in.

Let's take vertex format as an example.  You do this:
glBindBuffer(GL_ARRAY_BUFFER, my_buffer);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 32, (char *) 0);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 32, (char *) 12);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 32, (char *) 24);
The way this is implemented on modern GPUs is to generate a subroutine or pre-amble code for your vertex shader that "executes" the vertex fetch based on these stride rules.


There's no point in generating the shader until all of the vertex format is known; if the driver went and patched your shader after the first call (setting up attribute 0) using the old state of attributes 1 and 2, all of that work is wasted and would be redone when the next two glVertexAttribPointer calls come in.

Furthermore, the driver doesn't know when you're done.  There is no glDoneScrewingAroundWithVertexAttribPointer call.

So the driver does the next best thing - it waits for a draw call.  At that point it can say "hey, I know you are done changing state, because this draw call uses whatever is set right now," and it makes any state changes that are needed since the last draw call.

What this means is that you can't tell how "expensive" your state change is by profiling the code doing the state change.  The cost of a state change at the time you call it is just the cost of recording what needs to be done later, e.g.
void glBlendFunc(GLenum sfactor, GLenum dfactor)
{
   context * c = internal_get_thread_gl_context();
   c->blend.sfactor = sfactor;
   c->blend.dfactor = dfactor;
   c->dirty_bits |= bit_blend_mode;
}
In other words, the driver is just going to record what you said to the current context and make a note that we're "out of sync" state-wise.  The draw call does the heavy lifting:
void glDrawArrays(GLenum mode, GLint first, GLsizei count)
{
   context * c = internal_get_thread_gl_context();
   if(c->dirty_bits & bit_blend_mode)
   {
     /* this is possibly slow */
     sync_blend_mode_with_hardware(&c->blend);
   }
   /* more check and sync */
   c->dirty_bits = 0;
   /* do actual drawing work - this isn't too slow */
}
On Apple's OpenGL implementation, the stack is broken into multiple parts in multiple dylibs, which means an Instruments trace often shows you subroutines with semi-readable names; you can see draw calls updating and synchronizing state.  On Windows the GL stack is monolithic, stripped, and often has no back-trace info, which makes it hard to tell where the CPU is spending time.

One final note: the GL driver isn't trying to save you from your own stupidity.  If you do this:
for(int i = 0; i < 1000; ++i)
{
   glEnable(GL_BLEND);
   glDrawArrays(GL_TRIANGLES, i*12, 12);
}
then every call to glEnable is likely to mark the blend state 'dirty', and every call to glDrawArrays is going to spend time re-syncing the blend state on the hardware.

Avoid making state changes that aren't needed, even if they look cheap in their individual function-call time - they may be 'dirtying' your context and driving up the cost of your draw calls.
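One cheap mitigation is to shadow the state on the application side and skip the redundant calls.  This is a minimal sketch, and it assumes nothing else toggles GL_BLEND behind your back:
static GLboolean s_blend_enabled = GL_FALSE;

static void set_blend(GLboolean enable)
{
   if(enable == s_blend_enabled)
      return;                 /* already in the right state - don't dirty the context */
   if(enable) glEnable(GL_BLEND);
   else       glDisable(GL_BLEND);
   s_blend_enabled = enable;
}
With that filter in place, the loop above only dirties the blend state on the first iteration.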