Sunday, February 28, 2010

One More On VBOs - glBufferSubData

So if you survived the timing of VBO updates (or rather, my speculations on what is possible with VBO updates), now you're in a position to ask the question: how fast might glBufferSubData be? In particular, developers like myself are often astonished when glBufferSubData does things like block.

In a world before manual synchronization of VBOs (via the 3.0 buffer management APIs or Apple's buffer range extensions), we can now see why a sub-data update on a streamed VBO might perform quite badly.

The naive code goes something like this:
  1. Fill half the buffer with buffer sub-data.
  2. Issue a draw call to that half of the buffer.
  3. Flip which half of the buffer we are using and go back to step 1.
In other words, double buffering by dividing the buffer in half, or treating it like a ring buffer.
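To make this concrete, here's a minimal sketch of that naive scheme in C-style OpenGL code. The names (vbo, half_size, verts, vcount) are placeholders of my own, not from any real code base:

static int cur_half = 0;

// Assumes the VBO was created once with:
//   glBufferData(GL_ARRAY_BUFFER, 2 * half_size, NULL, GL_STREAM_DRAW);
void draw_frame(GLuint vbo, GLsizeiptr half_size, const GLfloat * verts, GLsizei vcount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    GLintptr offset = cur_half * half_size;

    // Step 1: fill our half. This is the call that can block on pending draws.
    glBufferSubData(GL_ARRAY_BUFFER, offset, half_size, verts);

    // Step 2: draw from that half.
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *) offset);
    glDrawArrays(GL_TRIANGLES, 0, vcount);

    // Step 3: flip halves and do it again next frame.
    cur_half = 1 - cur_half;
}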

This implementation is going to perform terribly. The sub-data call is going to block until the previous draw call has completed, even though they use opposite halves of the buffer, and we'll lose all of our concurrency. Let's see if we can understand why.

If we go to respecify a VBO in AGP memory using glBufferSubData while that VBO is still being read by a pending draw, glBufferSubData must block; it can't rewrite the buffer until the last draw finishes, because the draw would see the new vertices, not the old, or maybe half and half. In order for the "fill" to complete, the driver would have to be able to determine that the pending draws and the new fill are completely disjoint.

There are two reasons why the driver might not be able to figure this out:
  1. You've drawn using glDrawElements, and thus the actual part of the vertex VBO you draw from is determined by the index table. The cost of figuring out the "extent" of this draw is to process all of the indices. The cure is worse than the disease. Any sane driver is going to simply assume that any part of the VBO could be used.
  2. Let's assume you use glDrawRangeElements to tell the driver that you're really only going to use half the VBO. Even then, the bookkeeping to mark "locked" regions would not be simple - a series of draws over overlapping regions would require a fairly involved data structure. For this one special case, you're asking the driver writers to replace a simple time-stamp based lock (e.g. this VBO is locked until this many commands have executed) with a dynamic range-marking structure. If I were a driver writer I'd say "let's keep it simple and not eat this cost on all VBOs."
I think it's safe to assume that some implementations (and all if you use glDrawElements) are simply going to mark the entire VBO as in use until the draw happens, and thus the partial rewrite is going to block as if there was a conflict, even if there was not.

Can we do anything about this? Besides falling back to an "orphaned" approach where we get a fresh buffer each time, our alternative is to use the more exact APIs from ARB_map_buffer_range or APPLE_flush_buffer_range. With these APIs we can map only the part of the VBO we know is not in use, with the unsynchronized bit set so we don't block on the other half's pending draws. We can then use the explicit-flush option to flush only the areas we modified. (With the 3.0 APIs we can also use the invalidate-range option to simply say "we are rewriting what we map.")
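Here's a sketch of what that more exact path might look like with the 3.0 API (vbo, offset, half_size, and verts are again placeholders; with APPLE_flush_buffer_range the calls differ but the idea is the same):

// Fill one half of a ring-buffer VBO without synchronizing against the other half.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
void * ptr = glMapBufferRange(GL_ARRAY_BUFFER, offset, half_size,
        GL_MAP_WRITE_BIT |
        GL_MAP_UNSYNCHRONIZED_BIT |     // don't block on the other half's pending draws
        GL_MAP_INVALIDATE_RANGE_BIT |   // we are rewriting everything we map
        GL_MAP_FLUSH_EXPLICIT_BIT);     // we promise to say exactly what we touched
if (ptr)
{
    memcpy(ptr, verts, half_size);
    // Flush offsets are relative to the start of the mapped range.
    glFlushMappedBufferRange(GL_ARRAY_BUFFER, 0, half_size);
    glUnmapBuffer(GL_ARRAY_BUFFER);
}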

Of course, this technique isn't without peril - all synchronization is up to the client. The main danger is an over-run: your app is so fast that it needs to modify a range that the GL isn't done with - we made it all the way around our ring buffer. Probably the safest way to cope with this is to put explicit fences in place to wait until the last dependent draw call that we issued is finished.
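One possible way to do that - assuming ARB_sync / GL 3.2 style fences; the Apple and NV fence extensions work similarly - is sketched below. The fence array is my own illustrative bookkeeping, one fence per ring-buffer half:

GLsync fence[2] = { 0, 0 };

// After issuing the draw that reads from half 'i':
fence[i] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// Before filling half 'i' again, make sure its last draw is really done:
if (fence[i])
{
    glClientWaitSync(fence[i], GL_SYNC_FLUSH_COMMANDS_BIT, GL_TIMEOUT_IGNORED);
    glDeleteSync(fence[i]);
    fence[i] = 0;
}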

Double-Buffering Part 2 - Why AGP Might Be Your Friend

In my previous post I suggested that to get high VBO vertex performance in OpenGL, it's important to decouple pushing the next set of vertices from the GPU processing the existing ones. A naively written program will block when sending the next set of vertices until the last one goes down the pipe, but if we're clever and either orphan the buffer or use the right flags, we can avoid the block.

(My understanding is that orphaning actually gets you a second buffer, in the case where you want to double the entire buffer. With manual synchronization we can simply be very careful and use half the buffer each frame. Very careful.)

Now I'm normally a big fan of geometry in VRAM because it is, to put it the Boston way, "wicked fast". And perhaps it's my multimedia background popping up, but to me a nice GPU-driven DMA seems like the best way to get data to the card. So I've been trying to wrap my head around the question: why not double-buffer into VRAM? This analysis is going to get highly speculative - the true answer, I think, is "the devil is in the details, and the details are in the driver" - but at least we'll see that the issue is very complex and that double-buffering into VRAM has a lot of things that could go wrong, so we should not be surprised if, when we tell OpenGL that we intend to stream our data, it gives us AGP memory instead.*

Before we look at the timing properties of an application using AGP memory or VRAM, let's consider how modern OpenGL implementations work: they "run behind". By this I mean: you ask OpenGL to draw something, and some time later OpenGL actually gets around to doing it. How much behind? Quite possibly a lot. The card can run behind at least an entire frame, depending on implementation, maybe two. You can keep telling the GPU to do more stuff until:
  1. You hit some implementation defined limit (e.g. you get 2 full frames ahead and the GPU says "enough!"). Your app blocks in the swap-backbuffer windowing system call.
  2. You run out of memory to build up that outstanding "todo" list. (Your app blocks inside the GL driver waiting for command buffers - the memory used to build the todo list.)
  3. You ask OpenGL about something it did, but it hasn't done it yet. (E.g. you try to read an occlusion query that hasn't finished and block in the "get" call.)
  4. You ask to take a lock on a resource that is still pending for draw. (E.g. you do a glMapBuffer on a non-orphaned VBO with outstanding draws, and you haven't disabled sync with one of the previously mentioned extensions.)
There may be others, but I haven't run into them yet.

Having OpenGL "run behind" is a good thing for your application's performance. You can think of your application and the GPU as a reader-writer problem. In multimedia, our top concern would be underruns - if we don't "feed the beast" enough audio by a deadline, the user hears the audio stop and calls tech support to complain that their expensive ProTools rig is a piece of junk. With an OpenGL app, underruns (the GPU got bored) and overruns (the app can't submit more data) aren't fatal, but they do mean that one of your two resources (GPU and CPU) is not being fully used. The longer the length of the FIFO (that is, the more OpenGL can run behind without an overrun) the more flexibility we have to let the speed of the CPU (issuing commands) and the GPU (running the commands) be mismatched for short periods of time.

An example: the first thing you do is draw a planet - it's one VBO, the app can issue the command in just one call. Very fast! But the planet has an expensive shader, uses a ton of texture memory, and fills the entire screen. That command is going to take a little time for the GPU to finish. The GPU is now "behind." Next you go to draw the houses. The houses sit in a data structure that has to be traversed to figure out which houses are actually in view. This takes some CPU time, and thus it takes a while to push those commands to the GPU. If the GPU is still working on the planet, then by the time the GPU finishes the planet, the draw-house commands are ready, and the GPU moves seamlessly from one task to the other without ever going idle.

So we know we want the GPU to be able to run behind and we don't want to wait for it to be done. How well does this work with the previous post's double-buffering scheme? It works pretty well. Each draw has two parts: a "fill" operation done on the CPU (map orphaned buffer, write into AGP memory, unmap) and a later "draw" operation on the GPU. Each one requires a lock on the buffer actually being used. If we can have two buffers underneath our VBO (some implementations may allow more - I don't know) then:
  • The fill operation on frame 3 will wait for the draw operation on frame 1.
  • The fill operation on frame 4 will wait for the draw operation on frame 2.
  • The draw operation on frame N always waits for the fill operation (of course).
This means we can issue up to two full frames of vertices. On the third frame (if frame one is still not finished) only then might we block. That's good enough for me.

If the buffer is going to be drawn from VRAM, things get trickier. We now have three steps:
  • "fill" the system RAM copy. Fill 2 waits on DMA 1.
  • "DMA" the copy from system RAM to VRAM. DMA 2 waits on fill 2 and draw 1.
  • "draw" the copy from VRAM. Draw 1 waits on DMA 1.
Now we can start to see why the timing might be worse if our data is copied to VRAM. That DMA transfer is going to have to happen after the last draw (so the VRAM buffer is available) and before the next fill (because we can't fill until the data has been safely copied). It is "sandwiched" and it makes our timing a lot tighter.

Consider the case where the DMA happens right after we finish filling the buffer. In this case, the DMA is going to block on the last draw not completing - we can't specify frame 2 until frame 1 draw is mostly done. That's bad.

What about the case where the DMA happens really late, right before the draw really happens? Filling the buffer for frame 2 is going to block on taking a lock until the frame 1 DMA completes. That's bad too!

I believe that there is a timing that isn't as bad as these cases though: if the OpenGL driver can schedule the DMA as early as possible once the card is done with the last draw, the DMA ends up with timing somewhere in between these two cases, moving around depending on the actual relationship between GPU and CPU speed.

At a minimum I'd summarize the problem like this: since the DMA requires both of our buffers (VRAM and system) to be available at the same time, the DMA has to be timed just right to keep from blocking the CPU. By comparison, a double-buffered AGP strategy simply requires locking the buffers.

To complete this very drawn-out discussion: why would we even want to stream out of VRAM? As was correctly pointed out on the OpenGL list, this strategy requires an extra copy of the data - our app writes it, the DMA engine copies it, then the GPU reads it. (With AGP, the GPU reads what we write.) The most compelling case that I could think of, the one that got me thinking about this, is the case where the streaming ratio isn't 1:1. We specify our data per frame, but we make multiple rendering passes per frame. Thus we draw our VBO perhaps 2 or 3 times for each rewrite of the vertices, and we'd like to use the bus only once. A number of common algorithms (environment mapping, shadow mapping, early Z-fill) all run over the scene graph multiple times, often with the assumption that geometry is cheap (which mostly it is).

But this whole post has been pretty much entirely speculative. All we can do is clearly signal our intentions to the driver (are we a static, stream, or dynamic draw VBO) and orphan our buffers and hope the driver can find a way to keep giving us buffers rapidly without blocking, while getting our geometry up as fast as possible.

* We might want to assume this and then be careful about how we write our buffer-fill code so that it is efficient in uncached write-combined memory: we want to fill the buffer linearly in big writes and not read or muck around with it.

Wednesday, February 24, 2010

Double-Buffering VBOs

One of the tricky aspects of the OpenGL API is that it specifies what an implementation will do, but it doesn't specify how fast it will do it. Plenty of forum posts are dedicated to OpenGL applications developers trying to figure out what the "fast path" is (e.g. what brew of calls will make it through the implementation in the least amount of time). ATI and NVidia, for their part, drop hints in a number of places as to what might be fast, but sadly they don't have enough engineers to simply teach every one of us, one on one, how to make our apps less atrocious.

One more bit of background: I don't know squat about Direct3D. I have never worked on Direct3D applications code, I have never used the API, and I couldn't even list all of the classes. I only became aware of D3D's locking APIs recently when I found some comparisons between OGL and D3D when it comes to buffer management. So whatever I say about D3D, just assume it's wrong in subtle ways that are important but hard to detect.

If you only want to draw a mesh, but never change it, life is easy.
  1. Create a static-draw VBO.
  2. Fill it with geometric goodness with glMapBuffer or glBufferData.
  3. Draw it many times.
  4. Hilarity ensues.
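In code, the happy static case might look something like this (vbo, verts, and vertex_count are placeholder names of mine):

GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
// One-time fill; GL_STATIC_DRAW tells the driver we won't be touching it again.
glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW);
// Then draw it as many times as you like.
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *) 0);
glDrawArrays(GL_TRIANGLES, 0, vertex_count);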
Things become more tricky if your VBO has to change per frame. First there's the obvious cost: you're going to burn some host-to-graphics-card bandwidth, because the new geometry has to go to the card every frame. So you do some math and realize that PCIe buses are really quite fast and this is a non-issue. Yet the actual performance isn't that fast.

The non-obvious cost is synchronization. When you map your buffer to place the new vertices using glMapBuffer, you're effectively waiting on a mutex that can be owned by you or the GPU - the GPU will keep that lock from when you issue the draw call until the draw call completes. If the GPU is 'running behind' (that is, commands are completing significantly later than they are issued) you'll block on the lock.

Why is there a lock that we can block on? Well, there are basically two cases:
  1. The "AGP" case: your VBO lives in system memory and is visible to the GPU via the GART. That is, it is mapped into the GPU and CPU's space. In this case, there is only one buffer, and changing the buffer on the CPU will potentially change the buffer before the draw happens on the GPU. In this case we really do have to block.
  2. The "VRAM" case: your VBO lives in both system memory and VRAM - the system memory is a backup/master copy, and the VRAM copy is a cached copy for speed. (This is like a "managed" resource in D3D, if I haven't completely misinterpreted the D3D docs, which I probably have.)
In this second case, you might think that because the old data is in VRAM, you should be able to grab a lock on the system memory to begin creating the new data without blocking. This rapidly goes from the domain of "what can we observe about GL behavior" to "what do we imagine those wacky driver writers are doing under there". The short version is: that might be true sometimes, other times it's definitely not going to be true, it's going to very much depend on how the driver is structured, etc. etc. The long version is long enough to warrant a separate post.

D3D works around this with D3DLOCK_DISCARD. This tells the driver that you want to completely rebuild the buffer. The driver then hands you a possibly unrelated piece of memory to fill in, rather than waiting for the real buffer to be available for locking. The driver makes a note that when the real draw operation is done, the buffer's "live" copy is now free to be reused, and the newly specified buffer is the "live" copy. (This is, of course, classic double-buffering.)

You can achieve the same effect in OpenGL using one of two techniques:
  • If you have OpenGL 3.0 or GL_ARB_map_buffer_range you can use the flag GL_MAP_INVALIDATE_BUFFER_BIT on your glMapBufferRange call to signal that the old data can be discarded after GPU usage.
  • You can simply do a glBufferData with NULL as a base pointer before you map. Since the contents of the buffer are now undefined, the implementation is free to pull the double-buffering optimization. (See the discussion of DiscardAndMapBuffer in the VBO extension spec.)
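A quick sketch of both techniques, assuming the buffer is already bound to GL_ARRAY_BUFFER and buf_size (my placeholder) is its full size:

// Technique 1: invalidate via map-buffer-range (GL 3.0 / ARB_map_buffer_range).
void * p = glMapBufferRange(GL_ARRAY_BUFFER, 0, buf_size,
        GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
// ...write new vertices into p...
glUnmapBuffer(GL_ARRAY_BUFFER);

// Technique 2: orphan with a NULL glBufferData, then map as usual.
glBufferData(GL_ARRAY_BUFFER, buf_size, NULL, GL_STREAM_DRAW);
void * q = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
// ...write new vertices into q...
glUnmapBuffer(GL_ARRAY_BUFFER);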
If you develop on a Mac, you can see all of this pretty easily in Shark. If you map a buffer that you've recently drawn from without first "orphaning" it with glBufferData, you'll see (in a "time profile - all thread states" profile that captures thread blocking time) a lot of time spent in glMapBuffer, with a bunch of calls to internal functions that appear to "wait for time stamp" or "wait for finish object" or something else that sort of seems like it might be waiting. This is your thread waiting for the GPU to say it's done with the buffer. Orphan the buffer first, and the blockage goes away.

Thursday, February 18, 2010

Alpha Blending, Back To Front, Front To Back

I was reading NVidia's white paper on smoke particles and came across the notion of front-to-back blending. The idea is to change OpenGL's blend equation so that you can start at the front and blend in behind translucent geometry.

To blend front to back, you must have a destination surface that has an alpha channel, because the surface alpha channel remembers how much the next layer "shows through" the closer layer already put down.

To render front to back, we need to do three unusual things:
  1. Init our background to all black, all translucent (0,0,0,0).
  2. We set a blend function of GL_ONE_MINUS_DST_ALPHA, GL_ONE. This means the new layer is scaled by whatever show-through is left after the opacity already put down.
  3. We need to pre-multiply our fragment's RGB by its alpha, because this isn't being done by the alpha blender anymore.
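In GL calls, the setup for those three steps might look like this (assuming you're rendering into a surface or FBO that actually keeps destination alpha):

glClearColor(0.0f, 0.0f, 0.0f, 0.0f);          // step 1: black, fully translucent
glClear(GL_COLOR_BUFFER_BIT);
glEnable(GL_BLEND);
glBlendFunc(GL_ONE_MINUS_DST_ALPHA, GL_ONE);   // step 2: new layer scaled by what still shows through
// Step 3 happens in the fragment shader, e.g. gl_FragColor = vec4(rgb * a, a);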
One of the fun side effects of front-to-back transparency is that the final alpha channel in our surface is the correct alpha to draw our composited layers over another scene.

One down side of front to back is that we can't use it on top of an existing scene unless the existing scene has an alpha channel that is set to clear. (This is usually not what you'd find after rendering.)

Compositing

If you then want to put the front-to-back mixed layers on top of another layer, you need to use a blend function of GL_ONE, GL_ONE_MINUS_SRC_ALPHA. Why? Well, since we rendered over black, our mix is "pre-multiplied" by its alpha value - that is, more transparent areas are darker. So we disable the alpha multiply.
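In other words, something like:

// Composite the accumulated (pre-multiplied) layers over the existing scene.
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);
// ...draw a full-screen quad textured with the front-to-back result...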

Back To Front Revisited

If we render back to front, we can use GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, and not premultiply in shader. There's just one problem: alpha poke-through. Basically if you layer four polygons on top of each other, each with 50% opacity, the end result will be very close to 50% opacity, but the correct result should be 1-0.5^4, or 93.75% opaque. So with "standard" back-to-front opacity we can't later blit our accumulated texture.

It turns out we can work around this with some GL voodoo:
  1. Init the background to black opaque (0,0,0,1).
  2. Set the blend function to GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA for color but use GL_ZERO, GL_ONE_MINUS_SRC_ALPHA for the alpha coefficients. This requires glBlendFuncSeparate and GL 1.4.
  3. When it comes time to mix down, use GL_ONE, GL_SRC_ALPHA.
Um...what?

Here's what's going on: we need to use multiplication to "accumulate" opacity. But since multiplication tends to move colors toward zero, and zero is transparent, multiplying our fragments alpha together tends to make things more transparent. So this scheme is based on treating 1.0 as transparent and 0.0 as opaque. Let's review those steps:
  1. Since 1 is now transparent, we init our buffer to alpha=1 for transparency.
  2. By using alpha coefficients of GL_ZERO, GL_ONE_MINUS_SRC_ALPHA, we are multiplying the destination alpha by the source alpha. So here we have our "multiplying" to build up opacity. By using GL_ONE_MINUS_SRC_ALPHA we invert our alpha - the fragment outputs 0 = transparent and this converts it to 1 = transparent. The existing alpha in the framebuffer is already inverted.
  3. When we go to actually composite, we use GL_SRC_ALPHA instead of GL_ONE_MINUS_SRC_ALPHA because our source alpha is already inverted. (The source factor is GL_ONE because, like all pre-made blend mixes, we are pre-multiplied.)
It took me a little bit of head scratching to realize that the blend equation (a*b+c*d) can be used as a multiply instead of an add.
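Putting the whole trick into GL calls, it might look something like this (again assuming an off-screen surface with a destination alpha channel):

// Accumulate back-to-front with "inverted" alpha in the framebuffer.
glClearColor(0.0f, 0.0f, 0.0f, 1.0f);                       // step 1: alpha=1 means fully transparent here
glClear(GL_COLOR_BUFFER_BIT);
glEnable(GL_BLEND);
glBlendFuncSeparate(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA,   // step 2: normal blending for RGB...
                    GL_ZERO,      GL_ONE_MINUS_SRC_ALPHA);  // ...multiply-down for destination alpha

// Step 3: later, mixing the accumulated layers over the scene.
glBlendFunc(GL_ONE, GL_SRC_ALPHA);   // source alpha is inverted, so it already acts as "one minus" coverage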

Friday, February 12, 2010

Multipart MIME and Apple Mail

I finally figured out why attachments from our bug report script don't have icons in Apple mail: Apple mail requires multipart/mixed as the MIME type, while Thunderbird will accept multipart/related.

Apple mail also cares about Content-disposition; it will show an icon for "attachment"-style disposition, even for text files, but it will show the text (with no markings showing it is an attachment) for "inline" style. Thunderbird shows the full text, with horizontal rules, no matter what.

Wednesday, February 10, 2010

How To Change Your UV Map on the Fly

I've been playing with "stupid UV map tricks" lately - the basic idea is to (in the fragment shader) change the texture coordinates before fetch. For example, given a texture divided into equally useful grid squares, we can on a per-grid square basis change which square we're in, to make the texture repetition less obvious. Why do this?
  • You can make your textures look less repetitive without making meshes more complex.
  • Since the effect is in-shader, it can be turned off on lower end machines - scalability!
But there's this one bit of fine print, and it escaped me for about four months: if you want to "swizzle" the UV map in a discontinuous way (e.g. using "fract", "mod", etc.) you need to use the explicit gradient texture fetch functions! If you don't, you get artifacts at the discontinuities.

Huh?!?!

In order to understand why this is necessary, you first have to understand how the hardware selects a mipmap level, and to understand that you have to understand how OpenGL generates derivatives.

First the derivatives. Most of the video cards I know about generate derivatives of a shader variable by "cross-differencing" - that is, a 2x2 block of pixels is run using the same shader, and when the shader hardware gets to the derivative (dFdx and dFdy) it simply subtracts the interim values from the four pixels to find how much they "change" in the box. In other words, the derivative function in GLSL works by discrete per-pixel sampling.

(BTW this is why when you screw up code that needs to treat derivatives carefully, often you'll get 2x2 pixel artifacts.)

These derivatives allow the graphics card to select a LOD. At the site of a texture fetch, the card can do a derivative operation on the input texture coordinates and see how fast they change per pixel. The faster they change, the lower the effective texture resolution and the lower the mip-map level we need. That is how the card "knows" to use the lower mip-maps even when you use expressions for your texture coordinates - the derivative is taken on the entire expression.

But...what happens when you have a discontinuity in your UV map? Take a simple case like "fract". If you "fract" a wrapping texture, you will quite possibly see an artifact at the edges. This is because, right at the edge, the rate of change of the UV map is much higher than before, as it "jumps" from one edge of the texture to the other. High rate of change = low LOD - the graphics card goes and selects the lowest level LOD it has!

(If you don't know what's in your lowest mip, you might not know where the color was coming from.)

The solution is here: texture2DGradARB. This function lets you separately specify the texture coordinates and the derivatives. Here's a simple example. Imagine you have this:
vec2 uv_swizzled = fract(uv);
vec4 rgba = texture2D(my_tex, uv_swizzled);
That example will create a few pixels of low-mipmap texture at the discontinuity (where the texture goes from 1 back to 0). To use texture2DGradARB, you do this:
vec2 uv_swizzled = fract(uv);
vec4 rgba = texture2DGradARB(my_tex,uv_swizzled,dFdx(uv),dFdy(uv));
By using the original (continuous) texture coordinates for the derivative, but the modified ones for the fetch, you can have discontinuous fetches with no LOD artifacts.

NVidia and ATI cards don't respond the same way to discontinuous coordinates, but both will produce artifacts, and both are right to do so.

One last note. From the shader texture LOD extension:
   Mipmap texture fetches and anisotropic texture fetches require implicit derivatives to calculate rho, lambda and/or the line of anisotropy. These implicit derivatives will be undefined for texture fetches occurring inside non-uniform control flow or for vertex shader texture fetches, resulting in undefined texels.
I can tell you from experience that a number of my artifacts have come from conditional code flow. I believe that by non-uniform control flow they mean the case where the shader branches are not all taken the same way for a 2x2 block, but I am not sure.

Running Out of Derivative Res

In a previous post I went over the math behind generating the coordinate system for normal mapping in a pixel shader, which allows you to use tangent space bump mapping without encoding coordinate axes on your vertex mesh. (In X-Plane we do this so that we can allow authors to add bump maps to "unmodified" meshes.)

One of the problems with writing shaders is that it can be write-once, debug everywhere. As it turns out, this technique has a problem that I can repro on a GF8800 but not HD4870. On the 8800, I run out of precision in my derivative (dFdx and dFdy) functions.

In the scene in question, the UV map is generated in the vertex shader via projection off the world-space input vertices and the input mesh is big - 300 x 300 km in fact. (It is of course the base terrain.)

This means that the UV coordinates are pretty big too, particularly for highly scaled-up textures. And that means that the smallest step the floating-point texture coordinates can represent may be larger than the UV change across a single pixel.

When this happens, the result is a derivative that will be inconsistent across pixels, and the basis for the bump map will be corrupted on a per-pixel level.

Work-arounds? I can think of two:
  1. Modify the texture coordinate generation system to produce higher precision UV maps.
  2. Modify the shader to generate basis vectors from the projection parameters (rather than by "sampling" via the UV map) in the texture coordinate generation case.

Monday, February 08, 2010

glXGetProcAddressARB Syntax

I was slightly astounded to read that glXGetProcAddressARB is declared like this:
void (*glXGetProcAddressARB(const GLubyte *procName))();
Wha? Well, fortunately when you read the spec you'll note that they're just being clever...that's very strange C for
typedef void (*GLfunction)();
extern GLfunction glXGetProcAddressARB(const GLubyte *procName);
In other words, unlike all other operating systems, which define the returned type of a proc query as a void *, GLX typedefs it as a pointer to a function taking no arguments and returning nothing.

Why this is useful is beyond me, but if you are like us and call one of wgl, AGL, or GLX, you may have to cast the return of glXGetProcAddressARB to (void *) to make it play nice with the other operating systems.
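A purely illustrative wrapper - none of this is from real shipping code, and the #ifdefs are just my sketch - shows where the cast goes:

#if defined(_WIN32)
  #include <windows.h>
  #include <GL/gl.h>
#elif defined(__APPLE__)
  #include <dlfcn.h>
#else
  #include <GL/glx.h>
#endif

void * get_gl_proc(const char * name)
{
#if defined(_WIN32)
    return (void *) wglGetProcAddress(name);
#elif defined(__APPLE__)
    return dlsym(RTLD_DEFAULT, name);    // on OS X, GL entry points are plain symbols
#else
    // Cast the GLX function pointer back to void * so all three paths match.
    return (void *) glXGetProcAddressARB((const GLubyte *) name);
#endif
}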

Thursday, February 04, 2010

How To Scroll the OpenGL World

So...despite my best efforts to post ridiculous and stupid ideas to this blog, there appear to still be people reading and commenting on it. Chris and I don't really understand this at all, but what the heck: this post is aimed at soliciting feedback. I'm wondering if I've missed a very basic case in a very basic problem.

The problem is the scrolling world. If you have a 3-d "world" in your game implemented in OpenGL, you're up against the limited (32-bit at best) coordinate precision of the GL. As your user migrates around the world and gets farther away from the origin, you start to lose bits of precision. At some point, you have to reset the coordinate system.

I see three fundamental ways to address this problem:
  1. Stop the world and transform it. This is what X-Plane does now, and it's not very good. We bring multi-core processing into play, but what we're really bottlenecked by is the PCIe bus - many of our meshes are on the GPU, and have to come back to the CPU for transformation.

    (Transform feedback? A cool idea, but in my experience GL implementations often respond quite badly to having to "page out" meshes that are modified on card.)

  2. Double-buffer. Make a second copy of the world and transform it, then swap. This lets us change coordinate systems quickly (just the time of a swap) but requires enough RAM to have two copies of every scene-graph mesh in memory at the same time. We rejected this approach because we often don't have that kind of memory around.

  3. Use local coordinate systems and transform to them. Under this approach, each small piece of the world is in its own local coordinate system, and only the relationship between these "local" coordinate systems and "the" global coordinate system is changed.

This third approach strikes me as the most promising one, but it also strikes me as difficult from a mesh-cracking standpoint. I don't see any way to guarantee that two triangles emitted under different matrix transforms will have the same final device coordinates, and if they don't, there can be mesh artifacts.

So that's my question: is there a way to connect two meshes under different coordinate transforms without cracking? Is there a limited set of matrix transforms that will, either in theory or practice produce acceptable results? Do game engines just hack around this by using clever authoring (e.g. overlap the tiles slightly and cheat on the Z buffer)?

Wednesday, February 03, 2010

The STL Is Not An Abstraction

I came to a realization the other day, having been burned by the STL for approximately the 100,000th time. Okay here goes that quotable crap again:
The STL is not an abstraction; it is a shortcut.
In computer programming, an abstraction is something that hides the details. Abstractions let us get stuff done, and most of the time they leak. Is the STL the leakiest abstraction in the universe?

No. It's not an abstraction at all. Abstractions hide implementation from you - the STL simply provides implementation.

An indication that the STL is an abstraction would be that you could change the implementation of an STL algorithm or container and not notice. Does the STL meet that criterion? I don't think so, at least not in any sane way.

With the STL, you need to know all of the fine print for any algorithm or class you use. Picking the type means picking an algorithm or data structure for its strengths and weaknesses. For example, if you pick vector, you are picking the following:
  • A simple, compact representation.
  • Blazingly fast random access iteration.
  • The copy constructor of your data is going to be called a gajillion times.
  • Mutating the size of the vector is going to hose outstanding iterators.
  • Insertion and deletion anywhere but the far end cost a fortune.
That's how vectors roll. A container abstraction might hide these things; picking vector prescribes what will happen, pretty exactly.
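A tiny, contrived illustration of two of those bullet points (the names are mine):

#include <vector>

int main()
{
    std::vector<int> v(4, 0);
    std::vector<int>::iterator it = v.begin();
    v.push_back(1);           // may reallocate: every outstanding iterator (like 'it') is now invalid
    // *it = 5;               // undefined behavior - the "hose outstanding iterators" fine print
    v.insert(v.begin(), 7);   // non-far-end insertion: every element gets copied over by one slot
    return 0;
}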

And that's okay; typing vector is still faster and less error prone than typing int * and remembering not to screw up the dynamic memory allocation. But let's recognize what the STL is: a way to make certain known containers and algorithms much faster to put into your code - not a way to write code without knowing what your algorithms and containers do!

Monday, February 01, 2010

Moore's Law and Openness

If you look back at Windows and how the west was won, you'll see a story of network effects and compatibility: an unbroken chain of being able to run old apps unmodified from DOS to Windows, and an architecture (x86) that we're still stuck with today. If there are two lessons to take away, it might be:
  • Software takes forever to die - it's really hard to throw it out and start over again.
  • Network effects are very strong - once all the apps are on Windows, everyone wants to run Windows. Once everyone runs Windows, we want to write apps for Windows.
I realize that this blog article might look really, really stupid in a year or two (and in that case, all hail Google, our new overlords). But...the strong network effects in the embedded games space all point towards the iPhone. App developers know that if you want to be on one platform to make money, you have to look at the iPhone first, even if you hate Objective C. And if you want to run apps, the iPhone is in its own category. (Just spend a car ride with an iPhone owner and you'll see..."Look, you can flick a ball of paper". I can't knock it - it's a fun app!)

What's weird here is that the iPhone is pretty much invented out of whole cloth. It doesn't run software from any other platform, it builds its UI off of Objective C and Cocoa (which, to the non-Kool-Aid drinking half of the Apple third party development community looks like a new way to force us to use what we've been ignoring for years) and Apple has had the device locked up from day 1. This couldn't be more different than how Windows gained domination. So how did we get here?

Clearly having a beautiful device way before everyone else makes a huge difference. But I want to focus on another idea: is it possible that technology "productivity dividends" have fundamentally changed the calculus of building a new platform?

Development of applications for the original Macintosh was, by modern standards, brutal. You had 128K for the OS and your app, and it was a tight squeeze. Every line of code was performance critical and size critical. Those first GUI-based apps were written by some seriously brilliant programmers who had to sweat bullets.

Fortunately for us working programmers, computers are now much, much faster and bigger. Instead of writing apps that are millions of times faster (which no one would care about - at some point, the window appeared to open instantly and any speed improvement is moot) we write at a higher level of abstraction, which means we write apps more quickly. To draw a supply and demand analogy, apps for the iPhone (or any computer now) are less expensive in man hours because we have better tools that trade hardware horsepower for ease of development.

So that might partly explain why Apple now has 140,000 apps or so on their phone. It's not that hard to write them. But what about this business where Apple hand-picks apps and rejects the ones they don't like? My first reaction as an iPhone app developer was "hrm....it sure looks like a real computer, but man is it locked down." It certainly wasn't what I was used to.

The iPhone is a surprising device to develop for, because as an app developer, you aren't given the tools to hose the machine. As a Windows developer you might be grumpy that, after decades, Microsoft has finally said that you can't dump files randomly in the system folder without user permission, but the iPhone takes things more seriously. It's somebody's phone, damnit, and your app isn't getting outside of its sandbox, let alone into the OS.

I see the fact that the iPhone has successfully developed a third party market despite being locked down as an indication that user demands may be changing. In the old world, where apps were rare and expensive to write, what we wanted was: more software. Perhaps in the new world, where writing apps isn't so hard, what users want is an experience that focuses on quality rather than quantity of apps.

(Or to put it another way: if you would agree to audit every single piece of software that a user might put on their Windows computer and guarantee that none of it was going to wreck that computer, you'd have a service you could sell. The iPhone comes with that out of the box.)

Of course, I could be missing the point entirely; the iPhone cuts distributors out of the loop, with sales going only to store and studio - perhaps that's enough to launch 140,000 apps.