Monday, March 31, 2008

The Superiority of Ranges

When it comes to describing an interval (whether it is a range in memory, a range on the screen, a range of times) if you work in C/C++, your life will become easier once you recognize the truth: the language wants you to work using "begin/end+1" for ranges.

To clarify the problem, when we have to describe a range of values, there are typically three ways that programmers use a pair of integrals to describe the range:
  1. First item, number of items ("start/length")
  2. First item, last included item ("start/end")
  3. Fisrt item, one past last item ("start/end+1")
This third way is the best way...I don't say that lightly...but after years and years of programming, I find that code using start/end+1 comes out cleaner and simpler every time.

First, why not start/end? Well, the nice thing about start/end+1 is that the length of the range is just end-start. When you use start/end, you have to do the awkward (end-start+1).

Range checks also become inconsistent...basically you want to have your regions defined inclusive on one side and exclusive on the other. In other words, being in the range should be true if x >= x1 && x < x2. What's good about this is that if the end value of one range is equal to the start value of another, the two perfectly align, and the two cover an interval completely. With start/end notation, you get tons of off-by-one situations.

Similarly, start/length simply requires you to write more code. Consider start/end+1...a number of useful operations become trivial:
  • The union of [x1,x2) and [y1,y2) is just [min(x1,y1),max(x2,y2))
  • The intersection of the two is [max(x1,y1),min(x2,y2))
  • A range is empty if x2 <= x1
  • Containment and asymmetric set differences are all equally clean.
Try rewriting the above logic with either start/end or start/length and I think you'll that start/end+1 gives you the simplest code.

Tuesday, March 25, 2008

Message Q's and Sins

Well, it's been at least a year since I ranted about how much I love message queues, so what the heck. Every programmer has a pet idiom and for me the message queue is it - from my perspective it represents one of the cleanest ways to add parallelism to an app.
  • When an algorithm isn't naturally parallel, pipelining is often the best way to split it across cores. Message queues are great for creating the links between the pipeline stages.
  • Message queues naturally solve the problem of resource ownership without extra locking. Since the message cannot be at the send and receive side of the queue at the same time, "owning the message = owning the resource" gives you clean resource allocation. Because there are no locks, you don't have to worry about dead-lock cases or locks getting contested.
A typical message-queue design in X-Plane involves two queues - one sends work to be done to a worker thread and the other sends the results back to the main thread. The worker thread blocks and only runs when there are tasks and the main thread polls the "done queue" periodically at times when it can handle the results. There can be more than one worker thread if desired.

When evaluating this kind of design we have to consider two performance aspects: throughput and latency.

For throughput message-queue designs work pretty well. We can easily scale up the number of worker threads based on the number of cores and simply pile a huge number of tasks into the work queue - that'll keep all of those cores busy.

Where message queue designs don't work as well is latency. When we queue a new task to be done, we have no idea how long it will take to get done, but we know the answer is "longer than if we didn't have a worker thread". Besides the time for the message to get to the thread (that thread might be asleep and have to wake up and then wait for CPU time) we also might have a whole pile of messages ahead of us. Some designs use priority message queues to get preferential dispatch of messages, but even then all threads might be busy. (Even more complexity can be added to try to address that case.)

We also pick up latency on the return side, since the main thread has to check for finished results, which might only happen once per simulation frame.

Fortunately for X-Plane, the kinds of things we push out to threads usually aren't very latency sensitive - we build up 3-d scenery, but we queue it well before we arrive at that location, so the additional latency is acceptable.

Sunday, March 02, 2008

The Greatest 24 Pages Ever

(Okay, that might be a slight exaggeration but...)

Everything you wanted to know about the Radeon 9500 - X1900 but were afraid to ask. (PDF)

The reason I love this document so much is the scope it covers: exactly how the hardware is different from the specs. One of the tricky aspects of programming advanced OpenGL applications these days is that a lot of the fine print can't be expressed in terms of the extensions. For example:
  • The R300/R400 (X100-X850) GPUs will render to floating point bufferers, but do not provide alpha testing, alpha blending, or fog on the back end of the frame buffer.
  • The R500 GPUs (X1000-X1900) will render floating point buffers with alpha and blending but only if the floating point buffer is 16 bitgs, not 32 bits. Fog is still not an option.
If you violate these constraints the GL will fall back to software rendering, which is usually not what you want. You can discover all of these things in beta and slowly work up a profile of supported hardware (there are "only" 8 distinct revisions of pixel-shader-capable graphics cards out there between nVidia and ATI) but when a document comes along and spells out all the fine print (including stuff you might not otherwise know), well, that's a gift!!!

One thing I learned that I would never have figured out: glClear gives better fill rate than simply over-painting the Z buffer on the R300.

The document even discusses internal precisions. (Hey X-Plane users...this is why your water looks "square" and "pixelated" on pre-X1000 Radeons; lower internal floating point precision and other R300/R400 hardware limititations.)

As far as I know, this is the closest thing by nVidia.