Friday, December 21, 2007

What's Wrong With This Code

What's wrong with this code, which uses a second thread to build new nodes in a scene graph while the main thread renders and periodically picks up finished work?
thread 2:
create mesh geo in memory
glBufferData to create a vbo
send msg to thread 1 with VBO ID
thread 1:
while(1)
{
if we get a msg
insert VBO into scene graph
draw scene graph
}

Or how about this logic, which uses threads and PBOs to asynchronously capture full screen to disk?

thread 1:
render frame
bind pbo
glReadPixels
unbind pbo
send msg to thread 2

thread 2:
wait for message
bind pbo
get pbo data
unbind pbo
write data to disk


Hint: the answer is the same!

Updated OpenGL Performance Lore

Performance lore (e.g. "I did X to my app and it got faster, so you should do Y") is inherently dangerous: performance tuning without performance analysis is a bad idea. With that caveat, I think lore does have a use in at least pointing out possible areas of exploration.

So, some notes on X-Plane 9 performance:
  • It doesn't take a very complex pixel shader to become fill-rate bound! In version 8 (whose shaders were little more than the fixed-function pipeline) X-Plane was almost always limited by the CPU and our ability to find and emit batches, even at high FSAA. With X-Plane 9, even on high-end cards, even modest-sized shaders are the limiting factor. Never underestimate the cost of doing something ten or twenty million times!
  • Overdraw is bad! As soon as you get into shaders, you become fill-rate bound, and overdraw becomes very expensive. So using blending and glPolygonOffset to composite into the frame buffer becomes a bad idea.
  • Geometry performance: putting index buffers into VBOs gained us some speed.
  • Optimizing down the number of vertices by more aggressively using indexing gained some speed, and never hurt, even if the mesh was a bit out of order, even on older hardware.
  • Using 16-bit indices on Mac made no difference - perhaps they're not natively supported?
  • Using the "static" VBO hint instead of "stream" for non-repeating geometry gave a tiny framerate improvement in a few cases, but mostly caused massive performance failures.
A note on this last point: the driver essentially has to guess a resource usage pattern from your API calls. Many drivers use a most-recently-used (MRU) purge scheme - that is, once you run out of VRAM, you throw out the most recently used resource and replace it. If we purged least-recently-used resources instead (a real FIFO under a cyclic per-frame access pattern), we'd guarantee the worst case: we would flush and reload every resource into VRAM every frame. So MRU makes sense.

The stream vs. static VBO hint will usually get your VBO into VRAM (static) or simply use it by mapping via AGP memory (stream). If your total working set is larger than VRAM (something's going to go over the bus) and a VBO is used only once, stream is the best choice; save VRAM for things that are used repeatedly, like repeating meshes and textures.
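For reference, here's what the two hints look like at the API level - a minimal sketch using standard GL 1.5 entry points, with placeholder sizes and data:

GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);

// Reused every frame: "static" hints that the driver should keep it in VRAM.
glBufferData(GL_ARRAY_BUFFER, num_bytes, vertex_data, GL_STATIC_DRAW);

// Used once and discarded: "stream" hints mapping over the bus (AGP memory),
// saving VRAM for resources that repeat.
glBufferData(GL_ARRAY_BUFFER, num_bytes, vertex_data, GL_STREAM_DRAW);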

So when I tried jamming everything into VRAM, we ran out of VRAM in a big way and started to thrash...MRU works well when we're over budget by a tiny amount, but for a huge shortfall we just end up with thrash.

None of this is really surprising; people have been telling us we'd be shader-bound for years...it just took us a while to get there.

Thursday, December 20, 2007

Vtune - where's the stack?

I used VTune the other day to look at some performance problems on my PC. (Turns out the performance problem is the PC.) There's one thing about VTune I don't quite get:
  • When you're using the adaptive sampling profiler (system-timer-driven sampling) you only get function names.
  • To get a stack crawl you need to instrument the application (invasive profiling).
Now I don't like invasive profiling at all for a few reasons...
  • It changes the performance characteristics of the application.
  • It's not really compatible with inlining.
When sampling with Shark on the Mac, you get the stack history of each sample, not just the function itself. This is pretty important when leaf functions are bottlenecks only along certain call paths. For example, it's one thing to know you're spending all your time in glDrawElements (duh). But who is calling it? The OBJ engine? The mesh engine? Something else?

Friday, September 07, 2007

How does COM know?

If you call a non-blocking method on a COM object that's in a different apartment and you need to receive callbacks, you need to sit in a message pump of some kind. That fact is written in a few different blogs, but what I didn't understand until today is: why does that work? How is COM connected to the message pump?

First let's pull apart that statement with some basics:
  • Single-Threaded Apartment (STA) basically puts boundaries around sets of objects that are similar to the boundaries around a process. When calling between the apartments (sets of COM objects) we use an RPC mechanism instead of a direct function call.
  • There is a 1:1 correspondence between threads and apartments, and a 1:1 correspondence between threads and their message queues. Thus there is an appropriate message queue for each apartment, and posting to that queue determines which thread gets the message.
  • RPCs are thus implemented by posting messages to the queue of the thread we're trying to reach. We then poll our own message queue until we get some kind of "reply" indicating the RPC has completed. This looks like a blocking function call to client code.
In order for this to work, the thread in the apartment we are calling into must itself be waiting on a message queue. This would be the case if it is either (1) really bored and just querying the message queue or (2) it is itself blocked on a method call into another apartment, and is thus polling its queue to find out if its own RPC is done.

If this all seems like insanity to you, well, it is.

Now when I say "non-blocking" method call, what I really mean is: a method call that returns really fast but starts some work to be completed later.

Normally when a thread is blocked because it made an RPC into another apartment, that apartment can call right back because the same polling of the message queue to discover that the RPC is over allows other methods to be called. This simply means that the flow of code between COM objects can ignore STA when all method calls are "blocking".

But as soon as we have a non-blocking call, there is no guarantee that the client code is actually listening for method calls into its apartment. (By the rules of STA, if the thread is doing stuff, no calls CAN happen, because one thread per apartment.)

Typically client code will make the async call, maybe make a few, and then do some kind of blocking until we're done...for example, we might call WaitForMultipleObjects.

In this case the right thing to do is MsgWaitForMultipleObjects (followed by GetMessage/DispatchMessage if we get woken up for a message). This way while our thread is doing nothing, other apartments can call us back.
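Here's a minimal sketch of that wait - the event handle and function name are mine; assume some callback signals the event when the async work finishes:

#include <windows.h>

void WaitButStayCallable(HANDLE done_event)
{
	while (1)
	{
		DWORD r = MsgWaitForMultipleObjects(1, &done_event, FALSE, INFINITE, QS_ALLINPUT);
		if (r == WAIT_OBJECT_0)
			break;              // our async call has completed
		// Woken for a message - pump it so COM RPCs can enter our apartment.
		MSG msg;
		while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
		{
			TranslateMessage(&msg);
			DispatchMessage(&msg);
		}
	}
}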

This works because the thread, message queue, and apartment are all 1:1 relationships. So to say "this thread needs to be open to COM RPCs" all we need to say is "this thread needs to block on its own message queue", which is done with GetMessage.

Friday, July 20, 2007

gl_FragCoord + ATI = World of Hurt

Well, we just had a bit of a fire drill when we learned that gl_FragCoord causes some ATI cards on Windows to revert to software shading. I don't know if this is fixed in a driver upgrade - we used a simple workaround suggested by a user on gamedev.net:
  • We store the clip-space coordinate in a varying variable. (We have to compute that anyway, it's just the result of ftransform()).
  • We then perform a perspective divide and change the coordinate range in the fragment shader.
Fortunately we wanted the screen coordinate as 0..1, so having clip coordinates is almost what we want anyway. The perspective divide must be per-pixel - varying variables only interpolate in a perspective-correct manner in homogeneous clip coordinates, not window coordinates. At least, I think. :-)
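For reference, a minimal GLSL sketch of the workaround - variable names are mine, and this is the idea rather than our exact shaders:

// ---- vertex shader: pass the clip-space position along in a varying ----
varying vec4 clip_pos;
void main()
{
	clip_pos    = ftransform();
	gl_Position = clip_pos;
}

// ---- fragment shader: perspective divide per pixel, remap -1..1 to 0..1 ----
varying vec4 clip_pos;
void main()
{
	vec2 screen01 = (clip_pos.xy / clip_pos.w) * 0.5 + 0.5;
	gl_FragColor  = vec4(screen01, 0.0, 1.0); // ...or use screen01 as needed
}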

This is a case where "the Google" comes in really handy -- having confirmation from other users of a known bug gives us a lot more confidence in the workaround.

Friday, June 15, 2007

list.size() is slow

Ever wonder how the STL list works in GCC?



What? No? You haven't?

Where's your sense of curiosity about the way your code works?

What? You have a life?!? What are you reading this blog for then?

Anyway, it's pretty simple. It's a doubly-linked list, where the node has a previous and next pointer. The head and tail of the list are linked together, and the list itself keeps a pointer to the head node.

This means that begin() and end() are fast.

Unfortunately, the list doesn't keep a size counter. This means that size() is O(n). So you can find the back of the list easily but getting its index is expensive.

That's probably fine, as turning the index back into an iterator is O(n) and the iterators are stable. By comparison though, CodeWarrior's list caches the number of items.
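Here's a toy sketch of the layout described above - the names are mine, not the real libstdc++ internals:

struct node_base {
node_base * next;
node_base * prev;
};

struct toy_list {
node_base sentinel; // head = sentinel.next, tail = sentinel.prev, circularly linked

node_base * begin() { return sentinel.next; }           // O(1)
node_base * end()   { return &sentinel; }               // O(1)
bool empty()        { return sentinel.next == &sentinel; } // O(1)
unsigned size()     // O(n) - no cached count, we must walk every node
{
unsigned n = 0;
for (node_base * p = sentinel.next; p != &sentinel; p = p->next) ++n;
return n;
}
};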

Moral of the story: empty() is more efficient than size()==0, and, um, know what's in your STL.

(Insert rant about STL being like sausage here...)

Thursday, June 14, 2007

Virtual Memory Dumping for Fun and Profit

We're trying to answer the question: why did X-Plane run out of memory? To answer that, we need to look at the contents of virtual memory. Now dumping virtual memory isn't so bad.

On Windows you call VirtualQuery with a base address. Start at 0 and increment by the returned RegionSize. When you get to dwTotalVirtual (as returned by GlobalMemoryStatus) that's a good time to stop.
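A minimal sketch of that Windows walk (the printing is mine; era-appropriate 32-bit assumptions):

#include <windows.h>
#include <stdio.h>

void dump_address_space(void)
{
	MEMORYSTATUS ms;
	GlobalMemoryStatus(&ms);

	char * addr = 0;
	while ((SIZE_T) addr < ms.dwTotalVirtual)
	{
		MEMORY_BASIC_INFORMATION info;
		if (VirtualQuery(addr, &info, sizeof(info)) == 0)
			break;
		printf("%p: %10lu bytes, %s\n",
			info.BaseAddress, (unsigned long) info.RegionSize,
			info.State == MEM_FREE ? "free" : "used");
		addr = (char *) info.BaseAddress + info.RegionSize; // next region
	}
}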

On the Mac call vm_region, starting at 0 and incrementing by "size" until it returns an error code. One tricky thing: vm_region will skip over unused virtual memory. To "see" these holes (this is address space that can be used later) compare the address you pass in to the one that's returned. If you pass in a pointer that's in an unused region, the address will be advanced to the next region.

What if you want to show the DLLs/dylibs that are mapped into a given region? I make no promise for the performance of this method but...

On Windows, use GetMappedFileName. Pass in your process handle and a base address from VirtualQuery and it fills in a buffer with a DLL name if possible. This is in psapi.dll so it isn't available on non-NT-derived Windows.

On Mac, use dladdr, passing in the base address of the region. You'll get a dylib file path and a symbol name, although the symbol name tends to be junk.
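A quick sketch of the dladdr approach (the function name is mine):

#include <dlfcn.h>
#include <stdio.h>

void name_region(const void * base)
{
	Dl_info info;
	if (dladdr(base, &info) && info.dli_fname)
		printf("%p -> %s (%s)\n", base, info.dli_fname,
			info.dli_sname ? info.dli_sname : "<no symbol>");
}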

Enum tags - C vs C++

So you're minding your own business when you come across a gem like this:
typedef enum __MIDL___MIDL_itf_qnetwork_0082_0001 {
AM_EXSEEK_CANSEEK = 1,
AM_EXSEEK_CANSCAN = 2
} AMExtendedSeekingCapabilities;

It doesn't take much to bring out the inherent beauty of C...it rolls off the tongue like the name of a fine perfume. Anyway, this is legal in both C and C++, but the rules are a little bit different.

First of all, this is a typedef statement. __MIDL___MIDL_itf_qnetwork_0082_0001 is the enum tag and AMExtendedSeekingCapabilities is the defined type. (I know, those aren't the real C++ spec names...)

A few things are clear:
  • In C or C++ we can assign the enumerated values (AM_EXSEEK_CANSEEK, AM_EXSEEK_CANSCAN) to an integral type, and we'll get the values 1 and 2.
  • I have no idea how big one of these enums is...check your local compiler settings. It may be that this enum only takes up 8 bits or it may have been allocated a full 32 bits. (We use ints for our enums in the X-Plane SDK to be sure how big they are. We lose type safety but see below.)
Now here's where things get fun. C is picky: it requires the "enum" keyword when using the tag but not when using the typedef. In other words:
typedef enum tag { a, b } def;

enum tag var1; // legal - enum keyword and tag go together.
tag var2; // illegal - tag requires the enum keyword.
enum def var3; // illegal - enum keyword cannot be used with a typedef.
def var4; // legal - typedef is usable on its own.

By comparison, all 4 of those statements will compile correctly in C++. You can see C++ code like this:
enum letters {
a, b, c, d
};
letters my_var = a;

This won't work in C.

But C++ is more strict than C when it comes to typing. Given our above example, we can do this in C:
def my_var = 5; // Illegal in C++
In C++ you'll get an error -- you can assign an enum to an int, but not vice versa.

Tuesday, June 12, 2007

Dynamic Cast is Slow but...

...stupid code is slower.

In a performance tuning situation, sometimes you have to figure out:
  • Is the problem that I'm not doing each operation fast enough or...
  • Is the problem that I'm simply doing too many operations?
(Of course, the answer can be both.) This is particularly tricky because...if an operation is slow, that can help you realize that you're doing too many of them. If you then cut the number of operations, who cares if it's slow? And if it's not slow, is there any harm in doing too many operations?

(Typically I will leave the operation slow while I cut down the number of operations, because having slow primitives makes it easy to observe the effects of improving algorithm time in a profiler. Then once I fix the algorithm, I tune the operation itself, possibly even temporarily disabling the algorithm improvement to make it easier to "see" what's going on when profiling.)

For WED (the X-Plane scenery tool) I decided to "code first, tune later" -- given that the design could change a lot while in progress, it seemed best not to spend a lot of time on code that might get thrown out. I knew that the fundamental design (a group of airports, each spatially far apart) could easily be tuned to prevent major O(n) parts just using indexing and culling.

The resulting unoptimized code was, well, simply awful. What really surprised me was how much time was being spent in dynamic_cast - pretty close to all of it.

The solution was mostly to fix stupid code, but in some cases to not use dynamic_cast.
  • The biggest single problem was a lack of caching. If we have a tree structure, the intermediate nodes can't ask the leaf nodes for information on every function call - the result makes time complexity worse, not better!
  • When dealing with a large dataset, pass counts matter. For example, avoiding iterating on the hierarchy of airports when drawing a layer that doesn't care was a big win - it takes a long time just to visit all of that data.
Like many programs, WED has the problem of needing virtual functions along two axes - we want to specialize the combinations of various data-model types with various UI constructs. (Runway for the selection tool, taxiway for the selection tool, runway for the vertex tool, taxiway for the vertex tool.)

I'm not a big fan of visitor, so my original design simply used dynamic cast within the specific UI construct to figure out which datamodel construct it had. It turns out the performance problem with this isn't that we're double-dispatching, but only how we do it, and for a large number of objects, dynamic_cast is simply too expensive. Providing a virtual function to get some kind of "type info" on the generic object is much faster.
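For illustration, a minimal sketch of that virtual "type info" approach - the names here are hypothetical, not WED's actual classes:

enum entity_kind { kind_runway, kind_taxiway };

class IEntity {
public:
	virtual ~IEntity() { }
	virtual entity_kind Get_Kind() const = 0; // one cheap virtual call
};

class Runway : public IEntity {
public:
	virtual entity_kind Get_Kind() const { return kind_runway; }
};

void handle_entity(IEntity * e)
{
	// Instead of: if (Runway * r = dynamic_cast<Runway *>(e)) ...
	if (e->Get_Kind() == kind_runway)
	{
		Runway * r = static_cast<Runway *>(e); // safe - Get_Kind told us the type
		// ... use r
	}
}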

So if there's any lesson on dynamic_cast here, I think it's just "dynamic_cast does some serious work" ... don't use it unless you really have to, and don't use it inside a hot loop, or where N is large.

Monday, June 11, 2007

Cause and Effect

I have a tree-view (think like the tree of files in an Explorer or a Finder window). Since all of my rows are based on objects with integer IDs, I have something like this:

map<int, bool> mOpen;

bool Is_Open(int id)
{
map<int, bool>::iterator i = mOpen.find(id);
return (i == mOpen.end()) ? true : i->second;
}


Note that if we don't have an ID in the map, we pretend its value is true. This "defaults" the hierarchy to open whenever we come upon a new ID - similarly "clearing" the map opens everybody.

So then I write this clever toggle routine:

void Toggle_Open(int id)
{
mOpen[id] = !Is_Open(id);
}


And it doesn't work. Why not? Well, it turns out that (with this compiler) the l-value mOpen[id] is evaluated before the right-side expression. Thus we have already called mOpen[id] before we ever call Is_Open.

Here's the subtle bug: calling the array operator on a map creates a default entry (by the default ctor of the value type) if there is no entry for the key. So mOpen[id] on an id that isn't present in the map inserts a default 0 into the map.

Now Is_Open was modified to return "true" for unknown IDs, but this code is making the ID known and false.

The result was: the first time the user clicks on an open triangle, it stays open. This is because it is being set to closed by map::operator[] and then opened.

The correct code is

void Toggle_Open(int id)
{
bool old = Is_Open(id);
mOpen[id] = !old;
}


This forces the current open state to be evaluated before the array operator can make a default entry.

To quote from chapter 5 of the C++ spec:

"Except where noted, the order of evaluation of operands of individual operators and subexpressions of individual expressions and the order in which the side effects take place, is unspecified."


I knew the specification was good for something. :-)

Interfaces and Diamonds

I have a set of abstract interfaces that go something like this:

class IPoint {
public:
virtual void SetLocation(const Point2& location)=0;
};
class IPoint_Heading : public virtual IPoint {
public:
virtual void SetHeading(double heading)=0;
};


I derive my point-heading from the point because it makes the interface more useful for clients. Because ALL point-headings MUST BE points, clients can simply set the location of a point-heading and use it like a point without having to do dynamic casting and failure-checking. So that seemed good.

Why is it virtual? Well, here's why:

class WED_Point : public virtual IPoint {
public:
virtual void SetLocation(const Point2& p);
};

class WED_Point_Heading : public WED_Point {
public:
virtual void SetHeading(double heading);
};


Basically I inherit the implementation of my point into my point-with-heading to avoid recoding points. If IPoint isn't a virtual base class, then the implementation of SetLocation doesn't apply to WED_Point_Heading.

Now before I go on, let's examine that fact in more detail. It is a surprising detail. You might think: "well, WED_Point_Heading derives from WED_Point, which implements SetLocation, what else would I need?" That's what I thought until I tried it - and then, when GCC rejected it, I read the C++ spec.

The problem is simple: given a derived class with two bases (B1 and B2), a virtual function in B1 cannot provide an override for B2. (I don't see any C++ implementation reason why the language couldn't work that way, e.g. it is possible to create this structure in memory, but my guess is that the ensuing chaos of mixing in two unrelated bases and having one change the guts of the other would be worse than what we have now, by a lot.)

The reason that the use of IPoint as a virtual base solves this problem is because WED_Point_Heading only has one IPoint. The compiler merges the IPoint sub-object and thus WED_Point's overrides apply everywhere, because they're not being copied from one base to another, they're being applied to one virtual base and there is only one virtual base.

Now I tend to say that if you have an inheritance diamond you've probably already screwed up, and this case is no exception. The real problem here is the use of inheritance of implementation to get code reuse. The real way to unravel this is:
  1. Define two non-virtual, non-polymorphic implementation classes (point-guts, heading-guts) that provide the true run-time implementations of the "SetLocation" code and "SetHeading" code.
  2. Derive WED_Point_Heading only from IPoint_Heading.
  3. Use those same guts in WED_Point and WED_Point_Heading, as sketched below. Because the common implementation has been factored out, the only "code duplication" is the calling of the common implementation. I would say this is not even code duplication, but code disambiguation (a careful description of how we want to map our interface hierarchy onto a set of reusable implementation pieces).
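Here's a minimal sketch of that refactoring - the "guts" names are mine, and it relies on the IPoint, IPoint_Heading, and Point2 definitions above:

class point_guts {      // non-virtual, non-polymorphic implementation
public:
	void set_location(const Point2& p) { loc_ = p; }
private:
	Point2 loc_;
};

class heading_guts {
public:
	void set_heading(double h) { heading_ = h; }
private:
	double heading_;
};

class WED_Point : public IPoint {
public:
	virtual void SetLocation(const Point2& p) { guts_.set_location(p); }
private:
	point_guts guts_;
};

class WED_Point_Heading : public IPoint_Heading { // one base - no diamond
public:
	virtual void SetLocation(const Point2& p) { pt_.set_location(p); } // the "disambiguation"
	virtual void SetHeading(double h) { hd_.set_heading(h); }
private:
	point_guts   pt_;
	heading_guts hd_;
};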

Friday, April 27, 2007

Getting Started with WinDbg

I discovered WinDbg about a year ago and have found it to be the most powerful and complete debugger available for Windows PCs. Unfortunately, it can be a very intimidating debugger since most of the work is done via text commands. I'm going to use this blog entry as a place to dump most of my learnings about WinDbg. It will be updated quite often until I feel I've documented enough to use this as a reference to teach someone how to use it.

Setting WinDbg as the default Just-In-Time (JIT) Debugger


From the command line, run:
C:\> cd C:\Program Files\Debugging Tools for Windows
C:\Program Files\Debugging Tools for Windows> WinDbg -I

Adding Microsoft Symbol Server to the Symbols Path

Launch WinDbg, then go to File>Symbol File Path and add SRV*C:\Windows\Symbols*http://msdl.microsoft.com/download/symbols to the paths.

When you exit WinDbg, it will ask you if you wish to save changes for Workspace Base. You must select YES or it will not save your symbol path.

Monday, April 16, 2007

Musings on Optimization

First, I should say that the point of my post on the cost of virtual methods in C++ is absolutely not that virtual methods are too expensive. If I had to make a general rule, I would have to go the other way and say they're rather affordable.

I think the most important thing to understand is that, because time-to-market is valuable, and employees' salaries are valuable, development time is a resource that must be deployed efficiently. Thus if you are going to spend time optimizing, it's important that you deliver real value for your work. This means two things I suppose:
  • Don't optimize code that's already fast enough.
  • Don't optimize code that is so irrelevant to your program's performance that it doesn't translate into improved user experience.
For example, by X-Plane efficiency standards, the date-time dialog box is not nearly as fast as it could be. We could do a lot to make it faster! But have we? Heck no! We have a wide variety of tricks and tools to make graphics code faster than it is, and each one of those tools takes up developer time -- to deploy it on a dialog box where no one cares how fast it is would be folly. No one cares!

Similarly we could try to optimize the performance of our menubar drawing code. The menu bar is drawn enough that it affects users. But the menu bar is such an infinitely small fraction of program performance that even if we made it 100x faster, the translation into overall program speed would be unmeasurably small.

So in that context, I can only say this about virtual functions: virtual functions are an organization tool that reduce the total time that it takes to code by giving a programmer better code-management and organization tools. Since the performance of a program is already limited by how many developer hours are available to optimize:
  • Most of the time virtual functions will outweigh their performance cost by improved productivity.
  • I suspect that if you are really worried about speed, you'll get better program performance by using "more expensive" OOP techniques in most parts of the program and using the freed-up development time to optimize the few parts that really matter.
This kind of performance work is called opportunistic optimization - the idea is that you measure your program's performance and spend your time on only the parts that are both optimizable and are taking up a lot of time, thus maximizing the performance boost per developer hour. With tools like VTune and Shark, getting easy-to-read, high-quality data on where optimization is needed is now really easy.

(An instrumenting profiler changes your code to record timing data; an adaptive sampling profiler uses the hardware to poke at it at fixed intervals. While instrumenting profilers would seem to provide more accurate data, I don't like them because often the profiling overhead changes performance characteristics so much that you can't tell what the real problem is. This is particularly a problem for OpenGL-based applications, where half the performance is on the GPU and thus the CPU-GPU execution speed ratio matters a lot.

Adaptive sampling profilers work because the most important code to fix is by definition running most of the time, so most of the samples will fall into functions you care about. For example, when profiling X-Plane we can find up to 40% of samples falling into OpenGL, mesh, and OBJ draw code, which is indeed the most important part of X-Plane to tune. Sure, the adaptive sampling profiler gives us really unrealistic data about the cost of drawing the menu bar, but the time spent there is so small that we don't care.

Shark is strictly an adaptive sampling profiler. VTune comes with both types. GCC and most compiler packages also provide an instrumenting-profiler option.)

One more thought: X-Plane is fast by design -- that is, some of the things we did to have high throughput from the rendering engine were design decisions that had to be made early on. This goes against the idea of opportunistic optimization. If we were planning these ideas from day one, how did we know from running the program that they would matter? Did we just guess?

The answer is: we knew where to optimize in X-Plane 8 from the performance of X-Plane 7. That is, we use the design-induced performance limitations of a given revision to direct our development into the next version. This is just opportunistic optimization combined with code refactoring.

Tuesday, April 10, 2007

C++ Objects Part 8: The Cost of Virtual Methods

In my last post I covered dynamic casts and ambiguous base classes. This will be my last post on C++ object memory layout: the cost of using objects.

In truth I will only look at the cost of method calls - I will operate under the assumption that using a C++ object is like using a struct from a data-access standpoint, which is mostly true; you can see from past posts that virtual base classes represent containment by pointer while non-virtual base classes are contained by, well, containment, so you can imagine the costs involved.

Before getting into this, one overarching note: in my experience working on X-Plane, effective function inlining is much more important to the rapid execution of performance-critical C++ code than any of the overheads here. In particular, all of these examples assume that inlining has not occurred. The loss of inlining is a much larger cost than the additional pointer references, glue calls, etc. that we'll see.

The compiler can only inline a function when it knows exactly what will be called. Any time we have polymorphism, that goes away, because runtime data determines the call type. These examples are done with inlining and optimization off (mostly to keep the assembly code sane), but basically no call that requires a memory access before the function call to fetch an address can be inlined.

Test Cases

I wrote a stub of code to look at free functions and function pointers (to look at the cost of call-via-memory), virtual functions, and virtual base classes. In all cases I wrote a stupid stub program and let CodeWarrior disassemble the results.

Here's the C++:


static void an_func() { }

class foo {
public:
void non_virt() { }
virtual void virt() { }
};

class bar : public virtual foo { };


I have eliminated any arguments and return types, which avoids a lot of stack setup code that would otherwise make the examples harder to read in assembly. (We can't all be Larry Osterman. I agree with him 100%, but I suspect many programmers don't speak x86 as their first language.)

Here's the test environment:

foo f;
bar b;
void (* anfunc_f)() = an_func;
foo * ff = &f;
bar * bb = &b;

The optimizer is off - in real production code the optimizer may be able to share common setup costs for function calls. This exercise should demonstrate the total number of memory accesses required from a "cold start" to perform the function call.

Also please note (per my earlier post, http://hacksoflife.blogspot.com/2007/04/wrong-side-of-war.html): my x86 assembler sucks, so in some cases my explanation of what's going on may be sorely lacking.

Direct Function Calls:

Let's start with the simple case - a direct function call that hasn't been inlined.

C++:

an_func();

PPC:
0000001C: 48000001 bl .an_func__Fv
x86:
0x0000002c: E800000000 call ?an_func@@YAXXZ (0x31)


This is the simplest case - a direct function call to a known address - the destination is encoded in the program stream. In terms of my analysis I would say that this jump has a cost of "zero" memory accesses. Obviously the destination address comes from memory, but because it's in the program instruction stream, I expect that by the time we know we're going to branch, we hopefully know where to - no separate access to memory is needed.

Function Call Via Function Pointer

Before looking at this case, a brief tangent on the Mac: OS X for PowerPC, when using the CFM ABI (which happens to be the one I decompiled), uses an indirect function-pointer system based on "t-vectors". Basically a function call outside of a single shared library (DLL) goes through a "t-vector", which is a pointer to the function pointer. The t-vector structure also contains additional state that is swapped into registers as we make the call, effectively switching some runtime context per DLL. (This context is used to manage relocatable global variables - I'll comment on this more some other time.)

The reason I mention it is that a function call outside a DLL is more expensive than one inside a DLL for PPC CFM code. Why do I mention that? Because if we don't know what kind of function call we are going to make, the compiler has to use the expensive kind just in case we're jumping "long distance" into another DLL.

So starting with this case, we're going to start seeing a jump to "ptr_glue", which is a little assembly stub function that does the context switching, that is part of the C runtime.

Given the death of CFM and of the PPC on PCs, is any of this relevant? Only in this: if an ABI has both low and high cost function calls based on the amount of known information, calls through function pointers are going to involve the worst case, most conservative call mechanism (unless some kind of compiler intrinsic adds typing info to function pointers). So when we go to function-pointer calls, we may pick up some ABI overhead. In the case of CFM, the ABI overhead for a "cross-library" call is an extra memory indirection.

In this one case I'll list the PPC and Mach-O code:

C++:

anfunc_f();
PPC (CFM):
00000030: 7FACEB78 mr r12,r29
00000034: 48000001 bl .__ptr_glue
00000038: 60000000 nop

PPC (Mach-O):
00000030: 7FCCF378 mr r12,r30
00000034: 7D8903A6 mtctr r12
00000038: 4E800421 bctrl

x86:
0x00000031: FF55F0 call dword ptr [ebp-0x10] near


A few notes on this:
  • The x86 code is actually the most intuitive here - a subroutine call with indirect addressing via a memory location somewhere on the stack (or wherever ebp is aimed right now).
  • Because the PowerPC has so many registers, for very small snippets (or a very good compiler), local variables almost always happen to be in registers near the time they're needed. So while you can't see the memory load in the PowerPC code (r29 and r30 just happen to have our function pointer in them) they certainly didn't get loaded for free. The compiler has basically just made my local "anfunc_f" variable a "register" variable.
  • The PPC will, under some conditions which I've forgotten and am too lazy to look up, use the instruction slot immediately after a branch; in some cases the compiler must insert a nop into this slot if it can't be used. I will omit the nops from now on, as they don't tell us much.
  • ptr_glue is our C runtime code that will make our "long distance" call. But a bl to ptr_glue with the destination address in r12 is approximately like a subroutine-call to the address stored in r12. (x86 hackers, the general purpose registers for PPC are named r0-r31 - they're all the same and all hold 32-bits on the 32-bit PPC chips).
The cost of a call to a function pointer is thus:
  • One extra memory access.
  • Any cost for a "long distance" function call based on the ABI.
That memory access might be notable because the memory access must be completed before the instruction stream can be continued. I don't know how well modern CPUs can pipeline this of course, but a cache miss could be expensive.

Non-Virtual Method

Non-virtual methods are basically free functions with some compiler renaming and an implicit pointer to the object as their first argument. (That's how they know what "this" is.) So this will look almost the same as our free function call except that we're going to see the stack get set up a little bit. These functions are fully known so they are candidates for inlining.

(Another PPC hint: arguments on the PowerPC are passed by register starting with r3 when possible.)

C++:

ff->non_virt();
PPC:
00000020: 7FC3F378 mr r3,r30
00000024: 48000001 bl .non_virt__3fooFv
x86:
0x00000034: 8B4DF8 mov ecx,dword ptr [ebp-0x8]
0x00000037: E800000000 call ?non_virt@foo@@QAEXXZ (0x3c)



So - the cost of a non-virtual method is comparable to a direct call to a free function.

Virtual Method Call, Fully Known Object Type

Virtual functions are essentially function pointers that live in a side-table. But if we make a virtual function call and the compiler is 100% sure at compile time of the type of the object, it isn't required to call through the vtable. (Calling via the vtable doesn't have any effect on program correctness as long as we end up calling the right function.)

C++:

f.virt();
PPC:
00000028: 80620000 lwz r3,f(RTOC)
0000002C: 48000001 bl .virt__3fooFv
x86:
0x0000003c: B900000000 mov ecx,offset ?f@@3Vfoo@@A
0x00000041: E800000000 call ?virt@foo@@UAEXXZ (0x46)


So the cost of a virtual method may be as low as a non-virtual method. But I wouldn't count on this. For example, CodeWarrior will generate this optimization for a virtual method defined in a base class and called on the base class, but for a reason that I am not sure of (probably I've missed some subtlety of C++) it won't do so for a virtual method defined in a base class and called on a derived class that does not override it.

Virtual Method Call Via Object Pointer

This is a true virtual method call - when we have a pointer to the object, we don't know what the runtime type will be, and we actually have to use those vtables.

Truly a virtual function call - since we only have a ptr, the compiler doesn't know the real type. Now we start to pay a little more for virtual functions. Walking through the PPC code listed below:
  • Transfering r30 to r3 - this is copying the "this" pointer.
  • Copying whatever is pointed to by r3 to r12. This is dereferencing the first word of the object to get the address of its vtable. One memory access.
  • Copying whatever is pointed to by r12+8 to r12. This is going to the vtable and pulling up the 3rd ptr. (Before our virtual function is the type-id and offset info.) Two memory accesses.
  • We see pointer-glue again here. This is just the Mac's idea of a far function call.
So - the cost in total of using a virtual function via a pointer to the object is:
  • Two extra memory reads.
  • Any cost of far function calls.
So a virtual method is more expensive than a function pointer, by one memory indirection. If you wanted to optimize this out, you'd put function pointers into the object itself, and pay for the decreased memory chaining with larger object size. The vtable is a layer of indirection.

C++:

ff->virt();
PPC:
00000030: 7FC3F378 mr r3,r30
00000034: 81830000 lwz r12,0(r3)
00000038: 818C0008 lwz r12,8(r12)
0000003C: 48000001 bl .__ptr_glue

x86:
0x00000046: 8B55F8 mov edx,dword ptr [ebp-0x8]
0x00000049: 89D1 mov ecx,edx
0x0000004b: 8B31 mov esi,dword ptr [ecx]
0x0000004d: 89D1 mov ecx,edx
0x0000004f: FF16 call dword ptr [esi] near


To be perfectly honest, I'm not really sure why the x86 code looks like it does. Reading these you might think we've got more memory accesses in the x86 code, but that's due to calling conventions - the PPC code can handle the argument entirely in registers.

Either way we've got two memory fetches:
  • One to fetch the vtable ptr from the start of the object pointer.
  • One to fetch the actual function ptr, eight bytes into the vtable (skipping the type-id and offset fields, each 32-bits wide).
So the cost of a virtual function call is two memory accesses and any long function call overhead. In other words, virtual function calls cost one more memory access than function pointers. (On the other hand, we save memory by not having the function pointer in every single object.)
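For illustration, here's a sketch of that trade-off - a hypothetical layout, not how any C++ compiler actually works:

struct fat_obj {
	void (* virt)(fat_obj * self); // function ptr stored inline in the object
	int   data;
};

void call_virt(fat_obj * o)
{
	o->virt(o); // one memory read (o->virt) instead of two (vtable ptr, then slot)
}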

Virtual Method Call With Virtual Base Class

When we have a virtual base, and our pointer type is of the derived class, we pick up one more memory access. The compiler must convert our derived pointer into the virtual base type, which requires following a pointer - a memory access.

(We will need the "this" pointer to be adjusted to the base type when we make the method call. The effective argument type of the virtual function must be of the most basic class that defines the virtual function, because this is the only type the "this" pointer could have that client code could set up consistently everywhere.)

I mention the need of the "this" pointer because it occurred to me that if the vtable of a derived class contained all of its base class virtual functions, perhaps we could avoid having to cast first, then start the virtual function. (My idea was: if we could do these memory lookups in parallel, instead of fetching from the last pointer we read, we wouldn't be serialized trying to get data out of memory.)

But alas my idea doesn't work. Remember that the layout of the vtable of a class with a virtual base will be controlled by the full class, and we have no idea what full class "bb" is. So if we wanted to start providing duplicate virtual method entries in bb's vtable, we would have to duplicate every virtual function in the virtual base for every single intermediate derived class. Suffice it to say, the vtables would get somewhat unwieldy. (Speculation: this blog article emphasizes the cost of memory indirections, but perhaps the real cost is cache misses. So keeping the code small and tight and having vtables be small is probably a win. Better 3 memory fetches to cache than 2 to RAM!)

C++:

bb->virt();
PPC:
00000044: 807F0000 lwz r3,0(r31)
00000048: 81830000 lwz r12,0(r3)
0000004C: 818C0008 lwz r12,8(r12)
00000050: 48000001 bl .__ptr_glue
x86:
0x00000016: 8B45F8 mov eax,dword ptr [ebp-0x8]
0x00000019: 8B5004 mov edx,dword ptr [eax+0x4]
0x0000001c: 89D1 mov ecx,edx
0x0000001e: 8B31 mov esi,dword ptr [ecx]
0x00000020: 89D1 mov ecx,edx
0x00000022: FF16 call dword ptr [esi] near


So the cost of a virtual method call through a base class is three memory indirections and a long distance function call - one more indirection than if the base is non-virtual.

Final Thoughts: Examining the ABI

There were three techniques I used to find this information:
  1. The C++ runtime for CodeWarrior is available in source, so I was able to examine the source code and its comments.
  2. I disassembled simple code and also dumped run-time memory for simple code.
  3. In CodeWarrior, at least, you can replace run-time functions with your own code if you know the "secret" names of the helper functions; with the right compiler settings, you get a link warning, not a link error. This allowed me to add a lot of tracing to the runtime pretty easily.
(If CW 8's debugger worked on OS X 10.4, which it does not for me, I could have stepped through the code and runtime and dumped memory from there. But the debugger is really only a convenience tool for the two true debugging methods, printf and /* */.)

Monday, April 09, 2007

The wrong side of the war

I picked the wrong side...I've done a lot of work on PPC hardware, I think in Big Endian, and I can read PPC assembly. With Apple's move to x86 hardware, how is that useful??

I thought this link provides a nice intro into stack frames and register usage in x86 assembly. This is very important - the x86 has so few registers that code that looks clean and simple on the PPC gets quite obtuse as things get juggled around. Between that, 453 addressing modes and 102,249 instructions, x86 assembler can look like the mad scribblings of Russell Crowe in "A Beautiful Mind".

Any idea what this does? :-)

0x00000000: 55 push ebp
0x00000001: 89E5 mov ebp,esp
0x00000003: 56 push esi
0x00000004: 83EC04 sub esp,0x4
0x00000007: B8CCCCCCCC mov eax,0xcccccccc
0x0000000c: 890424 mov dword ptr [esp],eax
0x0000000f: C745F800000000 mov dword ptr [ebp-0x8],offset ?b@@3Vbar@@A
0x00000016: 8B45F8 mov eax,dword ptr [ebp-0x8]
0x00000019: 8B5004 mov edx,dword ptr [eax+0x4]
0x0000001c: 89D1 mov ecx,edx
0x0000001e: 8B31 mov esi,dword ptr [ecx]
0x00000020: 89D1 mov ecx,edx
0x00000022: FF16 call dword ptr [esi] near
0x00000024: 8B1504000000 mov edx,dword ptr ?b@@3Vbar@@A+0x4
0x0000002a: 89D1 mov ecx,edx
0x0000002c: 8B31 mov esi,dword ptr [ecx]
0x0000002e: 89D1 mov ecx,edx
0x00000030: FF16 call dword ptr [esi] near
0x00000032: B800000000 mov eax,0x0
0x00000037: 8D65FC lea esp,dword ptr [ebp-0x4]
0x0000003a: 8B3424 mov esi,dword ptr [esp]
0x0000003d: 8B6C2404 mov ebp,dword ptr [esp+0x4]
0x00000041: 83C408 add esp,0x8
0x00000044: C3 ret near

Friday, April 06, 2007

C++ Objects Part 7: Dynamic Casts and Ambiguous Bases

Continuing with our discussion of how C++ objects are implemented in CodeWarrior, let's look at dynamic casts when an ambiguous base class is present.

Because a dynamic cast always down-casts to the full class, and then goes back up, a cast that would be trivial for a static cast becomes more complex for dynamic cast, and can introduce ambiguous base cases that otherwise would not be an issue. But you really have to work to see this. Here is a class hierarchy that requires run-time resolution of a dynamic cast with an ambiguous base:

class R { };
class A : public R { };
class B1 : public A { };
class B2: public A { };
class D: public B1, public B2 { };

So what do we have to do to get an ambiguous base class dynamic cast?

First declare an instance of D.

Then cast a pointer to D to B1, then R. (We must cast to R via B1 because the up-cast from D to R is ambiguous - we have two equally valid copies of R within D.)

Now dynamic cast from our pointer to R to a pointer to A. What happens?

Not what you think? You might think that since A has one copy of R, we simply update the pointer and go home, but the dynamic cast must do runtime checks to make sure that we really are an R that is embedded in A. So the dynamic cast operator casts R all the way down to its full class (D) and then tries to up-cast from D to A, which is ambiguous. Oops!

The type-info structure has enough information to manage this. Every type-info structure has an array of all base classes (both direct and indirect).

An ambiguous base class has a flag, indicating that it is ambiguous (it will be listed multiple times) and an integer indicating how many of the following classes are all accessed via the ambiguous base class.

(This implies that we have to iterate over the base class array because it is not a true array - the items are variable size!)

There are a lot of restrictions on accepting an ambiguous base class as we search the list of base classes for a type during a dynamic cast. We must meet three conditions:

- We can only accept a base class if it is the same as the destination type we are casting too.
- We can only accept a base class if its offset in the struct matches the offset we used to get from our initial pointer to our full type.
- We can only accept a base class if our starting type is a parent class of this particular base class.

What that means is: we can only cast to an ambiguous base directly from one of its parents. If we do, we need to make sure we have the right copy of this ambiguous base, which is why we check not only the type name, but the offset to this base from the full object (which disambiguates it).

(How is this possible when this layout will vary from full class to full class? Well, remember that EVERY base that EVER goes into a class is part of the casting table for a type-ID, and you never have to use more than one type-id structure to perform a cast. So when we say "class D has two A's, one at 0 bytes and one at 16 bytes", that info is truly only in the typeID for class D. For some other class (also with A) we'd have a totally different set of casting info. In other words, the casting info is effectively contained WITHIN the typeID so that it can be unique to this FULL class.)

Note one case that won't cast for us:

R1
R2
A: R1, R2
B1: A
B2: A
D: B1, B2

Given obj of type D and a ptr via R1 (through B1), we can't dynamic cast from R1 to R2. Why? When we iterate through the ambiguous bases (A) we haven't STARTED on an A so we don't find the perfect match to the A from B1. Thus we fail with a bad cast!

(Is this right? I think so. As far as I can tell from the C++ spec, dynamic cast must succeed when we explicitly down-cast and the full class has some ambiguity in it, but cross-casts do not have to succeed. So casting from R1 to A is legal, but casting from R1 to R2 is not. This isn't hugely surprising, and I suspect that if your design depends on this cast, you need to reconsider the design!)
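For reference, a compilable sketch of both casts - I've added virtual destructors, which dynamic_cast requires and which the shorthand above omits:

class R1 { public: virtual ~R1() { } };
class R2 { public: virtual ~R2() { } };
class A  : public R1, public R2 { };
class B1 : public A { };
class B2 : public A { };
class D  : public B1, public B2 { };

int main()
{
	D d;
	R1 * r1 = static_cast<B1 *>(&d);  // R1, reached via B1
	A  * a  = dynamic_cast<A  *>(r1); // down-cast: legal, finds B1's A
	R2 * r2 = dynamic_cast<R2 *>(r1); // cross-cast through the ambiguity: NULL
	return (a != 0 && r2 == 0) ? 0 : 1;
}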

Friday, March 23, 2007

Debugging Memory Value Meanings

The Visual Studio debug runtime automatically fills certain memory with magic values depending on how it was allocated (stack based vs. heap based) and its state (uninitialized, freed, etc.). These values become very useful during debugging because you can tell what went wrong when an exception is thrown. Here are some common values and their meanings:

  • 0xfeeefeee - Heap memory that has been freed (the OS heap fills freed blocks with this pattern)
  • 0xcdcdcdcd - Uninitialized heap variable (think of the "C" as Clean memory)
  • 0xcccccccc - Uninitialized stack variable (think of the "C" as Clean memory)
  • 0xfdfdfdfd - Padding around an array (think of the "F" as Fences around your variable)
  • 0xdddddddd - Released heap variable (delete or free) (think of the "D" as in Dead)

Friday, March 16, 2007

C++ Objects Part 6: Ambiguous Base Classes

An ambiguous base class is a base class that is included in a derived class twice. Because you can't just derive from a class twice, this usually happens when two parent classes both derive from the same class without virtualizing that common base.

The storage implications of ambiguous base classes are relatively simple. Because each parent class's memory layout contains the base, we know where the two bases will be in memory - we simply apply our pattern for inheritance via containment recursively.

class R { };
class B1 : public R { };
class B2 : public R { };
class D : public B1, public B2 { };

gives us

struct R {
vtable * vt;
};

struct B1 {
R base;
};

struct B2 {
R base;
};

struct D {
B1 base1;
B2 base2;
};

Thus we have two instances of R: obj.base1.base and obj.base2.base reference each copy of R.

Static casts are casts that can be computed knowing only compile-time information - that is, casts that don't require knowing the full class of the casted object.

A static cast from D to R is illegal because it is ambiguous. Do we want R via B1 or B2? We don't know, and because the cast is static the compiler knows we don't know at compile time and generates an error.

If you cast to B1 first and then R it is legal - you have disambiguated. (This should not be surprising - if we simply had an object of type B1, casting to R would work! Remember, memory layout doesn't change with type!)
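Using the classes above, a quick sketch:

D obj;
// R * r = static_cast<R *>(&obj);                  // won't compile: ambiguous
R * r = static_cast<R *>(static_cast<B1 *>(&obj));  // legal: B1 disambiguates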

Static casts give us compile-time checking of all casts for ambiguity - we always know what we are getting. Next I'll cover dynamic casts and ambiguous base classes, which are more complex.

Wednesday, February 28, 2007

C++ Objects Part 5: Virtual Base Classes

A virtual base class is a class that is included only once in your derived class no matter how many base classes refer to it. How does this happen?

The answer is two-fold:

First, the actual layout of a class with a virtual base depends on the full class. Consider this case:

class R { };
class B1 : public virtual R { };
class B2 : public virtual R { };
class D : public B1, public B2 { };

In this case R is a virtual base that is included twice in D. The location of R will be different for B1, B2 and D. Each time one of these classes is set up, R is placed in the class once. Since R cannot be in D twice, clearly the location of R relative to B1 and B2 can't be the same when B1 or B2 is the final class vs. D. (If both B1 and B2 took R with them, we'd have two copies in D, which is not acceptable!)

To get around this, we hit the second part of our answer: classes contain pointers to their virtual bases. With non-virtual containment, bases are contained, e.g. a fixed chunk of the derived class's memory is used for the base class. With a virtual base, the derived class simply contains a pointer to the base class. This lets us move R around and adjust the pointer to R (in B1 or B2) at runtime in a way that lets us use different memory layouts for R with various classes that contain B1 or B2.

It should be noted that in D, while there is only one copy of R, there are multiple pointers to R - one in B1 and one in B2. This is what assures that the memory layout of B1 and B2 doesn't change when they are used in a derived class; only the contents of those pointers change. The compiler generates code to set up these pointers for us.

For CodeWarrior, pointers to virtual bases come before the pointer to the vtable. This means that to find the vtable we have to know how many virtual bases a class has. Fortunately this information is dependent only on class info that is available at compile time - it never changes.

Here's a C version of what we have above:

struct R {
R_vtable * vtbl;
// R data
};

struct B1_partial {
R * vbase;
B1_vtable * vtbl;
// data for B1
};

struct B1_final {
B1_partial self;
R vbase;
};

struct B2_partial {
R * vbase;
B2_vtable * vtbl;
// data for B2
};

struct B2_final {
B2_partial self;
R vbase;
};

struct D {
R * vbase;
R vbase_data;
B1_partial base1;
B2_partial base2;
};

One of the interesting things to note here is that B1 and B2 as included in D are not as large as B1 and B2 when used by themselves. This is because B1 and B2 must have storage for the virtual base R at their end when used by themselves. But when used in D, that storage is provided by D.

Now we can see why dynamic cast first goes to the most derived class. Given a ptr to R we have no idea where the rest of the object is -- the position of R varies with the actual final type - that is, the relative position of R in the object is not the same relative to B1 if the real object is B1 or D.

Fortunately the vtable of any object tells us where the final object's real start is. So when R is a virtual base, depending on its final type, it will refer to a vtable that takes into account the actual full class's layout, letting us recover the full type and figure out the true relationship between the virtual base and the whole object.

Note that a static down-cast from a virtual base will cause a compiler error. This is the compiler telling us that it can't even speculate what the memory layout of the virtual base in the final class might be without inspecting the real type at runtime. But dynamic down-casts and even cross-casts are legal, since all dynamic casts are a maximal down-cast followed by a possible up-cast.

Tuesday, February 27, 2007

C++ Objects Part 4: typeid and casts

Now that we have described how multiple inheritance in CodeWarrior C++ works, we can look at how casting and type works in C++. We needed to discuss multiple inheritance first; recall from part 2 that for single inheritance, the "front" of an object in memory is its base. With this system, no work is needed at all to cast when we have only one base class.

For the purpose of this discussion, "full class" means the actual class of an object when it is instantiated - in other-words, the "biggest" downcast we can apply - it's real class.

Please also remember from our previous post that an object that inherits from two base classes has at least two vtables with different layouts. Clearly a pointer to a vtable is not a unique identifier of an object's class, if one object can have two different vtable pointers of different values.

So instead CodeWarrior creates a separate static structure for each class that uniquely identifies it: a type_info structure. This shouldn't be surprising - this is the same structure that you can use with the typeid operator. That is if you do this:

D obj;
const std::type_info& t = (typeid(obj));

"t" is a pointer to the internal type_info structure for class D that is used by the runtime for all operations that require typing. (t is a pointer - it looks like a reference in the C++ code, but in truth they are one and the same to the compiler.)

Every vtable starts with a pointer to a type_info structure, identifying the full type of an object. When an object has two vtables (as in our multiple inheritance case), both vtables point to the same type_info structure. No two classes will ever share a vtable (even if a derived class never overrides any member functions) so this is okay. Because both vtables point to the same type_info, we can get our full runtime class for an object no matter which base we have a pointer to.

Static Casts

With this in mind, let's look at static casts. A static cast is a cast between C++ class types whose effects are fully known at compile time. That is, the compiler uses compile-time type information to change the type (and possibly the pointer value) for an object. The compiler looks at the layout in memory of the base class and the derived class and adjusts the pointer as much as is needed. (When we have only one base class, a static cast will not need to adjust the pointer at all, because the base class is always first in memory.)

There are some casts that cannot be performed statically at all, but we will get to this later. Also, we can't check whether the cast is safe at compile time. If we have a pointer to X that we cast to type Y but the object is not actually of type Y, the compiler will not care and we may crash later.

Full Class and Offsets

When we first introduced the layout of a vtable we mentioned that there is an "offset" right after the type_info. But what is this offset?

The offset in a vtable is the number of bytes that a pointer to an object whose type uses this vtable must be adjusted to find the full class of the object.

So in our case where we have two bases, each embedded somewhere in a class, the vtable of each base contains the offset to recover the full object. Note that one vtable will probably have an offset of 0, and this vtable is the one we will use for the "full" class, so that we never adjust the pointer when we already have the full class.

(As far as I can tell, CodeWarrior layers all static vtables for a class consecutively, so when given a pointer to the first one, it can actually generate calls into any vtable, because their relative positioning is fixed.)

Dynamic Casts

A dynamic cast converts the type of an object using run-time information. The cast returns NULL if the object can't be viewed as a given type. Since dynamic casts look at the object at run-time and can see the full class of the object, they can perform casts that we can't do using only compile-time information.

(Consider a DLL that returns an object of some unknown type, but as a pointer to a base type. Without run-time type information, the compiler has no idea what real type that object is, because it doesn't have the source code. In practice even if we did have the code, the compiler can't track that object's pointers through all of the strange things C++ can do to memory.)

A dynamic cast works by examining the type_info structures of an object and using that information to perform the cast. So to understand dynamic casts, we'll need to look at the type_info structure.

The type_info structure contains a pointer to a static string that names the class - all dynamic casts will use class string-name comparisons to find equivalent type. It then contains a pointer to a "null-terminated" variable-length array of info about base classes. Each item in this array contains a pointer to the type_info of a base class (if this pointer is null, that indicates null termination) and an offset indicating how much the pointer to the full object must be adjusted to cast to this type.

It should be noted that this list of base classes contains all bases, not just immediate bases! In other words, if we have a class Z that derives from Y and Y derives from X, the array of base classes for Z contains X and Y! This means that with a single linear search of the list, we can find any possible base. We do not have to search the entire tree.

(As a side note, each type_info's array contains pointers to parents, so you can follow the chain of inheritance from type_info to type_info. As far as I can tell in CodeWarrior there is no recording of which bases are "immediate bases", as you never need this info in C++. Perhaps the order of the bases in the list would tell us but I am not sure.)
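Putting this together, here's a rough C sketch of the type_info structure (following the type_id naming used in the C sketches in the posts below; the field names are my guesses, not CodeWarrior's actual ones):

struct base_info {
type_id * base; // NULL here marks the end of the array
int offset; // how much to adjust the full-object pointer to cast to this base
};

struct type_id {
const char * name; // static string naming the class - compared by string
base_info * bases; // ALL bases, not just immediate ones
};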

Dynamic cast is therefore a two-step process:

1. Given a pointer to an object, use the offset in its vtable to recover a pointer to the full class. (All further adjustments are made from this pointer.) This is the equivalent of down-casting to the full class.

2. Search the type_info of the full class for the type we want - in other words, go through a list of all bases. If we find one, use the offset to adjust the pointer again. If the search in step 2 fails, return NULL.

(I am leaving out ambiguous base classes - we'll cover that later!)
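Here's a rough C sketch of both steps - my reconstruction, not CodeWarrior's actual runtime code. (vtable_header stands for the type-id pointer and offset that sit at the front of every vtable.)

#include <string.h>

struct vtable_header {
type_id * type_id_ptr;
int offset;
};

void * dynamic_cast_sketch(void * obj, const char * want)
{
vtable_header * vt = *(vtable_header **) obj;
char * full = (char *) obj + vt->offset; // step 1: recover the full class
type_id * ti = vt->type_id_ptr;
if (!strcmp(ti->name, want)) return full; // we want the full class itself - done
for (base_info * b = ti->bases; b->base != NULL; ++b) // step 2: one linear search
if (!strcmp(b->base->name, want))
return full + b->offset; // adjust again, down to the base we want
return NULL; // the object cannot be viewed as this type
}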

Monday, February 26, 2007

C++ Objects Part 3: Multiple Inheritance

In my previous post I described how single inheritance works with the CodeWarrior C++ compiler - by placing the superclass as the first item in the subclass in memory (and pulling the same trick for the vtable). Now to something more complicated: multiple inheritance!

One thing is clear: we can't do the same trick for multiple inheritance as we can for single inheritance...only one superclass can be first!

The trick: change the pointer when we change the "class" of the pointer. We simply stick both bases into our subclass, but when we cast from the derived class to the second base, we move the pointer in memory to point to the second base class (since it isn't at the same address).

You can see this for yourself with a program like this:
#include <cstdio>
class a { int a_; };
class b { int b_; };
class c : public a, public b { };
int main()
{
c obj;
printf("a=%p, b=%p, c=%p\n", (void*)(a*)&obj, (void*)(b*)&obj, (void*)(c*)&obj);
}

When we print out pointers to our object cast to each type, we get something like this:
a=0xbffff460, b=0xbffff464, c=0xbffff460

Note that when we cast to b, the address our pointer holds moves! This is because c contains a first in memory, then b second.

So we've at least solved the problem for object data: we'll put the base classes into the derived class sequentially and adjust the pointer any time we change the type, so that we point to the base WITHIN the derived class no matter where it is. This meets our rule that a pointer to the base must look like the base. It does because we move the pointer until it does.

Now let's apply the rule to vtables. We must have a pointer to a vtable as the first item in an object. So if the pointer to an object can point to either of two memory locations (the start of the first or second base) then we can't escape logic: we need TWO pointers to vtables!

The format of a vtable depends on the class we think we have; the contents of a vtable depend on the class we really have. So when we have multiple inheritance, we will have multiple vtables, so that each one can be formatted based on the base class (but filled with function pointers from the derived class).

This example will get lengthy, so I will abridge it a little bit...

class B1 { virtual void b1(); };
class B2 { virtual void b2(); };
class D : public B1, public B2 { virtual void b1(); virtual void b2(); virtual void d(); };

In C we might have this:

struct B1_vtable {
type_id * type_id_ptr;
int offset;
// ptrs to B1 virt funcs
};
struct B2_vtable {
type_id * type_id_ptr;
int offset;
// ptrs to B2 virt funcs
};
struct D_vtable {
B1_vtable parent1;
B2_vtable parent2;
// ptrs to D vfuncs
};

Because D contains the first base's virtual function table first, the "header" information (type-id and offset) for the first base is used for the derived class. This will be important later. Because all of the virtual functions are consecutive in memory, when we have the derived class we can easily find any virtual method's function pointer.

(So when I say we have two vtables really we have one big vtable with two vtables embedded inside it.)

void B1_b1() {}
void B2_b2() {}
void D_b1() {}
void D_b2() {}
void D_d() {}

static type_id B1_type_id = { ... };
static type_id B2_type_id = { ... };
static type_id D_type_id = { ... };

// Virtual functions for B1:
static B1_vtable sB1_vtable = { &B1_type_id, 0, B1_b1 };
// Virtual functions for B2:
static B2_vtable sB2_vtable = { &B2_type_id, 0, B2_b2 };

So far there are no surprises here, until we look at the virtual function tables for D:

// Virtual functions for D
static D_vtable sD_vtable = { { &D_type_id, 0, D_b1 }, { &D_type_id, -sizeof(B1), D_b2 }, D_d };

Here we have a vtable that contains two vtables within it. Thus we have two pointers to our type-IDs. If we have a pointer to the second embedded vtable, it looks like the vtable we'd use when viewing our object of type D as a B2. Remember that the pointer to the object changes as we cast, and that controls which vtable is used. (Whenever we call the virtual method b2(), we have a B2*, so our object pointer has been adjusted to point at the second vtable pointer, where the b2 slot holds D_b2.)

This is the first time the 'offset' parameter of the vtable isn't 0. I'll explain this in the next post, but for now trust that, since B2 is not the first parent class inside D, it needs a non-zero offset.

struct B1 {
B1_vtable * vtable;
// data for B1
};

struct B2 {
B2_vtable * vtable;
// data for B2
};

struct D {
B1 base1; // base1.vtable inited to &sD_vtable.parent1
B2 base2; // base2.vtable inited to &sD_vtable.parent2
// derived data for D
};

When D is initialized, B1's vtable pointer is inited to one value and B2's vtable pointer is inited to another value.
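To make that concrete, here's what a virtual call through a B2* might boil down to in the C model (ptr_to_b2 is a slot name I've invented for the function pointer inside B2_vtable):

D d; // construction points base1.vtable and base2.vtable at the two halves of sD_vtable
B2 * p = &d.base2; // the C equivalent of casting D* to B2* - the pointer moves
p->vtable->ptr_to_b2(p); // reads sD_vtable.parent2 and calls D_b2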

Saturday, February 24, 2007

C++ Objects Part 2: Single Inheritance

In my previous article I described the basic layout of a C++ object in Metrowerks CodeWarrior, with a pointer in the object to a static table of function pointers and a class description structure. Now we can look at how single inheritance is implemented.

Simple inheritance is, well, simple: the derived class simply contains the base class at the beginning, with new data later in memory. The same thing is true for the vtable - new entries are tacked on at the end.

(Please note that this is not true for virtual base classes - I'll get to that later.)

This meets our rule: a pointer to X must be laid out the same. Since derived class Y contains X at its beginning, the first bytes of Y look like an X.

Here's subclassing from the last post...

class bar : public foo {
virtual void b(float); // override
virtual void new_method(int);
char * new_member;
};

struct bar_vtable {
foo_vtable parent;
void (* new_method_func)(int);
};

void bar_b(float);
void bar_new_method(int);

static type_id bar_type_id = { ... };
static bar_vtable sbar_vtable = { { &bar_type_id, 0, foo_a, bar_b }, bar_new_method };

struct bar {
foo parent; /* superclass goes first. */
char * new_member; /* then new members. */
};

Note that a new vtable is made for the derived class, which may contain a mix of functions - here we have the "a" virtual function from the superclass, an overridden "b" function (bar_b, an override, goes into a slot in foo's vtable), and then a new virtual function at the end.

In fact, bar_b sitting in a slot in a foo_vtable is where the "magic" of overrides happens. This is what lets us call a method through a foo * and get a method from a bar object. Because an object points to its class's vtable, we get that class's behavior.
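To make the payoff concrete (assuming foo's methods were declared public):

foo * f = new bar;
f->b(1.0f); // f's vtable pointer refers to sbar_vtable, whose "b" slot holds bar_b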

Friday, February 23, 2007

C++ Objects Part 1: Basic Object Memory Layout

I spent a few minutes dissecting the C++ internal runtime structure for objects within Metrowerks CodeWarrior. There are some internal guts that C++ generates when you make objects; the format is not standardized between compilers, but looking at one implementation reveals some of the costs and implications of C++ objects and inheritance.

C++ breaks objects with methods down into data structures and functions, with function pointers used for virtual functions. Polymorphism depends on our ability to apply fixed, unchanging code to diverse data structures and get meaningful execution. Therefore we can immediately postulate one rule that applies to all objects, no matter what C++ feature is used (inheritance, multiple inheritance, virtual base classes, etc.): the memory that a pointer to class X points to must have a fixed layout, no matter what real derived type the object has.

I'm only going to describe the case where objects have virtual functions and run-time type information. Based on the actual code and compiler settings, the compiler may sometimes eliminate these features, changing the in-memory structure, but that's a topic for another blog entry.

CodeWarrior (like virtually all C++ compilers, no pun intended) implements virtual methods via virtual function tables (vtables), which are static structures containing one function pointer for each virtual function. An object has a pointer to its class's vtable before any other object data; there is only one copy of the vtable per class - it's immutable and shared. (This makes objects smaller than having pointers to each virtual function in the actual object.)

The virtual function table actually contains two more entries before the function pointers: a pointer to a "type ID" structure, a separate structure that describes the class itself, and a memory offset whose use we'll look at later. So if we have this C++ code:

class foo {
virtual void a(int);
virtual void b(float);
int c;
float d;
};

The equivalent C code would look something like this:

struct foo_vtable {
type_id * type_id_ptr;
int offset;
void (* ptr_to_a)(int);
void (* ptr_to_b)(float);
};

void foo_a(int) { }
void foo_b(float) { }

static type_id foo_type_id = { ... };
static foo_vtable sfoo_vtable = { &foo_type_id, 0, foo_a, foo_b };

struct foo {
foo_vtable * vtable; // gets inited to &sfoo_vtable
int c;
float d;
};

Note that the class type is stored separately from the object - it is the first pointer in the vtable; the motivation for this will be clear later. The vtable also contains an offset - its use will also become clear later.

Virtual Base Class - Not Needed for Interfaces

I was going to argue that virtual base classes are logical for interfaces, because we really only need each interface once. E.g.:

class IDeletable {
public:
virtual void Delete()=0;
};

class IDrawable : public virtual IDeletable {
public:
virtual void Draw()=0;
};

class ISerializable : public virtual IDeletable {
public:
virtual void Serialize()=0;
};

class PersistentDrawing : public virtual IDrawable, public virtual ISerializable {
};

Turns out we don't need to virtualize the base classes. Virtualizing the base class is for when we would otherwise inherit multiple copies of the base class along different derivation paths and we don't want actual separate copies of the base class.

But Don Box points out in "Essential COM" that this isn't necessary. In particular, because the interfaces are truly empty - nothing but a virtual function table - there is no harm in having multiple copies of them in an object. The virtual functions will be overridden correctly even if that base is represented multiple times. (The memory layout for this is moderately surprising, but I'll comment on this some other time.)

Tuesday, February 20, 2007

Inherit the Wind

Usually I post here after concluding something, usually after a discussion. But this post is the part that comes first - groping with the underlying issues.

Unfortunately a number of my software engineering books are temporarily inaccessible. We packed our (unsorted) library when we moved a few months ago, and because we are redoing a large chunk of the house, everything is still in boxes. As if having to search linearly for any book wasn't demoralizing enough to a programmer, some of the boxes are on top of each other, making whole ranges very difficult to get to. My spirit is broken by O(N).

Anyway, the issue is: given a series of interfaces to a conceptual hierarchy and an implementation hierarchy, how do you implement this in C++? The following gets very ugly very fast:

class IPoint {
virtual Point2 Get()=0;
};
class IBezierPoint : public IPoint {
virtual Vector2 GetCtl()=0;
};

class WED_Point : public IPoint {
virtual Point2 Get(); // implemented - returns internal_pt
Point2 internal_pt;
};

If you've done this kind of design before, you can see how we're about to go straight off the cliff...

class WED_BezierPoint : public, um, um.

The problem is that if I inherit from my base implementation (WED_Point) and my full supported interface (IBezierPoint) I pick up two instances of the interface IPoint. This will not make C++ happy and is almost surely not what we want. (The result will be a bunch of "method not overridden" warnings and probably a lot of incorrect polymorphic behavior.)

There are only three ways to solve this in C++ and they all suck in different ways:

1. Don't inherit implementation. This is what I would tell anyone else - inheriting implementation is evil, bla bla bla. In other words, WED_BezierPoint derives only from IBezierPoint and reimplements storage for its base point. Because WED_BezierPoint supports the IPoint interface (by way of IBezierPoint) clients still get the appearance of polymorphism via the interface, which is what we really care about.

The only problem with this is that we end up recoding a lot of implementation. Inheritance of implementation is almost always a lousy idea, but in this case I can't help but think that it's giving us something similar to database normalization. Still, one could definitely argue that all algorithm code should be working with IBezierPoint or IPoint and thus the repeat-implementation is really a non-issue. So breaking the implementation hierarchy might be the least-skanky thing to do.

1a. It should be noted that we can make an implementation hierarchy and contain it in classes derived from the interfaces but not each other. This is a "bridge" pattern and will get us out of trouble in a lot of different design cases.

2. Don't inherit interface. In other words, IBezierPoint doesn't derive from IPoint. This solves the problem, but with a rather weird result: IBezierPoints might not be points at all.

You can argue that since these interfaces are meant to be run-time cast and checked anyway, this is okay, but I think we lose something here. It would be nice to require that all bezier points be points - that is what we mean, so having the compiler require it is nice. I'd have to categorize this fix as skanky.

(It should be noted that having independent interfaces is a good thing a lot of the time. Just maybe not when a very clear IS-A relationship exists.)

3. Make the interfaces virtual base classes. Do two C++ wrongs make a right? Probably not, but for the sake of argument, if all interfaces are virtual base classes then we can have the IPoint interface from two places at once.

One warning about this: if you make your interfaces virtual you need to subclass from them virtually in two places - both when one interface is derived from another and when a base implementation class derives from an interface. (The implementation derivation can be non-virtual - as long as all but one relationship is virtual, C++ will figure out what we mean, at least in CodeWarrior where I tested this.)
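Here's a sketch of option 3 applied to the point classes above (internal_ctl is a member I've made up for illustration):

class IBezierPoint : public virtual IPoint {
virtual Vector2 GetCtl()=0;
};

class WED_Point : public virtual IPoint {
virtual Point2 Get();
Point2 internal_pt;
};

class WED_BezierPoint : public IBezierPoint, public WED_Point {
virtual Vector2 GetCtl();
Vector2 internal_ctl;
};

WED_BezierPoint now contains exactly one IPoint sub-object, shared by both derivation paths.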

I suppose it should be noted that, as gross as using virtual base classes is, if you derive your C++ interfaces from some kind of common casting base (read: IUnknown) it has to be a virtual base class anyway.

Saturday, February 17, 2007

Prototypes - Your Annoying Friend

Depending on your C++ compiler (and its settings), you may or may not be required to provide a prototype declaration for a function before you use it.

You really want prototypes! Here's why:

According to some byzantine aspect of C/C++, if you call a function that isn't declared and your compiler doesn't require prototypes, the compiler will assume that the function (a) exists and (b) takes arguments exactly like what you are passing!

Now if the function doesn't exist, you don't need prototypes to help you - you'll get a link error later on saying "we never found this function". Prototypes are still better here, because it doesn't take a full build-and-link to find out.

But where things really get ugly is when the function does exist, but the arguments you pass require an implicit conversion. For example, imagine if your function goes like this:

void some_happy_func(float n);

Now you call it like this:

some_happy_func(2.0);

Without prototypes, this probably will cause a lot of pain. Why? Well, 2.0 is a double-precision argument! If some_happy_func's prototype exists, the compiler will create an implicit conversion. But if it doesn't, the compiler will assume some_happy_func takes doubles, and generate an incorrect stack (with a double on it). When some_happy_func actually runs, it will interpret the low 32 bits of that double as a float and all hell will break loose.
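With the prototype in scope, the same call is safe:

void some_happy_func(float n);

void caller()
{
some_happy_func(2.0); // the compiler now converts the double 2.0 to a float
}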

In my experience, when prototype requirements are off it's really easy to lose track of whether you've included a needed header, causing possibly subtle bugs where functions fail depending on the calling code's headers. If the failure causes a hard-to-detect bug, it can take a while to unravel.

Monday, January 29, 2007

Why Not Binary Blobs?

I am working on the data model for WorldEditor, the X-Plane graphical scenery editor. At risk of "life blogging" the development, the design decisions for WED illustrate some design ideas.

WED uses C++ objects to represent the user's data in memory (the "internal data model"). I'll comment at another time on why I made this decision. On disk, however, WED uses an SQLite database file. That's another blog post too.

So one must ask the question, why do we need an on-disk data-model at all? Why not just dump out the C++ object contents to a file?

One might say "because you can't write STL classes to disk verbatim due to their internal pointers and private structure." But...WED uses an object-based undo system that requires each object to know how to serialize itself to a buffer...this means that we've already written serialization code for all of those STL structures.

It would make development faster to just reuse the object serialization code, but the result would be a file format that is a side-effect of the implementation code. This isn't good if:
  • You want to edit the data from another application without creating code interdependencies, or
  • You want to refactor the code (which would cause the object layout to change), or
  • You want to read a subset of the data. (The in-memory structure is, well, in-memory, so it assumes you have access to everything.)
The problem is that using object serialization code is designing a file format - but without doing any of the usual work you might do in designing a file format.

In the short term you save time on writing file I/O code, but as soon as you change the object format you must write new code to read the old file format, so you pay the "cost" of that code eventually -- but you must write this code against a file format that wasn't really "designed" at all.

In particular with WED, we want the file format to be stable and low-change over a long period of time, because the kind of data that might be in a WED file can be useful over a relatively long lifetime.

Given this, I am writing an explicit file format up-front rather than use the object serialization mechanism.

Thursday, January 25, 2007

Windows Vista - Stable API

I'm not a huge fan of Microsoft or Windows Vista, and I do all my primary development on a Mac. But...

The same X-Plane code that runs on Windows Vista will run on Windows 98 SE.

In that time Apple has changed the API (introducing Carbon), the ABI twice (moving from CFM to Mach-O, and GCC3 to 4*) and the instruction set (PPC to Intel), as well as the compiler twice (MPW to X-Code, or more likely Metrowerks to X-Code, and GCC 3 to 4).

I love Apple dearly, but if I write an app and burn the source code, I wouldn't be surprised if the binaries run on Windows 5 years from now. I wouldn't make that prediction for the Mac.

*Is it fair to call the upgrade from GCC 3 to 4 an ABI change or compiler change? Not really, they're not full ones, but besides the usual slight changes, the real problem is that GCC 4 generates run-time dependencies on shared libraries that were not available in older Mac Operating Systems. This is why X-Plane is not alone in requiring OS X 10.3.9 or higher - 10.3.9 is the oldest operating system that has the runtime for GCC 4. Applications that can run all the way back to OS X 10.2 are actually building their PowerPC executable code against a different deployment target, which requires extra makefile gymnastics.

Wednesday, January 24, 2007

Inheritance of Implementation is Evil

If you go to your first software engineering job interview fresh out of college, they might ask you: what are the three tenets of object-oriented programming? They're hoping you'll barf out "encapsulation, polymorphism, and inheritance" or something like that.

Of those, inheritance is probably the least important. They'll then ask why inheritance is a good thing, quite possibly hoping to hear those two dreaded words: "code reuse".

No!!!!!!!! Run for your lives! The sixty foot tall abominable derived classes are coming!

Inheritance of implementation happens any time you derive from a class that does things, and then try to change what that class does slightly by overriding part of its implementation. I would describe this as the code equivalent of an organ transplant - let's just rip out the pancreas, put a new one in (maybe it's not even the same species) and hope it all plays nice together.

I have come to the conclusion that inheritance of implementation is almost always a bad thing. I'm not saying never use it - there's no rule that always holds in software engineering. (Ha, sort that one out.) But in the case of inheritance of implementation, I think that, like virtual base classes, inheritance of implementation should be a red flag and cause for pause.

The problem with inheritance of implementation is that it makes it really easy to violate encapsulation, which is the most important thing in OOP. The parent-child class relationship is a difficult context in which to manage customization.

If you are going to do it, consider a known design pattern like "template method". Template method at least formalizes the parent-child relationship that otherwise gets out of hand.
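A minimal sketch of template method (the class names are made up for illustration):

class Exporter {
public:
void Export() // the fixed algorithm - deliberately not virtual
{
WriteHeader();
WriteBody(); // the one formalized customization point
WriteFooter();
}
virtual ~Exporter() {}
private:
virtual void WriteBody()=0;
void WriteHeader() { /* ... */ } // not virtual - children can't rip these out
void WriteFooter() { /* ... */ }
};

class TextExporter : public Exporter {
virtual void WriteBody() { /* write the body as text */ }
};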

The temptation for inheritance of implementation is strong - when you've got an is-a relationship and the behavior seems to be driven by that relationship, how can you not want to "inherit" the behavior? But is-a describes a publicly visible interface - the implementation code may have seams that aren't related to the is-a relationship. Pulling out pieces of implementation along the is-a lines is asking for trouble.

The cure for inheritance of implementation is a "bridge" pattern - and I would go as far as to say there's no need to worry about whether the implementation is in a class hierarchy (use it only if it's useful) - the important thing is to make sure the implementation is designed based on what makes sense and not based on how the interface presents itself.

(I am not surprised to see Java in the Wikipedia article on "bridge" above - Java doesn't provide for multiple inheritance of implementation. I'm not a huge Java fan but in this case I think they got it 100% right in requiring coders to write a few more lines of code to avoid spaghetti.)

In my professional experience, inheritance of implementation happens a lot out of impatience - I have class A that almost does what I want, so I derive from it and make class B. The worst form of this involves declaring pieces of A that were not virtual to be virtual so that they can become overridden - changing the "API" of A after the fact.

The right thing to do would have been to refactor the code up-front: pull out from A the utility U that B wants to reuse, then A and B can call U. Eventually someone is going to do that refactoring, but it would have been a lot easier to see and do first, when the implementation was all in A, than after the implementation has been split between A and B by random overriding of virtual methods.

Friday, January 12, 2007

Why Const is "Wicked Weird"

This might be surprising:

typedef char * c_string;
typedef const char * const_c_string;
const c_string x = 0;
const_c_string y = 0;
x = y; // this is illegal!

In order to understand this you need to know a few things about C++.

First: a type specifier has parts:
  • Some kind of typename, like int.
  • Possible a "cv" (const-volatile) qualifier like "const".
  • Possibly the word typedef.
  • Other stuff...
The order doesn't matter!!!! In other words:

typedef const int number_t;
typedef int const number_t;

Those are both the same. But it gets a little weirder:

int typedef const number_t;

basically int, typedef and const are all parsed together - they can trade places.

Pointers, however, most certainly can't. So these two declarations are the same:

const char * t;
char const * t;

But this one is very different:

char * const t;

If you read the C++ spec, the syntax for the pointer part of a type-spec is actually * [cv-qualifier] -- that is, if the word const follows the *, it acts on the pointer.

The way I think of this is: the word const can move around, but it can never cross a *.

So now we can understand why the above didn't work...

typedef char * c_string;
typedef char const * const_c_string;

c_string const x; // this is char * const
const_c_string y; // this is char const *

When we view the "expansion" of the typedef with const moved around, we can see how these are not the same.

(You might wonder whether writing const c_string would be different from c_string const. Well, remember, the const can move around a named type without having any effect. Essentially the pointer has been "baked into" the type using typedef before we applied const. So const has to apply to the pointer, not to the data itself.)
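A quick way to convince yourself (buf is just a scratch array):

char buf[4] = "abc";
const c_string p = buf; // char * const - the POINTER is const
*p = 'x'; // legal! the chars are not const
// p = buf; // illegal - can't reassign a const pointer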

Saturday, January 06, 2007

Ref Counting and Purging

X-Plane's memory management for graphic resources has changed over time.

Back in the old days, allocation was static - there was a fixed number of terrain textures, for example (600). Memory management was trivial, every resource went in its slot, and the code was really bullet-proof. On the other hand, it didn't scale well (meaning at all).

Parts of the code evolved to use a cache-and-retain system...each resource was loaded once the first time it was needed and retained forever. This was dynamic and good.

Unfortunately, as hardware has become capable of blasting out more graphic data, scenery has gotten more complex, and now if we retain everything we tend to run out of virtual memory. So now X-Plane uses a reference-counting system to share resources but purge them when they're not used.

But the code isn't as simple as saying "if the ref count drops to zero, nuke the object". The problem is: consider two scenery files A and B that use three categories of objects:
1. Objects only in A.
2. Objects only in B.
3. Objects in both A and B.

If we load scenery B before purging A, and we use simple ref counting, we will have all three sets of objects loaded at once. That's bad, because our total virtual memory footprint will temporarily spike up, possibly exhausting supply.

If we purge A and then load B, the objects in category 3 are deleted and immediately reloaded. That's unnecessarily slow.

So instead X-Plane uses the following rules:
  • Objects are retained even when their reference count drops to zero.
  • All objects of reference count zero are explicitly purged via a global call.
  • Scenery A is purged before B is loaded.
  • All new objects are loaded as lazily as possible, during flight.
X-Plane thus does this:
  1. Purge scenery A. (Class 1 and 3 objects ref count goes to zero.)
  2. Load scenery B. (Class 3 objects ref count goes back to 1. Class 2 objects ref count is 1 but they are not yet loaded. So far our memory usage for objects hasn't changed.)
  3. Purge all unused objects. (Class 1 objects are now nuked. Memory goes down.)
  4. Start lazily loading missing objects. (Class 2 objects are loaded. Memory goes back up.)
It's more complex than simple reference counting or garbage collection, but since virtual memory is a scarce resource, we do the extra work to keep our footprint small.
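Here's a minimal sketch of the scheme - hypothetical names, much simpler than X-Plane's real resource manager:

#include <map>
#include <string>

struct Resource { int ref_count; /* texture data, object data, etc. */ };

static std::map<std::string, Resource *> s_pool;

void release(Resource * r)
{
--r->ref_count; // note: NO delete here, even at zero - we wait for the purge
}

void purge_unused() // the explicit global purge, called after new scenery is loaded
{
std::map<std::string, Resource *>::iterator i = s_pool.begin();
while (i != s_pool.end())
{
if (i->second->ref_count == 0)
{
delete i->second;
s_pool.erase(i++); // erase invalidates only the erased iterator
}
else
++i;
}
}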