Friday, July 03, 2009

CAS And Reference Counting Don't Mix

It took me a bit of head-scratching to understand this, but...

Atomic Reference Counting

You can use atomic operations to reference count objects. For example, if you look at the GNU C++ string implementation, you'll see that a string object is a pointer to a shared "real string". When a string is copied, the reference count is incremented. When the reference count drops to 0, the string that dropped the last reference deletes the buffer. ("Last one out, turn off the lights.")

The string doesn't need locks because it uses atomic operations to maintain the reference count. In particular, there is no race between thread A dropping the count to zero and thread B cloning the buffer again. If thread B is in a piece of code that can be cloning the buffer, thread B must be doing it via an object that already holds a reference to the buffer...thus thread A's decrement cannot drop the count to 0 - at worst it can drop to 1 (B holding the remaining count).
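Here is a minimal sketch of the idea using C++11 std::atomic (the names Rep, retain, and release are mine, not libstdc++'s actual internals):

#include <atomic>

struct Rep {
    std::atomic<int> refs;   // how many string objects share this buffer
    char*            data;
    Rep(char* d) : refs(1), data(d) {}   // the creator holds the first reference
};

void retain(Rep* r) {
    r->refs.fetch_add(1);
}

void release(Rep* r) {
    // Last one out, turn off the lights.
    if (r->refs.fetch_sub(1) == 1) {
        delete[] r->data;
        delete r;
    }
}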

Compare and Swap (CAS) For Structures

CAS can be a useful way to modify a structure. A typical CAS-based update will:
  • Make a local copy of the value in a "slot" to be modified.
  • Calculate a new value for the slot.
  • Attempt to CAS the new value in for the old.
  • If the slot contains a value other than the old value (the CAS fails) we know some other thread got in there instead. We loop around and retry.
For example, we can push a new object onto a list by first preparing our new head node and then CASing. If the head pointer changed, we know that our newly prepared node no longer points to the current head and we need to retry.
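A hedged sketch of that push in C++11 (Node and head are my names; compare_exchange_weak re-reads the current head into n->next whenever it fails, which is exactly the "loop around and retry" step):

#include <atomic>

struct Node {
    int   value;
    Node* next;
};

std::atomic<Node*> head(nullptr);

void push(int value) {
    Node* n = new Node{value, head.load()};
    // If another thread changed head, our node points at a stale head;
    // the failed CAS reloads head into n->next and we try again.
    while (!head.compare_exchange_weak(n->next, n)) {
        // empty: the retry is the loop itself
    }
}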

CAS + Reference Count = Race Condition

One way to make a general purpose data structure lock-free and thread-safe is to have the structure accessible by pointer. The modify operation uses CAS - that is, the modifier clones the structure, makes the modifications, and attempts to swap the structure back in by pointer (using CAS). If the swap fails, the writer thread has to retry.

In order for this to work, we need to be sure that the old structure is not destroyed while anyone is reading it.

My immediate thought was: no problem - we can use reference counts. We will increase the reference count when using that old structure and not throw it out until everyone is done.

Such an algorithm would go something like this:
to read:
retain our struct
read it
release our struct

to modify:
retain our struct
copy it
modify the copy
attempt to CAS the structure back
if the CAS succeeds, release the old struct twice.
if the CAS fails, release the old struct once and the new struct once and retry.
(Where did I get those release counts? Well, the current struct is retained once as "the master copy". We retain it again to ensure that it isn't deleted while we copy it. So if we successfully modify, we need to release it twice - once for us and once because it isn't the master copy any more. If the CAS fails, we release only our "don't delete while we copy" reference, and we release our new copy to deallocate it.)
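To make this concrete, here is a hedged C++11 sketch of that algorithm (Data, current, reader, and modifier are my names; the fetch_add on the read side is the step the next section shows to be fatal):

#include <atomic>

struct Data {
    std::atomic<int> refs;
    int              payload;
    Data(int r, int p) : refs(r), payload(p) {}
};

std::atomic<Data*> current;   // assume initialized to a Data with refs == 1

void release(Data* d) {
    if (d->refs.fetch_sub(1) == 1)
        delete d;
}

int reader() {
    Data* p = current.load();
    p->refs.fetch_add(1);        // the unsafe step - see "Why This Fails"
    int v = p->payload;
    release(p);
    return v;
}

void modifier() {
    for (;;) {
        Data* old_p = current.load();
        old_p->refs.fetch_add(1);                    // protect while we copy (same unsafe step)
        Data* new_p = new Data(1, old_p->payload + 1);
        Data* expected = old_p;
        if (current.compare_exchange_strong(expected, new_p)) {
            release(old_p);   // our "don't delete while we copy" reference
            release(old_p);   // the master-copy reference it no longer holds
            return;
        }
        release(old_p);       // drop the copy-protection reference...
        delete new_p;         // ...throw away the failed copy, and retry
    }
}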

Why This Fails

It turns out that you can't implement this structure using typical atomic operations (CAS and atomic increment/decrement). The problem is in the read. How exactly do we increase the retain count?

In GNU's string they do something like this:
atomic_increment(&my_buffer->reference_count);
And that works great, because my_buffer is a private pointer that can't change out from under us while we use it. But in our scheme, the pointer to the struct is itself the target of atomic operations. And yet we have to dereference it to get to the reference count.

The code, when broken down, would look something like this:
  1. dereference my_buffer to find the effective address of the reference count.
  2. atomically modify that reference count at its address.
The problem is that between step 1 and step 2 (they are not atomic with respect to each other) someone might first CAS the shared pointer over to a new object (our read does nothing to make that CAS fail, since we never modified the pointer) and then drop the old object's reference count to zero, deleting it.

Now for step 2 the address we have computed is already bogus! Doh!
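Spelled out against the sketch above, the read side decomposes like this:

Data* p = current.load();   // step 1: load the shared pointer
// <-- another thread can CAS `current` to a new struct here, then
//     drop the old struct's count to zero and delete it
p->refs.fetch_add(1);       // step 2: the address &p->refs is already bogus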

This can be worked around if an architecture supports sufficiently strong conditional operations based on memory modification. In other words, if we can execute the atomic increment conditional on the pointer to the buffer being unmodified (e.g., a double compare-and-swap over two independent words), we can correctly "bounce out and retry" if someone switches the underlying object before we get our reference count in.

Of course, most of the machines in my office don't have that kind of hardware.

Hazard Pointers - Why They Work

There is an alternate design pattern that works around this: hazard pointers. The basic idea (to heavily oversimplify things) is this: there is a small, fixed pool of pointer slots (each slot written by exactly one thread, so no locking is needed to manage the pool) in which a thread can declare "hey, I am using this structure".

Threads that want to delete a structure don't actually free the memory block if anyone is using it, which they determine by scanning the pool of hazard pointers.

Why do hazard pointers work? They work because, unlike our reference counts, we don't have to dereference a pointer before we can "protect" it. Since the hazard pointer is the pointer to the block itself, we can protect first and dereference second, something like this:
set our hazard pointer to the block we want to use
CAS the pointer to the block against its old value, swapping it with itself, to detect whether it has changed in the meantime.
If it has, loop back and try again.
If it has not changed, we are now safe to use it, because our hazard pointer is established.
The clever aspect of hazard pointers is that, by providing per-thread storage, the total number of places to check for hazard pointers stays relatively small, and no dereferencing is needed before a pointer is protected.
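A hedged sketch of both sides, again in C++11 (the fixed-size hazard array, MAX_THREADS, and try_delete are my simplifications; a plain re-read of the pointer serves the same purpose as the self-CAS described above):

#include <atomic>

struct Data { int payload; };

const int MAX_THREADS = 8;
std::atomic<Data*> hazard[MAX_THREADS];   // slot i is written only by thread i
std::atomic<Data*> current;

// Read side: publish the pointer *before* dereferencing it.
Data* acquire(int my_thread) {
    for (;;) {
        Data* p = current.load();
        hazard[my_thread].store(p);   // protect first...
        if (current.load() == p)      // ...then verify it hasn't changed already
            return p;                 // safe: any deleter will now see our slot
    }
}

void release(int my_thread) {
    hazard[my_thread].store(nullptr);
}

// Delete side: scan the slots; defer reclamation if anyone holds the block.
bool try_delete(Data* victim) {
    for (int i = 0; i < MAX_THREADS; ++i)
        if (hazard[i].load() == victim)
            return false;             // still in use - keep it on a retired list instead
    delete victim;
    return true;
}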
