We all know that operations in a computer happen one after the other. Having multiple CPUs and instruction reordering tends to obscure this: technically things still proceed in an orderly fashion, but logically the interleaving of instructions and memory accesses can seem perplexing. Often we need strict controls to ensure our threads cooperate with each other. For the most part we have mutexes, but sometimes something more low-level is needed: a compound instruction that truly behaves as a single instruction across all the cores. We need an atomic operation. But what is one, and how is it useful?

The Basics

There are two widely used, and virtually universally implemented, atomic operations. They are increment and compare-and-swap (a close relative of the simpler test-and-set). In C code an increment of a variable could look like a++. However, to understand why atomicity is important, let's look at this broken into individual steps.

int tmp = a;
tmp = tmp + 1;
a = tmp;

It may seem silly to assign to a temporary just to increment. The purpose here is to show that if a resides in memory somewhere, the CPU can't just increment it directly. It needs to copy the value into a local working space, typically a register. Exactly where, and what this means, depends greatly on the chip; for now we just need to know a copy of it exists. Once the CPU has done the operation it copies the value back to memory, storing it in a.

Let’s assume another thread wishes to change the variable based on a condition, a compare-and-swap operation.

if( a == -1 )
  a = 10;

Assume that a starts with the value -1 and thread one loads that value. It then sets its temporary to 0 (the increment). Now thread two arrives at if( a == -1 ). It will pass this if statement since a is still -1: the first thread hasn’t stored its value yet. So thread two proceeds and stores 10 in the variable. Then thread one completes its code and stores 0 in the variable.

This is a race condition. From these small bits of code alone we can’t tell whether the value of a should now be 0, 10, or 11. All three of those values will actually appear if you run the code often enough. Importantly, what happens is largely determined by how the code is compiled, and whether it runs on a single CPU core or truly concurrently. That is, many things can influence its behaviour, and this is obviously not what we want.

For the curious, the various CPUs and cores have an additional burden of synchronizing memory between them. Don’t let this distract you from the above logic. On cache-coherent architectures (such as ccNUMA machines) each core is guaranteed to see the same stored value of a at any given time. Focus on the possible interleaving of instructions and think about where the load and store operations actually happen.

The Atomic

In our first bit of code what we really wanted was to increment a atomically. That is, all the threads on all the CPUs should immediately see, and use, our new value of a, with nobody interrupting the operation. For this we need an atomic operation. In terms of mutexes, you can consider the above code to behave as if it were written like below.

//increment
mutex.lock();
a++;
mutex.unlock();

//compare-and-swap
mutex.lock();
if( a == -1 )
  a = 10;
mutex.unlock();

The above pattern likely recurs in many places in your code. Everybody knows you lock a resource to modify it. For an operation like increment, however, you simply don’t need this. The CPU architecture provides fundamental instructions which behave like the above, albeit at far lower cost.

If you want to experiment, in GCC the above two operations can be done atomically using the functions __sync_add_and_fetch and __sync_bool_compare_and_swap. There are many other variants of these functions. They save you having to write a bunch of non-portable inline assembly. Other compilers should have equivalents, or you can also play with the std::atomic types in the new C++ standard.

Why?

You should be asking yourself why you would want to do this. We mentioned atomics have a far lower cost than using a mutex to modify memory. In fact it goes a bit further than that. Each time you lock or unlock a mutex, your thread may be forced to wait for some other thread; that is how mutexes work. Beyond just waiting, a contended lock forces a switch into the OS kernel, which involves more overhead and a large delay for all the involved threads. Your simple increment operation, which in itself takes just a few nanoseconds, can easily balloon into hundreds of nanoseconds, if not several milliseconds, due to kernel mechanics.

Putting all performance aside, they still have purposes. Many thread synchronization techniques are simply not possible without an atomic operation. Something seemingly simple like a mutex actually uses these operations. At the lowest levels, these atomics are pretty much the only way that one CPU has to synchronize data access with another CPU (or core). Without atomics you’d be stuck with unresolvable race conditions like in our example.

Why Not?

Atomic operations tend to only work on small fundamental data types. You can’t use them to alter several fields of a structure at once. For those high-level operations you still need a mutex. In fact, atomics are only useful with specialized concurrent algorithms that have been designed to use them. Sometimes these algorithms are simple, other times they are rather complex. For example, we probably all know of places where a simple atomic increment would be great, and I suggest using it in those places, though the reasons to use a compare-and-swap may be more elusive.

But the atomic operations are there. Ignoring their underlying chip mechanics, their logical operation is not mysterious. They are very simple operations and thus quite easy to understand. They offer tremendous speed advantages over mutex locking for specially designed algorithms. Your OS and multi-core system basically wouldn’t work without them.
