We’ve all heard that a CPU may reorder access to memory. Yet if you’ve looked further you’d also see that cache coherence ensures the memory is kept in sync at all times. Something doesn’t quite add up. If the memory is always coherent, how can any reordering affect user programs? And what actually is being reordered?

Memory Structure

In this article we’ll gloss over the fine details of how cache coherency actually works. We’ll take the view that memory is available to all cores in its full and proper state: it won’t have any outdated segments, nor can it get out of sync. While this might seem like an unfair simplification, it is, for most intents and purposes, how the memory actually behaves. We’re going to completely trust the CPU with respect to its caches and memory use.

If you’re adventurous you can delve into the Intel or AMD manuals and see what actually happens. The relevant design is known as the ccNUMA (cache-coherent NUMA) architecture.

A fine point we can’t really avoid is that of atomicity. The above memory rules hold true at the level of atomic operations only. In fact, we can define atomic operations to mean exactly that: a read/write operation which behaves as though there is a single unified view of memory. Most CPUs provide atomic operations for data sizes at least up to their native word size: 8 bytes for a 64-bit machine, 4 bytes for a 32-bit machine. Intel’s instruction set is extra generous, allowing 16-byte double-width operations to be performed atomically.

Instruction Reordering

Consider the following piece of code. It represents a very common pattern in concurrent programming.

int data[5] = { 9, 9, 9, 9, 9 };
bool is_ready = false;

void init_data() {
  for( int i=0; i < 5; ++i )
    data[i] = i;
  is_ready = true;
}

void sum_data() {
  if( !is_ready )
    return;
  int sum = 0;
  for( int i=0; i < 5; ++i )
    sum += data[i];
  printf( "%d", sum );
}

Now suppose we have two threads in the program: a reader thread and a writer thread. The reader thread just keeps calling sum_data. At some point the writer thread will call init_data to initialize the data. In writing the threads we have tried to be clever: we don’t set the global is_ready boolean to true until all the data is initialized. Now we’ll consider why this, though a good thought, doesn’t quite work as intended.

The init_data function issues a series of store operations to the CPU. They are issued in this order, what is known as program order:

store data[0] 0
store data[1] 1
store data[2] 2
store data[3] 3
store data[4] 4
store is_ready 1

That is, in our intended order of operations the value of is_ready is false until we have written the numbers into the array. So the reader thread should keep seeing false until all the values have been stored. This is when reordering rears its ugly-albeit-highly-efficient head and messes with that logic. The CPU looks at your series of stores, decides it can do this more efficiently and actually executes this series of stores:

store data[3] 3
store data[4] 4
store is_ready 1
store data[0] 0
store data[1] 1
store data[2] 2

Now it is clear why the reader thread has a problem. It may get a true value for is_ready before the first three index values have been written. Indeed, if the reader thread wakes up immediately when is_ready is set it will sum up the values 9,9,9,3,4 instead of the intended 0,1,2,3,4.

It is helpful to think of the CPU as reordering the instructions themselves, rather than some abstract list of memory operations. Thinking in terms of instruction reordering lets us assume that once an instruction is finally executed, all cores will immediately see its new value in memory. It also lets us understand the CPU from the point of view of the user program rather than trying to understand its internals. For normal user programming this view is correct.

The First Fence

To fix the problem we’re going to extend our pseudo-code with a locking operation. We’re not actually interested in the lock itself, but in what happens at the CPU level when it is unlocked.

lock_type lock;

void init_data() {
  synchronized( lock ) {
    for( int i=0; i < 5; ++i )
      data[i] = i;
  }
  is_ready = true;
}

Somehow a synchronized block, such as one guarded by a mutex, gives the CPU a hint as to how to handle the memory in that section. The above code issues the following commands of interest when storing values:

store data[0] 0
store data[1] 1
store data[2] 2
store data[3] 3
store data[4] 4
fence
store is_ready 1

The fence is the interesting part. Right before we store the true value in is_ready our synchronized block ends. This causes a fence instruction to be executed. If you try to work out what this instruction actually does you’ll need to delve deep into the internals of the Intel, or AMD, instruction set manuals. Instead let’s look at what it says logically. It constrains where the CPU may reorder store commands: all stores before the fence must remain before the fence, and all stores after the fence must remain after the fence. Thus the reordering of commands we saw previously is not allowed, so the CPU does this instead:

store data[3] 3
store data[4] 4
store data[0] 0
store data[1] 1
store data[2] 2
fence
store is_ready 1

It has still moved the stores for indexes 3 and 4, but it was forced to defer the is_ready store until after the fence. This is really great! It now means that our reader thread will never see a true value until all the indexes have been initialized.

The Second Fence

Or maybe not. For the same reason the CPU may reorder store instructions, it may also reorder load instructions. Flattening the code a bit, the sum_data function ends up issuing these load instructions in this program order:

load is_ready
load data[0]
load data[1]
load data[2]
load data[3]
load data[4]

The CPU again looks at this and figures it can do it more efficiently. So it decides to execute the instructions in this order:

load data[3]
load data[4]
load is_ready
load data[0]
load data[1]
load data[2]

This is a problem. It is possible that our reader thread first loads the data for indexes 3 and 4 (which still hold their initial value of 9), then the writer thread initializes the data and sets is_ready, and then the reader thread reads is_ready and the three remaining values 0,1,2. The reader will now sum up the values 0,1,2,9,9 instead of the intended 0,1,2,3,4.

Again we’ll need to introduce some kind of lock to get around this.

void sum_data() {
  synchronized( lock ) {
    if( !is_ready )
      return;
  }
  int sum = 0;
  for( int i=0; i < 5; ++i )
    sum += data[i];
  printf( "%d", sum );
}

In this case we are interested in what the lock does before it enters the locked section, not what happens afterwards. When compiled, the loads of interest now look like this:

load is_ready
fence
load data[0]
load data[1]
load data[2]
load data[3]
load data[4]

There is that fence again. In this case it prevents the CPU from reordering any loads that come after the fence to before the fence. This ensures that we don’t load any of our data until after we have loaded is_ready and seen that it is true.

Summing Up

The above gives some insight into what is meant when the CPU reorders instructions. It also shows how a simple change in ordering can cause problems for concurrent threads. It is worth noting that reordering is only a problem between truly concurrent threads. Two threads being timesliced on a single CPU core won’t run into a reordering problem: a single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores, however, operate independently in this regard and thus won’t really know about each other’s reordering.

Once you’ve understood this, you’ll immediately wonder something else: why on earth are we using a generic lock, or mutex, if all we needed was the fence command? That’s a good question, and there are two answers. The first is that normally your code will require a lock in any case. Other than in very specially designed algorithms, you’ll need both the fence and the additional locking features to protect your data.

The second answer is simply that most languages don’t provide a high-level fence command, nor would you truly want to use one. Optimizing at the CPU level with fences requires an understanding of how that specific CPU architecture works. Sure, you could just stick full fences in everywhere, but that may be inefficient. The keen observer will already have noticed from this article that distinct load and store fences could perhaps be used. Some CPUs do have separate load and store fences. In fact, some CPUs make certain guarantees that one of those fences isn’t required at all! So for now, if you really want fences you’ll have to get familiar with the hardware and do some inline assembly work. If you wait a bit, new language drafts will have explicit support, such as C++0x, whose atomic variables abstract fences in a platform-independent fashion.

In any case, simply understanding instruction reordering is helpful. It gives a good solid reason why clever ordering of data assignment is not enough protection for concurrent threads. Fences are the low-level answer, but in most cases using them through a traditional mutex is likely the correct choice.
