
Double-checked locking: Why doesn’t it work and how to fix it

Double-checked locking is notoriously evil. While it can be a boon to efficiency, it’s tricky to do correctly. Lurking at its core are two issues fundamental to multi-threaded programming. Even if you don’t need this pattern, understanding atomicity and visibility is a must for any modern programmer.

The Pattern

Let’s look at what a double-check lock is. Rather than start with anything questionable, we’ll start with a relatively good form. Most programmers likely can’t explain why this code doesn’t work. Yet understanding the problem is necessary if you ever intend to write safe multi-threaded code. We’ll use pseudo-code for now to avoid any language-specific issues.

SomeType globalObject = NULL;

SomeType getSomeType()
{
	if( globalObject != NULL ) //1
		return globalObject;

	locked_section
	{
		if( globalObject ) //2
			return globalObject;

		globalObject = new SomeType; //3
	}

	return globalObject;
}

This pattern is used when you have an object that is initialized the first time it is required. In the above code we start with a null object. At “1” we check whether it is initialized; if so, we can quickly return that value and avoid any mutex call. If not, we enter the locked section. At “2” we check again to avoid a race condition with another thread; we’re now inside the mutex, so this check is safe. If the object doesn’t exist, we create it and assign it to the global reference. Then we leave the locked section and return the object.

locked_section: For most languages this will probably be a “mutex.lock” to enter the section and a “mutex.unlock” to leave the section. You might also have a “synchronized” keyword that would work.

The Atomic Problem

The simplest problem is one that most programmers will never actually experience. The reference we have to “SomeType” is some kind of value, most likely a pointer to the object in memory. These values are not a single byte in size. When you do your first check on “globalObject” perhaps some parts of the address have been set, but not all of them. The object isn’t null, but it isn’t the correct value either, thus you’ll be returning a broken reference.

//Thread A sets the value in four steps (four bytes)
globalObject.byte[0] = 0x12; //1
globalObject.byte[1] = 0x23; //2
globalObject.byte[2] = 0x45; //3
globalObject.byte[3] = 0x67; //4

//Thread B reaches the first check just after point "2" in the above,
//thus globalObject == 0x00002312
if( globalObject != NULL ) //this passes, since it isn't null
	return globalObject; //this returns a junk reference of 0x00002312

The reason most of us never have this issue is because most current hardware will copy the entire four bytes atomically. You’ll never see this tearing of the value. Even if the hardware isn’t doing this atomically, the race condition is so rare you might still never see it, except in random crash reports from your users.

To correct this you have to explicitly use atomic operations. Most languages don’t offer atomics and expect you to always use locked sections. Some languages, like Java, provide guarantees for basic atomic copying of object references. What each language offers varies greatly. You cannot assume the constructs are the same, and therefore must take the time to carefully understand each language on its own.
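As a concrete illustration, here is a minimal C++ sketch of an atomic pointer (the names `SomeType`, `sharedInstance`, and `publish` are made up for the example). With `std::atomic`, a load or store of the pointer is indivisible: a reader sees either NULL or the full address, never a torn mix of bytes.

```cpp
#include <atomic>

struct SomeType { int value; };

SomeType sharedInstance{7};

// Atomic global pointer: stores and loads of the pointer are indivisible,
// so no reader can ever observe a partially written address.
std::atomic<SomeType*> atomicPointer{nullptr};

SomeType* publish()
{
    atomicPointer.store(&sharedInstance); // whole pointer written at once
    return atomicPointer.load();          // whole pointer read at once
}
```

Note this only fixes tearing of the pointer itself; it says nothing yet about the visibility of the object behind it, which is the next problem.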

The Visibility Issue

Once you add threads, the precise flow of a program becomes vitally important. For many people new to concurrency it may come as a surprise that what your code does isn’t the same as what you wrote. The compiler may optimize statements and reorder things as it sees fit. The CPU also tends to do reordering and out-of-order execution. For a surface-level example, consider how the following statement can be rewritten (in pseudo-code).

//what you wrote...
globalObject = new SomeType;

//...becomes
globalObject = allocate( sizeof SomeType );
SomeType.construct( globalObject );

The problem here is that “globalObject” gets a value prior to the object actually being created. This has a serious consequence for the statement “if( globalObject != NULL )” since that condition will now pass even though the object has yet to be created. You will return a pointer to uninitialized memory!

It bears repeating that regardless of how you order your assignments, the compiler and CPU can conspire together to produce this version of the code. On the CPU side, why this happens relates both to memory reordering and visibility. For this article it suffices to understand that concurrently running threads have slightly different views of memory.

Refer also to “CPU Reordering” and “CPU Memory“.

The Solution

So we have two issues that need to be fixed. First we need to ensure assignment of the pointer is atomic to avoid any partial value. Secondly we need a way to ensure our object is fully created before its pointer becomes visible to any other thread. Let’s expand the code to correct it.

atomic SomeType globalObject = NULL;

SomeType getSomeType()
{
	if( globalObject != NULL )
	{
		load-fence;
		return globalObject;
	}

	locked_section
	{
		if( globalObject )
			return globalObject;

		SomeType temp = new SomeType;

		store-fence;
		globalObject = temp;
	}

	return globalObject;
}

Again, the easier part to explain is the “atomic” label. Declaring a variable atomic ensures that you will never read a partial value from it. The value you read will be either the new value (after an assignment) or the old value (prior to the assignment); it will never be somewhere in between. This solves the problem of producing junk pointers.

The second fix is the pair of “fence” operations. We assign to a temporary variable prior to the “store-fence” and to the real variable after it. Then, before we read from the variable, we issue a “load-fence”. This combination ensures that everything which happens before the store-fence is visible in its entirety after the load-fence. We have effectively said that construction of the object must be entirely complete prior to setting the “globalObject” pointer, and when we read that pointer we ensure we can see the constructed object in its entirety prior to actually using it.

Updated: See the comment thread started by FerdinandB to understand why the `load-fence` is inside the if block.

Note that a fence is actually a fairly low-level CPU instruction. Most languages won’t provide direct access to it. If you don’t have access to such functions, or an equivalent store/load facility, you simply cannot implement safe double-checked locking. Note that some languages expose this functionality with the “volatile” keyword: be very careful, as that meaning of volatile is distinct from its meaning in C and C++.
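For languages that do expose explicit fences, here is one way the fixed pseudo-code above might translate; this is a sketch using C++11’s `std::atomic_thread_fence`, with illustrative names (`globalObject`, `globalMutex`, `getSomeType`), not a definitive implementation.

```cpp
#include <atomic>
#include <mutex>

struct SomeType { int value = 42; };

std::atomic<SomeType*> globalObject{nullptr};
std::mutex globalMutex;

SomeType* getSomeType()
{
    SomeType* p = globalObject.load(std::memory_order_relaxed);
    if (p) {
        std::atomic_thread_fence(std::memory_order_acquire); // load-fence
        return p;
    }

    std::lock_guard<std::mutex> guard(globalMutex);          // locked_section
    p = globalObject.load(std::memory_order_relaxed);        // re-check under lock
    if (p)
        return p;

    SomeType* temp = new SomeType;                           // construct first...
    std::atomic_thread_fence(std::memory_order_release);     // store-fence
    globalObject.store(temp, std::memory_order_relaxed);     // ...then publish
    return temp;
}
```

As in the pseudo-code, the object is fully constructed before the release fence, and the acquire fence is issued only after a non-null pointer has been observed.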

If a language does offer atomic variables, it probably adds the fences implicitly when those variables are read and written. For example, the new C++ standard offers atomic variables which also have memory ordering semantics. Instead of using an explicit fence you can use a combination of “acquire” and “release” semantics on the atomic pointer.
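A sketch of that acquire/release form in C++11 (names are illustrative, as before): the acquire load subsumes the load-fence and the release store subsumes the store-fence, so no standalone fence instructions appear.

```cpp
#include <atomic>
#include <mutex>

struct SomeType { int value = 42; };

std::atomic<SomeType*> globalObject{nullptr};
std::mutex globalMutex;

SomeType* getSomeType()
{
    // The acquire load plays the role of the load-fence.
    SomeType* p = globalObject.load(std::memory_order_acquire);
    if (p)
        return p;

    std::lock_guard<std::mutex> guard(globalMutex);   // locked_section
    p = globalObject.load(std::memory_order_relaxed); // safe: mutex is held
    if (p)
        return p;

    p = new SomeType;
    // The release store plays the role of the store-fence: the construction
    // above cannot become visible after this publish.
    globalObject.store(p, std::memory_order_release);
    return p;
}
```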

Recap

For reasons of atomicity and visibility, best practices dictate that all shared memory be accessed within a locked section. This is not an unreasonable standard. Once you wish to break free of this pattern you will encounter a few major issues. Understanding alone is non-trivial; consider that in the above we’ve really just glossed over what “visibility” really is.

Double-checked locking is a relatively mild introduction to the problems of advanced concurrency. This example alone is very useful in production code, as it can avoid the overhead of a mutex, which can be significant in some circumstances. Just make sure you understand exactly what you are doing before implementing anything with atomic variables. If in doubt, stick to the normal locked section approach.

Classic Approach

I was asked to provide an example of the classic approach, so I’ll add that here. Note that there is actually no classic “double-check” pattern, since without the above fixes the pattern cannot work. This is why the double-check pattern is considered evil: you simply can’t implement it correctly without atomics and fences. Instead you’re left using just a mutex-locked section like the one below.

SomeType globalObject = NULL;

SomeType getSomeType()
{
	locked_section
	{
		if( globalObject )
			return globalObject;

		globalObject = new SomeType;
		return globalObject;
	}
}

This approach completely avoids the issues of atomicity and visibility. While easier to understand and program, it can be a bit less efficient. The added cost comes from the need to use a mutex on every call to the function. Exactly how much this costs depends greatly on the programming language, the operating system, and the underlying hardware. Refer to “How does a mutex work?“.
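The classic approach above might look like this in C++ (a sketch; `globalObject`, `globalMutex`, and `getSomeType` are illustrative names). The `lock_guard` acquires the mutex on entry and releases it on every return path.

```cpp
#include <mutex>

struct SomeType { int value = 42; };

SomeType* globalObject = nullptr;
std::mutex globalMutex;

SomeType* getSomeType()
{
    std::lock_guard<std::mutex> guard(globalMutex); // every call pays for the lock
    if (globalObject)
        return globalObject;
    globalObject = new SomeType; // safe: only one thread can be constructing
    return globalObject;
}
```

Because the mutex alone provides both atomicity and visibility, no atomics or fences are needed here; the price is taking the lock on every call.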


15 replies »

  1. This is the best explanation I have ever seen for why we should use atomic values. Thank you very much.

    • Thank you. Do be aware that on x86 and other modern hardware, most fundamental types will naturally be atomic, so you don’t get this tearing. However, if you were using some non-fundamental type, like a structure with multiple fields, the tearing becomes more likely.

      Though in C++, for example, marking something as atomic, even if it doesn’t need to be on the hardware, shouldn’t cost anything extra — the compiler should know it doesn’t need to handle it specially.

  2. Is the load fence necessary? No loads precede it. I can see why you might want a load fence after the test in line 6, but I don’t understand putting it before. BTW, I ( think I ) understand fences only in the context of x86 lfence and sfence and not any other architecture.

    Come to think of it, if a load fence isn’t put after line 6, could the CPU have speculatively loaded part of the global object? Clearly, any load that could execute wouldn’t be able to use the address, but perhaps a static field that the SomeType constructor modified could cause an issue.

    • The fences always have to be matched (in the general case). Without that load-fence, what the processor doing the loading sees could still (in theory) be kind of mixed. Essentially, without the load fence that particular processor will not see the store fence. Thus just doing it right on the store side is not enough; each processor on its own must somehow synchronize (thus the load-fence).

      Note that x86 is a special case since it has very strong ordering guarantees. I think a locked exchange on the set is enough, no explicit fence needed in either place. Plus the word is naturally atomic. Your compiler may reorder things however, thus some kind of fence/atomic is still required. This is the advantage of using C++ atomics, since on x86 they will actually compile as such: the store should become a locked exchange, and the load-fence will simply disappear.

    • I agree with Grady, I don’t believe the load fence is necessary. Here is why (I’ll write it in C++ to be more concise):

      std::atomic<SomeType*> globalObject;

      Let A = if (globalObject.load(std::memory_order_relaxed))
      Let B = SomeType* temp = new SomeType;
      Let C = globalObject.store(temp, std::memory_order_release)

      Because of the release store at C, B must fully complete before C. If C has not completed, A is guaranteed to see NULL. If C has completed, then by the transitive nature, if A sees that C has completed (A sees non-NULL), it will also see that B has completed and you will get back a pointer to the fully constructed object.

      The only problem with A not being fenced is A is not guaranteed to see that C has completed. Thus, A could see NULL even when C has completed. However, the mutex lock provides the necessary fence such that the second check of globalObject will see that indeed C has completed and will return a fully constructed object.

    • This issue is about the contents of B. You have to assume another processor is allowed to predictively load whatever it wants in the absence of sync instructions. So it is true that it will only see a null, or a pointer to a fully completed object, this says nothing about the memory in that object. If it happens to have loaded data for that object prior to this point it is free to keep using that previously loaded data, which may not actually be the initialized data. The load-fence is there to force it to stop using previous knowledge of this object.

      If we go a bit more abstractly and look at the C++11/C11 standards themselves, you’ll also find that store/load fences are not defined completely on their own. That is, the effects of a store fence are described in relation to a load fence. This allows that without the load fence another core is essentially allowed to ignore the store fence. A store is only half an operation which doesn’t provide guarantees on its own.

  3. A question (because I have used a variation of this pattern before):
    If we depend not on the object reference itself but rather on, eg, a boolean flag *and* we are happy that booleans are set as an atomic operation on our chosen platform, are we any safer?

    Eg (prepare for formatting nightmare):

    SomeType globalObject = NULL;
    boolean globalObjectSet = false;

    SomeType getSomeType()
    {
    	if( globalObjectSet ) //1
    		return globalObject;

    	locked_section
    	{
    		if( globalObjectSet ) //2
    			return globalObject;

    		globalObject = new SomeType; //3
    		globalObjectSet = true;
    	}

    	return globalObject;
    }

    • No, this isn’t any safer than the unfenced version. The compiler and CPU are both free to effectively move your ‘globalObjectSet = true’ statement prior to the construction of the object. Even if not reordered on one processor, memory visibility could still mean that this true value is visible prior to the rest of the object.

      You absolutely need instructions which guarantee the ordering. You cannot rely on the order of the normal assignments/reads you’ve specified at the source code level.

  4. I couldn’t leave an inline reply so I’m replying here…

    I’m still not convinced the load fence is needed. You said:

    “This issue is about the contents of B. You have to assume another processor is allowed to predictively load whatever it wants in the absence of sync instructions. So it is true that it will only see a null, or a pointer to a fully completed object”

    And then…

    “If it happens to have loaded data for that object prior to this point it is free to keep using that previously loaded data, which may not actually be the initialized data.”

    But that can’t happen. It can’t load the data of the object if it doesn’t have a pointer to it. As you stated above, it won’t get a pointer to the object until the object is fully constructed. Thus, no way to read partially initialized data.

    “The load-fence is there to force it to stop using previous knowledge of this object.”

    There is no previous knowledge. This code is for making sure you initialize the object pointer once and only once, and you get back a fully constructed object when you do get the pointer.

    BTW, I’m not arguing for arguing’s sake. So far I just don’t see a need for the load fence. I’m open to being convinced otherwise!

    • You can’t assume that the processor doesn’t already see the memory behind the pointer. Without understanding the full cache architecture you just can’t know. The processor may just coincidentally already have that memory in its load buffer. This can actually be quite common if you are creating/destroying a lot of small objects and malloc/new keep returning similar addresses.

      More likely example: say you have an array of pointers. The array is a global variable known to all threads from the start. You also have a shared index which is initially 0. The first caller, processor A, initializes the memory at a particular location in the array. The other processor B guesses that it will need some data from the array, so it loads it all (this does happen). Then A sets the index and issues a store fence. Processor B then reads the index, and it gets the right index. However, since it previously loaded the array, it does not have the correct array contents at that index. Without a load-fence it has no way of knowing about this data dependency and will proceed to use the previously loaded, incorrect contents.

  5. I know this is an old thread, but a question on KitD’s boolean flag approach:

    Suppose instead of

    locked_section
    {
    	if( globalObjectSet ) //2
    		return globalObject;
    	globalObject = new SomeType; //3
    	globalObjectSet = true;
    }

    the code was

    locked_section
    {
    	if( globalObjectSet ) //2
    		return globalObject;
    	globalObject = new SomeType; //3
    	if( globalObject != nullptr )
    		globalObjectSet = true;
    }

    That is, a causal dependency on the outcome of the new is established. Does not that prevent the problematic code reordering? Or is this approach still fatally flawed, even with the assumption about atomic operations on booleans?

    • No, it doesn’t solve the problem. This causal dependency is irrelevant to the optimizer, as it still doesn’t introduce any kind of memory ordering guarantee. An optimizer is free to see that the condition is always true and simply remove the `if` statement.

      You need explicit memory operations and atomics to ensure the correct visibility of the pointer and its contents.

  6. Your implementation of Double-Checked Locking is broken.

    I don’t agree with other people in the comment section on the load fence not being necessary; it is absolutely necessary, but you are issuing it in the wrong place.

        load-fence;
        if( globalObject != NULL )
            return globalObject;
    

    If thread A creates the object and assigns a value to globalObject *after* thread B issues the load-fence but before checking globalObject,
    you still end up with a pointer that may be pointing at a partially constructed object, because synchronization has not completed.

    There is a strict inter-thread ordering between the store to globalObject in thread A (after the store-fence) and the load in thread B.
    Therefore, if thread B sees a value, thread A has released the data (happens-before relationship) and all B needs to do is issue the load-fence to acquire the data globalObject is pointing at.

    Another way of looking at it is that memory operations after a load-fence cannot be reordered with a load that is sequenced before the fence.
    If the fence comes first and all memory operations are sequenced after it, there is no prior load to order against, so nothing is prevented from being reordered.

    The solution is to issue the load-fence after you observe a non-NULL value for globalObject:

        if( globalObject != NULL )
        {
            load-fence;
            return globalObject;
        }
    
    • I’m uncertain. I understand what you’re saying but not sure if it can actually happen. For emphasis, if I instead wrote code like this:

      load-fence;
      if (globalObject.Ready)
      {
          print(globalObject.SomeValue)
      }
      

      This would not prevent this re-ordering:

      load-fence;
      var a = globalObject.SomeValue
      if (globalObject.Ready)
      {   
           print(a)
      }
      

      Which would not be desired.

      I’d like to say the conditional check of `globalObject == null` in the code forms a sequenced operation with `return globalObject`. The evaluation of that `return` must strictly happen after the conditional. However, if I think in terms of pages of memory, the page for the actual object might be coincidentally loaded by the CPU prior to the `if`. If `globalObject` happens to point into that memory, the CPU is not required to load it again, as it has no strict happens-before/after relationship with `globalObject`. It’s likely this scenario just doesn’t happen on current CPUs, and they do see a causal relationship between the conditional and the initialized contents.

      So I believe you are correct. I will move the load-fence.
