Efficiency

How does a mutex work? What does it cost?

Concurrent programming requires synchronization. We can’t have multiple threads accessing the same data at the same time when at least one of them is writing; otherwise, we end up with a data race. The most common solution is to wrap the critical data access in a mutex. Mutexes are, of course, not free. A mutex can have a significant impact on the cost of the code we are writing. When used correctly, we’ll barely notice the overhead. When misused, it can cause a program to run worse in threaded mode than it would have single-threaded!

You may also be interested in reading CPU Memory – Why do I need a mutex?.

What is a mutex?

A mutex, in its most fundamental form, is just an integer in memory. This integer can have a few different values depending on the state of the mutex. When we speak of mutexes, we also need to speak about the locking operations. The integer in memory is not particularly interesting on its own, but the operations around it are.

There are two fundamental operations which a mutex must provide:

  • lock
  • unlock

A thread wishing to use the mutex must first call lock, then eventually call unlock to release it. There can be only one lock on a mutex at any given time. The thread holding the lock is the current owner of the mutex. If another thread wishes to gain control, it must wait for the first thread to unlock it. This mutual exclusion is the primary goal of the mutex, and indeed the origin of the name.

The lock operation has several variants. In most cases, we’d like to wait until we can lock the mutex, so the most common lock operation does precisely this. Other times we may wish to wait only for a given period, and sometimes we may not want to wait at all.

In contrast, unlock is usually just one function. It makes the mutex available for another thread to lock.

Contention

Attempting to lock an already locked mutex is called contention. In a well-planned program, contention should be quite low. You should design your code so that most attempts to lock the mutex will not block.

There are two reasons why you want to avoid contention. The first is that any thread waiting on a mutex is not doing anything else — possibly resulting in unused CPU cycles. The second reason is more interesting, in particular for high-performance code. Locking a currently unlocked mutex is cheap compared to the contention case. We have to look at how the mutex works to understand why.

How does it work?

As mentioned before, the data of a mutex is an integer in memory. Its value starts at 0, meaning that it is unlocked. If you wish to lock the mutex, you check if it is zero and then assign one. The mutex is now locked, and you are the owner of it.

The trick is that the test-and-set operation must be atomic. If two threads happened to read 0 at the same time, both would write 1 and believe they own the mutex. Without CPU support there is no way to implement a mutex in user space: the operation must be atomic with respect to all other threads. Fortunately, CPUs have an instruction commonly called “compare-and-swap” (or, in a simpler form, “test-and-set”) which does exactly this. It takes the address of the integer and two values: a compare value and a set value. If the compare value matches the current value of the integer, the integer is replaced with the set value. In C-style code this might look like this:

int compare_set( int * to_compare, int compare, int set );

int mutex_value = 0; /* 0 means unlocked */
int result = compare_set( &mutex_value, 0, 1 );
if( !result ) { /* we got the lock */ }

The caller determines what happened by inspecting the return value: the value stored at to_compare just before the swap was attempted. If this value equals the compare value, the caller knows the set was successful. If the value is different, the call was unsuccessful. When the section of code no longer requires the lock, it can set the value back to 0. This makes up the fundamental part of our mutex.

Atomic increment/decrement operations could also be used, and are the recommended approach when working with the Linux futex.

What about waiting?

Now comes the tricky part, though in one sense it is quite simple. The test-and-set mechanism above provides no way for a thread to wait on the value (aside from a CPU-intensive spin-lock). The CPU doesn’t understand high-level threads and processes, so it isn’t in a position to implement waiting. The operating system (OS) must provide the waiting functionality.

For a thread to wait correctly, it needs to go through a system call: the OS is the only thing that can synchronize the various threads and provide the waiting functionality. If we have to wait on a mutex, or wake a thread waiting on one, we have no choice but to call the OS. Most OSs have built-in mutex primitives; in some cases, they provide full-fledged mutexes. If a system call can provide a full mutex, why would we bother with any test-and-set in userspace? The answer is that system calls have quite a bit of overhead and should be avoided when possible.

Various operating systems diverge at this point, and the details will likely change as time goes on. Under Linux, there is a system call futex (“fast userspace mutex”) which provides mutex-like semantics. If there is no contention, locking is resolved entirely in userspace. Only under contention does the kernel get involved to handle the wait, in a safe albeit far costlier manner: the OS handles the waiting as part of the process scheduler.

futex is quite flexible in allowing the creation of various locking mechanisms in addition to a mutex, such as a semaphore, a barrier, a read-write mutex, and event signalling.

The Costs

There are a few points of interest when it comes to the cost of a mutex. The most vital point is the waiting time. Your threads should spend only a fraction of their time waiting on mutexes. If they are waiting too often, then you are losing concurrency. In a worst-case scenario, many threads always trying to lock the same mutex may result in performance worse than a single thread serving all requests. This isn’t a cost of the mutex itself, but a serious concern with concurrent programming.

The overhead costs of a mutex relate to the test-and-set operation and to the system call behind the mutex. The test-and-set is likely a minuscule cost; being essential to concurrent processing, CPU vendors have a strong incentive to make it efficient. We’ve ignored another important instruction, however: the memory fence. This is used in all high-level mutexes and may cost more than the test-and-set itself. I talk a bit more about fences in CPU Memory – Why do I need a mutex?. Most costly, however, is the system call: not only do you suffer the context-switch overhead, but the kernel also spends some time in its scheduling code.


14 replies

  1. Windows has the CRITICAL_SECTION, which sounds like it is the same as the Linux futex. A CRITICAL_SECTION tries doing the test-and-set in user space. If that doesn’t work then it can be told to spin for a while (on multi-core systems). If that doesn’t work then it uses a kernel mutex to wait. Thus, it gives great performance, without ending up in a long spin loop.

    I’ve seen many people try to implement their own mutexes. They all handle contention poorly. Either they yield under contention with Sleep(), meaning that they stay idle too long, or they spin in a busy loop for hundreds or thousands of milliseconds, wasting power and causing priority inversions. Some of them even omit memory barriers so that they are inefficient *and* incorrect.

    Use the operating system primitives. Set a spin count on a CRITICAL_SECTION and use it rather than rolling your own. Please.

    • A critical section is a bit higher level, but definitely useful in the same way. I’d suspect it internally uses a private Windows API call, similar to futex, that lets it work first in user space and then defer to the kernel. However it’s achieved, handling the non-contended case strictly in user space is basically a requirement nowadays; otherwise safe threaded code is too slow.

      But you’re absolutely right: somebody creating a custom synchronization object should definitely build it on an OS-provided function. Rolling your own risks both wasting resources and breaking memory visibility, as you say. I do a lot of work with concurrent processing using atomics, so I’ve experienced first-hand how non-trivial this can be.

    • I’m reading this as “How does a thread get put into a waiting state when the lock is not available?”

      The actual locking is handled by the OS. It needs to expose a low-level mutex mechanism, like `futex`, that will put the thread to sleep while it waits on the mutex. The “What about waiting?” section describes this.

  2. Thanks for the article, simple and helpful. I have a question: is it a random choice who gets ownership if two or more threads are waiting on a locked mutex?

    • This is OS dependent. On Linux the FUTEX_WAKE op gives no guarantee as to which thread is awoken.

      A program must be designed to avoid contention, or in a queue-like system, just not care which thread awakes. It should not rely on any fairness to the lock allocation.

  3. Thanks for the simple and concise article.

    I have one question in regards to waiting scenario:
    If we want to use CAS instruction in User Space, then what is the harm if we:
    1) Use a counter in a while loop, and yield the CPU if the counter exceeds a threshold value (a user-space spin lock)
    2) Or just yield the CPU as soon as the CAS instruction fails (when we know the locked section may be a bit big)

    I assume the price to pay is the context switch, but we can never avoid it, as we want the CPU to do useful work rather than waiting.

    • Spinning with a CAS is useful in some situations, and I did it on one of my projects. It’s helpful if you think contention is unlikely and extremely short-lived. There are a couple things to note:

      – It’ll obviously only work if you have multiple cores and the thread holding the lock is currently running.
      – The pre-emptive multitasking slices are very large compared to these operations, so you will hold the core for a long time before it is taken away
      – The core is at 100% use during this spin

      If you are doing a lot of IO operations, or other things that already involve context switches, you’re correct that you won’t gain much from a few spin locks. They are generally only useful when you have threads that can do a lot of processing without needing to otherwise wait. We had some special threads like this on our finance project; they were pure processing, all IO was done on other threads. It’s a hard model to get right and actually do something useful.

  4. Thanks for the simple, straightforward explanation.
    Which system call is invoked when there is contention for locks in user space? Can I assume it is the futex call?

    • Yes. On Linux at least it is a `futex` system call. As you said, this is only done when there is contention, or when something voluntarily wants to pass control to the kernel. Non-contended locking is handled in user space, with a compare-and-swap or similar instruction.

  5. Which mutex lock implementation is fastest on commonly implemented OS designs? For example are spinlocks, plain mutex locks, read / write locks, of these which are the fastest on common designs?

    • There is no absolute fastest design. Which lock performs the best depends a lot on how it is being used, and what needs to be accomplished.

      In all cases, contention is the biggest performance drawback. You don’t want threads waiting on each other if you can avoid it.
