Efficiency

CPU Memory – Why do I need a mutex?

Multi-threaded programming calls for semaphores, synchronized blocks, mutexes, or whatever your language happens to call them. Most of us basically understand why we need them: to prevent multiple threads from accessing the same memory at the same time. But have you ever stopped to think about what they actually do? Why do you need to protect the memory yourself? That sounds like something the CPU should handle. And in many cases, like a simple instantiation, locks seem to be required yet it isn't easy to see why. The answer lies in how the CPU organizes memory, and how it optimizes your code.

Memory Caches

For this article we’ll assume the simple case of writing a program with only private memory. That is, it doesn’t share its memory with any other device on the system, such as a video card, network card, or hard drive. We’ll assume the program, and the program alone, is interested in the memory it uses. All interactions with the outside world are done via OS system calls. This is actually a very common scenario.

Memory chips are located physically quite far away from the CPU. If the CPU had to store and load values directly from those chips on every memory access, your computer would be extremely slow. To save itself some effort the CPU keeps a copy of some of this memory in a cache on the chip itself. When you wish to read a value from memory the CPU loads it in a block-wise fashion. When it comes time to write, the CPU buffers all your changes in that same block of memory and writes it back sometime later.

Coherence

First off, when we speak of concurrency we mean a CPU with multiple cores executing code in parallel. A multi-threaded program running on a single core isn’t really an example of concurrent processing. In terms of memory, a single-core system doesn’t worry about memory coherency at all: its one CPU always has a complete and accurate cache state.

The moment a second core is added, or a second CPU, the situation changes drastically. Each core and CPU will have its own caches — something has to keep them in sync. There is something quite interesting about our current chip designs in this regard: the memory doesn’t get out of sync. That is, despite what you think about how memory is working, each core actually has a correct and accurate view of the memory at all times. If two cores happen to be accessing the same memory the CPU takes care of ensuring they are both working with the correct memory. This is something known as cache coherence.

Wait, back up a bit. Did I just say that the memory never gets out of sync? I did. But we all know that if we don’t use mutexes our program memory will become horribly corrupt. Something doesn’t seem right. So I should clarify. As far as the CPU is concerned it has a consistent view of the memory. However, and it’s a rather big however, your CPU doesn’t concern itself with data structures; it only concerns itself with primitive types. That is where the problems arise.

Atomicity and caching of fundamentals

Each fundamental type, like an integer, byte, long or double (depending on the architecture), can be loaded or stored atomically and without memory corruption. When one core writes an integer to the cache, any other core which then reads from that location will get the new value. To be clear, a core will not read an outdated value from the memory cache. Being atomic also means that the entire integer will be written. It is not possible for only the first 2 bytes to be written before another core reads the value. Atomicity applies to the read side as well: when a core reads from a cache it will get the entire value as it was at the time of the read, not half of the previous value and half of the new one.
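
The hardware guarantee described above maps directly onto `std::atomic` in C++. A minimal sketch, with invented names, assuming a single shared integer:

```cpp
#include <atomic>
#include <cassert>

// A fundamental type shared between cores. std::atomic<int> expresses
// the guarantee described above: every load observes a whole value
// written by some store, never a half-updated mix of bytes.
std::atomic<int> counter{0};

void store_value(int v) {
    counter.store(v);      // the whole int is written atomically
}

int load_value() {
    return counter.load(); // the whole int is read atomically
}
```

Note that a plain `int` does not carry this guarantee at the C++ language level even when the hardware provides it; `std::atomic` is how you ask for it portably.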

The moment you have larger data structures, however, things aren’t so simple. If your structure comprises 4 integers it will take 4 of these atomic writes to commit it to memory. A sequence of atomic operations is not itself atomic. During the writing of those 4 integers another core can come along and start reading, in which case it may get two of the old values and two of your new values. This is the first role which locks play. If the code in both threads takes a lock around access to those 4 integers, the second core won’t read any of the data until you have finished writing all of it. This is the most common and best understood use of synchronization.
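
As a sketch of that first role, here is how a mutex can guard a hypothetical 4-integer structure in C++ (all names here are invented for illustration):

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical structure of 4 integers; without the lock a torn
// read could mix old and new fields.
struct Quad {
    int a = 0, b = 0, c = 0, d = 0;
};

Quad shared;            // the shared structure
std::mutex shared_lock; // guards every access to `shared`

void write_quad(int v) {
    std::lock_guard<std::mutex> guard(shared_lock);
    // All four stores happen while the lock is held, so a reader
    // can never observe a half-written mix of old and new values.
    shared.a = v;
    shared.b = v;
    shared.c = v;
    shared.d = v;
}

Quad read_quad() {
    std::lock_guard<std::mutex> guard(shared_lock);
    return shared; // copies all four ints under the lock
}
```

A concurrent reader sees either all old values or all new ones, never two of each.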

Reordering

Okay, but I must not be telling you everything. If I were, then through clever ordering of your statements and memory assignments you’d be able to synchronize the cores without any mutexes whatsoever. Consider that if you want to add an element to a list, you could first create the element, initialize all its members, and as a final step write the pointer onto the end of the existing list. We know that final operation is atomic, and by the time we do it all the memory in the data structure has already been set. Thus the other thread will either see no additional item, or one entire additional item.

But it doesn’t work! The other thread might see the new pointer before it sees the other initialized values. We have clever CPU optimizations to thank for this. Remember that bit about how memory access is slow? The cache takes care of a lot, but even that doesn’t go far enough. To truly maximize the use of even the fastest cache, the CPU changes the order in which data is written to and read from that cache. So while you may have written values 1,2,3 in sequence, the CPU could easily choose to write them in the order 1,3,2 instead. If this happens the other thread might get the 3, but not the 2!

This is why it’s actually not enough to just lock access to parts of memory, since the CPU could just reorder memory writes outside of that block. So our standard locks, like a mutex, have an additional feature: a fence. When a CPU encounters a fence it knows that all its reordering of loads and stores has to be complete before it passes the fence. This ensures that if two threads both use a fence, the second thread will see the complete set of written data and not just a part of it.
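In C++ this pairing of fences is what `std::memory_order_release` and `std::memory_order_acquire` express. Here is a sketch of the list-publish example from above, with invented names, under the assumption that a single thread publishes:

```cpp
#include <atomic>
#include <cassert>

// Illustrative node type for the list example above.
struct Node {
    int value = 0;
    Node* next = nullptr;
};

std::atomic<Node*> head{nullptr};

// Writer: initialize the node fully, then publish the pointer.
// The release store acts as the fence: the earlier stores to
// `value` and `next` must be visible before the pointer itself.
void publish(int v) {
    Node* n = new Node;
    n->value = v;
    n->next = head.load(std::memory_order_relaxed);
    head.store(n, std::memory_order_release);
}

// Reader: the acquire load pairs with the release store, so a
// thread that sees the new pointer also sees the initialized fields.
int front_or(int fallback) {
    Node* n = head.load(std::memory_order_acquire);
    return n ? n->value : fallback;
}
```

Without the release/acquire pair the CPU (or the compiler) would be free to reorder the field stores after the pointer store, which is exactly the failure described above.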

The Mutex

Obviously this is a rather rough view of how this all works, in particular of how fences solve the problem. The key to understanding the need for a mutex lies in the reordering of memory access and the lack of atomicity in large data structures. The typical mutex solves both of these problems. It ensures that only one thread is executing a key piece of code at a time, which in turn limits access to a data structure. And it ensures that both threads have a full and proper view of that memory, irrespective of any CPU reordering.

The mutex is an absolute necessity when doing concurrent programming. Given how much it does, however, it should not come as a surprise that it is not a cheap operation. I’ll get to that in another article. For now just understand what the mutex is doing and why you need it in your code.


  1. Great read! If one were going to learn more on this subject, what reading material could you suggest? Books, blogs (besides this one ;) )?

    • Thanks. I don’t know what else to recommend. I tend to scour the sources of things like glibc, read manual pages, and read Linux kernel technical documents.

  2. nice article…

    One question:
    I have a single byte of global memory to which two threads write (no read-modify-write). The code runs on an SMP system. Do I need to use a mutex or any other locking mechanism?

    My guess:
    Since a memory write by the CPU is atomic, I do not need a mutex for atomicity. Since I am only writing to a single byte of global memory, there is no issue of memory re-ordering. I am not updating a complex data structure where multiple LOAD-STORE operations are required.
    Is my understanding correct?

    • It depends on what else you do with that byte of memory. If you never read that byte, ever, then it doesn’t matter and the writes could actually be removed. So I assume you read from that byte at some time. If you don’t care what value that memory has at that time then you’re fine without a mutex. That is, you could get any of the values recently written by any of the threads (hardware guarantees may limit the options a bit more).

      So in short, if you intend on using the value you need some kind of ordering mechanism. It need not be a full lock, there are fences and acquire release semantics (search through my other articles about that).
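
      A sketch of that advice in C++, with invented names, assuming the byte really is read later: `std::atomic` makes the single-byte writes explicit and the release/acquire pair supplies the ordering the reply mentions.

      ```cpp
      #include <atomic>
      #include <cassert>
      #include <cstdint>

      // The single byte of global memory from the question.
      std::atomic<std::uint8_t> flag{0};

      // Either thread may write; the store of a single byte is atomic.
      void write_flag(std::uint8_t v) {
          flag.store(v, std::memory_order_release);
      }

      // Once the value is actually used, the acquire load provides
      // the ordering the reply says you need.
      std::uint8_t read_flag() {
          return flag.load(std::memory_order_acquire);
      }
      ```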

  3. Question:
    In a 2-core system, if one core is writing a global and the other is reading it at the same time, will there be a problem without a mutex? (If I don’t mind reading an old value.)

    What will be the scenario if both cores are writing the global at the same time?

    Thanks in advance

    • For reading it depends on the atomicity guarantees of the CPU. Values up to some limited size, like an int64 on x86_64, are guaranteed to be read and written atomically. On these CPUs you will get either the old or the new value, but never a garbage value. If there is no guarantee then you will get an undefined value.

      With no mutex you also have to be careful that you ever see an update at all. Without any sync instructions the compiler may assume the value never changes and will never read a new value from memory. This often happens in loops which check a global condition.

      For writing the situation is about the same. If atomic then one value will win. If not atomic then the stored value is undefined. The interesting case comes when you write multiple values and read them from another core. Here the transitivity guarantees of the CPU come into play. For example, say you write A then B, and read B from another core. If B is the new value a transitive guarantee says that A will also be the new value. Again, you still need some sync operation in the language to tell the compiler to keep the ordering.
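
      The loop-with-a-global-condition case mentioned above can be sketched in C++ (names invented); marking the flag `std::atomic` forces a fresh read on every iteration instead of letting the compiler hoist the load out of the loop:

      ```cpp
      #include <atomic>
      #include <cassert>
      #include <thread>

      // With a plain bool the compiler may assume `done` never changes
      // and spin forever; std::atomic forces a re-read each iteration.
      std::atomic<bool> done{false};
      std::atomic<long> iterations{0};

      void worker() {
          while (!done.load(std::memory_order_acquire)) {
              iterations.fetch_add(1, std::memory_order_relaxed);
          }
      }
      ```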

  4. Thanks for the great article. I have a question: do mutexes use memory fences even in their constructor functions? If not, we’d have a problem like this: in global scope I have constructed a mutex, then in the main thread I have constructed a thread object, started it, and attempted to lock the previously constructed mutex. What happens if the previously constructed mutex object is not yet visible to the second thread (maybe the CPU hasn’t really constructed it)? So we need memory fences in the mutex’s constructor function, right? But I didn’t see anything like this in QMutex’s constructor function (the Qt Framework’s mutex class). It should be there, am I wrong?

    • You’re correct that the mutex itself must be properly protected, but it’s uncommon for that to become an issue. Note that objects created prior to new threads are guaranteed to be visible in those new threads, so global mutexes are generally safe.

      For dynamic mutexes you need some lookup mechanism, either attached to an object in a collection, or a registry of some kind. The lookup mechanism itself needs to be synchronized, which will naturally synchronize the mutex as well.

      It’s a bit hard to imagine a scenario where you’d have just an unsynchronized mutex; chances are there is some more significant data object that needs sync’ing.

      Note, I should also mention that not all mutex implementations will really need a sync method. If the mutex is a light-weight integer in user-space it may be atomic on most platforms. On other platforms, like Windows, the underlying mutex may be an OS call and it will guarantee appropriate sync for you.
