Multi-threaded programming calls for semaphores, synchronized blocks, mutexes, or whatever your language happens to call them. Most of us basically understand why we need them: to prevent multiple-threads from accessing the same memory. Have you ever stopped to think about what they actually do? Why do you need to protect the memory, that sounds like something the CPU should do. Also, in many cases, like a simple instantiation, locks seem to be required yet it isn’t easy to understand why. The answer lies in how the CPU organizes memory, and how it optimizes your code.
For this article we’ll assume the simple case of writing a program with only private memory. That is, it doesn’t share it’s memory with any other device on the system, like a video card, network card, hard drive, or otherwise. We’ll assume the program, and the program alone, is interested in the memory it uses. All interactions with the outside world are done via OS system calls. This is actually a very common scenario.
Memory chips are located physically quite far away from the CPU. If the the CPU had to actually store and load values directly from those chips for each memory access, your computer would be extremely slow. To save itself some effort the CPU keeps a copy of some of this memory in a cache on the chip itself. When you wish to read a value from memory the CPU loads it in a block-wise fashion. When it comes time to writing the CPU buffers all your changes in that same block of memory and writes it sometime later.
First off, when we speak of concurrency we mean a CPU with multiple cores executing code in parallel. A multi-threaded program running on a single core isn’t really an example of concurrent processing. In terms of memory the single core CPU system doesn’t worry about memory coherency at all — it’s one CPU always has a complete and accurate cache state.
The moment a second core is added, or a second CPU, the situation changes drastically. Each core and CPU will have its own caches — something has to keep them in sync. There is something quite interesting about our current chip designs in this regard: the memory doesn’t get out of sync. That is, despite what you think about how memory is working, each core actually has a correct and accurate view of the memory at all times. If two cores happen to be accessing the same memory the CPU takes care of ensuring they are both working with the correct memory. This is something known as cache coherence.
Wait, back up a bit. Did I just say that the memory never gets out of sync? I did. But we all know if we don’t use mutexes our program memory will become horribly corrupt. Something doesn’t seem right. So I should clarify. As far as the CPU is concerned it has a consistent view of the memory. However, and a rather big however, your CPU doesn’t concern itself with data structures, it only concerns itself with primitive types. That is where the problems arise.
Atomicity and cache of fundamentals
Each fundamental type, like an integer, byte, long and double (depending on architecture) can be loaded or stored atomically and without memory corruption. When one core writes an integer to the cache, any other core which then reads from that location will get that new value. To be clear, a core will not read an outdated value from the memory cache. Being atomic means that the entire integer will be written as well. It is not possible that only the first 2 bytes are written before another core reads the value. Atomicity also applies to the read side. When a core reads from a cache it will get the entire value at the time of the read, not half of a previous previous and half of a new value.
The moment you have larger data structures however things aren’t so simple. If your structure is comprised of 4 integers it will take 4 of these atomic writes to commit it to memory. A sequence of atomic operations is not atomic itself. During the writing of those 4 bytes another core can come along and start reading, in which case it may take two of the old values and two of your new values. This is the first role which locks play. If the code in both threads uses a lock around these 4 bytes, the second core won’t read any of the data until you have finished copying all of it. This is the most common and most understood use of synchronization.
Okay, but I must not be telling you everything. If I was, then through clever use of ordering of your statements and assigning memory you’d be able to synchronize the cores without any mutexes what-so-ever. Consider that if you want to add an element to a list, you could first create the element, initialize all the members, and as a final step copy the pointer to the end of the existing list. We know that final operation is atomic, and by the time we do it all the memory in the data structure has already been set. Thus the other thread will either see no additional item, or one entire additional item.
But it doesn’t work! The other thread might see the new pointer before it sees the other initialized values. We have clever CPU optimizations to thank for this. Remember that bit about how memory access is slow. The caches takes care of a lot, but even that doesn’t go far enough. To truly maximize the use of even the fastest cache, the CPU changes the order in which data is written and read from that cache. So while you may have written values 1,2,3 in sequence, the CPU could easily chose to write in the order 1,3,2 instead. If this happens the other thread might get the 3, but not the 2!
This is why it’s actually not enough to just lock access to parts of memory, since the CPU could just reorder memory writes outside of that block. So our standard locks, like a mutex, have an additional feature: a fence. When a CPU encounters a fence it knows that all its reordering of loads and stores has to be complete before it passes the fence. This ensures that if two threads both use a fence, the second thread will see the complete set of written data and not just a part of it.
Obviously this is a rather rough view of how this all works, in particular how the fences solve the problem. They key to understanding the need for a mutex lies in the reordering of memory access and that lack atomicity in large data structures. The typical mutex solves both of these problems. It ensures that only one thread is executing a key piece of code at a time, which in turns limits access to a data structure. It ensures that the both threads have a full and proper view of that memory irrespective of any CPU reordering.
The mutex is an absolute necessity when doing concurrent programming. Given how much it does however it should not come as a surprise that it is not a cheap operation. I’ll get to that in another article however. For now just understand what the mutex is doing and why you need it in your code.