As high level languages become more abstract, and offer more convenience features, it is easy to lose track of what the computer is actually doing. While this may be a great boon to productivity, offering a gentler learning curve for new programmers, it also leads to slower and less responsive programs. Many seemingly simple operations require a significant amount of processing time. In the last twenty years computing power has gone through exponential growth, yet none of our applications seem to run any smoother. Sure, they do quite a bit more than before, but the increase in performance should have more than covered that.
Yet in all of this we do have a few programs with exponentially improved efficiency. Most of our audio and video programs, including video games, offer some fine examples of this. But why do some programs run slower than others? Why do two very similar programs run at significantly different speeds? Is it the language they were written in? Is it the experience of the programmer? Is it some tiny feature in one and not the other?
In this series of articles I’d like to cover several topics relating to cost. We will look at how bits of code actually work and what load they place on the system. Hopefully we’ll set you on the path to writing more efficient code and getting substantial improvements in feedback, response time, and throughput.
Let’s first look very generally at the source of the problems though: access, contention, and complexity.
Access
Reading and writing to a disk is perhaps an easy case to understand. Disks are still quite slow compared to memory. It’s best to use an asynchronous API; we all know that. But did you know that simply calling a write function has a cost? Anything that switches contexts, in this case to the OS kernel, has a non-trivial cost associated with it. You have no choice, however: you will have to call OS functions, and they will involve a kernel switch. Nonetheless, there is ample opportunity to reduce the number of calls, and in many cases eliminate them altogether.
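To make that concrete, here is a rough sketch, assuming a POSIX system; the file name, record size, and chunk size are all made up for illustration. Both functions write the same data, but the first pays one kernel switch per record while the second batches records in user space and flushes in large chunks.

```cpp
// A rough sketch (POSIX assumed; file name, record size, and chunk size are
// made up). Both functions write the same data; the first pays one kernel
// switch per record, the second batches in user space and pays far fewer.
#include <fcntl.h>
#include <unistd.h>
#include <string>

constexpr size_t kRecords = 100000;                 // illustrative record count
const std::string kRecord = "some small record\n";  // illustrative record

void write_per_record(int fd) {
    for (size_t i = 0; i < kRecords; ++i)
        (void)write(fd, kRecord.data(), kRecord.size());  // one kernel switch each
}

void write_batched(int fd) {
    std::string buffer;
    buffer.reserve(1 << 20);
    for (size_t i = 0; i < kRecords; ++i) {
        buffer += kRecord;
        if (buffer.size() >= (1 << 20)) {                 // flush in ~1 MiB chunks
            (void)write(fd, buffer.data(), buffer.size());
            buffer.clear();
        }
    }
    if (!buffer.empty())
        (void)write(fd, buffer.data(), buffer.size());    // final flush
}

int main() {
    int fd = open("/tmp/cost_demo.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;
    write_per_record(fd);   // time each variant separately to see the difference
    write_batched(fd);
    close(fd);
}
```

The same idea is why buffered streams exist in nearly every language: the library batches for you so the kernel is bothered as rarely as possible.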
Accessing memory has a cost as well. The main memory of the system is on a completely different set of chips than the CPU. They are, relatively speaking, very far apart. Reading and writing that memory takes a considerable amount of time. And unlike disk access, you can’t really use asynchronous memory access. CPUs instead have all sorts of tricks to predict and preload data, but if your program needs a particular memory location right now, it has no choice but to wait for it to be loaded. Furthermore, using too much memory, or simply using it in the wrong order, can lead to significant performance bottlenecks.
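As a rough illustration of access order, here is a small sketch with an arbitrarily chosen matrix size. It sums the same data twice: once walking memory sequentially, once jumping across it with a large stride. Same arithmetic, same result, but the second walk fights the cache the whole way.

```cpp
// A small sketch of access order: the matrix is stored row by row, so the
// row-major walk touches consecutive addresses while the column-major walk
// strides across the array and defeats the cache. The size is illustrative.
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr size_t kN = 4096;   // large enough that the matrix exceeds the cache

int64_t sum_row_major(const std::vector<int>& m) {
    int64_t total = 0;
    for (size_t row = 0; row < kN; ++row)
        for (size_t col = 0; col < kN; ++col)
            total += m[row * kN + col];    // consecutive addresses
    return total;
}

int64_t sum_col_major(const std::vector<int>& m) {
    int64_t total = 0;
    for (size_t col = 0; col < kN; ++col)
        for (size_t row = 0; row < kN; ++row)
            total += m[row * kN + col];    // jumps of kN * sizeof(int) bytes
    return total;
}

int main() {
    std::vector<int> m(kN * kN, 1);
    std::printf("%lld %lld\n",
                (long long)sum_row_major(m),
                (long long)sum_col_major(m));   // same answer, very different time
}
```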
From disks to memory, bus control to graphics cards, context switches to network switches, everything you do has an associated access cost. Some are trivial yet can degrade quickly, others are slow but offer amazing bulk speed. Some are unavoidable, yet others are simply wasted processing. Understanding where the waiting comes from can greatly improve your coding.
Contention
Shared resources are perhaps the least understood source of performance problems. Everybody understands that a system has finite resources, and that they have to be shared somehow. Most programmers understand the basics of mutexes, or synchronized blocks and critical sections. It is, however, this very basic understanding that can get a program into a lot of trouble. A few false steps and your supposedly multi-threaded program could leave most of the CPU cores idle. A few wrong steps in another direction and you’ll find the vast majority of execution time is spent switching contexts and doing nothing useful.
The cost of contention thus comes in two flavours: underused resources and wasted cycles. If your program has a lot to do, every core in the system should be used. Needing resources, having them available, and not using them is somewhat of a tragedy. On the other hand, if you are doing too much at once the overhead might be so high as to negate any benefit from multi-threading. In some cases it can get so bad that a single thread executing sequentially can out-perform its threaded brethren!
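Here is a small sketch of that first flavour, with thread and iteration counts picked purely for illustration. Every thread fighting over a single mutex to do a trivial increment effectively serializes the work; thread-private counters combined at the end let every core run freely.

```cpp
// A rough sketch (thread and iteration counts are arbitrary). The contended
// version serializes all threads behind one mutex for a trivial increment;
// the uncontended version keeps the work thread-private and combines the
// results once at the end.
#include <cstdio>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kThreads = 8;
constexpr long kItersPerThread = 1000000;

long counter_shared = 0;
std::mutex counter_mutex;

void contended() {
    for (long i = 0; i < kItersPerThread; ++i) {
        std::lock_guard<std::mutex> lock(counter_mutex);  // every thread waits here
        ++counter_shared;
    }
}

void uncontended(long& out) {
    long local = 0;
    for (long i = 0; i < kItersPerThread; ++i)
        ++local;          // thread-private: no locks, no shared cache lines
    out = local;          // publish once at the end
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < kThreads; ++i) threads.emplace_back(contended);
    for (auto& t : threads) t.join();

    std::vector<long> locals(kThreads, 0);
    threads.clear();
    for (int i = 0; i < kThreads; ++i)
        threads.emplace_back(uncontended, std::ref(locals[i]));
    for (auto& t : threads) t.join();

    long total = 0;
    for (long v : locals) total += v;
    std::printf("%ld %ld\n", counter_shared, total);  // same totals, very different cost
}
```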
Our dream of simply throwing more processors at a problem isn’t working out. It takes special care to use all those cores without wasting cycles. This extends past the CPU into memory, disks, and external systems. Sharing them all efficiently can become quite difficult. There is a very good upside, however: once you understand the problem, and write appropriate code, you’ll find making full use of a system is no problem at all.
Complexity
The classic realm of computer science is complexity theory. How many steps, and how much memory, does a program logically need to complete? Often you’ll simply need to know how to choose an appropriate algorithm. Storing structured data is so common that every language has standard collections for you. Yet no language includes the most efficient general-purpose one: a radix tree, or trie. Why not? Do you know how much those inserts and lookups in the standard collections actually cost?
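If you have never measured it, a quick sketch like the one below is worth running. The sizes and key mixing here are arbitrary; the point is simply to put real numbers on the inserts and lookups the standard associative containers hand you for free.

```cpp
// A rough sketch (sizes and key mixing are arbitrary) that times inserts and
// lookups against the two standard associative containers. Neither is a trie;
// the point is only to see what the library's operations really cost.
#include <chrono>
#include <cstdio>
#include <map>
#include <unordered_map>

template <typename Container>
long long time_inserts_and_lookups(Container& c, int n) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) c[i * 2654435761u % n] = i;  // scattered keys
    long long found = 0;
    for (int i = 0; i < n; ++i) found += c.count(i);         // lookups
    auto end = std::chrono::steady_clock::now();
    std::printf("found %lld\n", found);                      // keep the work alive
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

int main() {
    constexpr int kN = 1000000;                  // illustrative size
    std::map<unsigned, int> tree;                // balanced tree: O(log n) plus pointer chasing
    std::unordered_map<unsigned, int> hash;      // hash table: amortized O(1), plus hashing and rehashing
    std::printf("std::map:           %lld ms\n", time_inserts_and_lookups(tree, kN));
    std::printf("std::unordered_map: %lld ms\n", time_inserts_and_lookups(hash, kN));
}
```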
Understanding complexity and redesigning code can bring amazing performance improvements. All other costs aside, if you can reduce the number of steps your program takes tenfold, you’ll generally see a tenfold improvement in performance. Still, we can’t just ignore the other costs in favour of pure complexity analysis. Access times and memory contention play a huge role in choosing an algorithm. Often an array will outperform a tree even when complexity theory dictates otherwise.
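Here is a small example of that last point, with sizes chosen only for illustration: membership tests against a small contiguous array versus a std::set. The scan is O(n) and the tree is O(log n), yet for modest sizes the sequential memory of the array often wins. Measure before you choose.

```cpp
// A rough sketch (sizes are arbitrary) of complexity theory losing to the
// memory system: an O(n) scan over a small contiguous array versus an
// O(log n) walk through scattered tree nodes.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <set>
#include <vector>

int main() {
    constexpr int kSize = 64;          // illustrative: a small, hot lookup table
    constexpr int kLookups = 1000000;

    std::vector<int> vec;
    std::set<int> tree;
    for (int i = 0; i < kSize; ++i) { vec.push_back(i * 7); tree.insert(i * 7); }

    auto t0 = std::chrono::steady_clock::now();
    long hits_vec = 0;
    for (int i = 0; i < kLookups; ++i)
        hits_vec += std::find(vec.begin(), vec.end(), i % (kSize * 7)) != vec.end();
    auto t1 = std::chrono::steady_clock::now();

    long hits_tree = 0;
    for (int i = 0; i < kLookups; ++i)
        hits_tree += tree.count(i % (kSize * 7));
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("vector scan: %ld hits in %.1f ms\n", hits_vec, ms(t1 - t0).count());
    std::printf("set lookup:  %ld hits in %.1f ms\n", hits_tree, ms(t2 - t1).count());
}
```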
But you say you never have to design an algorithm? Well then, either you just aren’t programming or you aren’t paying attention. Your entire code is an algorithm. What applies at the small scale also applies at the large scale. You don’t have to be a math genius, but understanding the basics of complexity is absolutely essential to producing efficient code!
Knowing the Cost
Performance improvements often come down to just understanding the costs involved. Whenever you type a few lines of code, you should have a basic understanding of what they will do at the OS and hardware level. Ignoring the cost will inevitably lead you to writing a slow, under-performing application. Don’t assume better hardware will help you either. Chip speeds are approaching theoretical limits and many devices are already on the slow path of improvement.
So you need to understand the cost of your code. You don’t however need to have a firm grasp on every layer or every circuit on your motherboard. All you need is a proper mindset. You also don’t need to learn hundreds of techniques and tricks ahead of time. Once you start to understand the source of costs, you’ll find it quite easy to make improvements to your code.
I will thus endeavour, through several articles, to give insight into common bottlenecks, traps, and gotchas, and to shed light on the various mysteries of how your code actually works.