Beware the casual performance test: it will lie to you. It is simple to write something that gives you timings, but it is difficult to make those timings meaningful. A whole host of things can invalidate your results. Your timer could be broken or have too much overhead. An isolated test may run a lot faster than production code due to the processor cache. Perhaps the compiler is simply skipping code in your test. You need to know about these issues if you intend to write meaningful performance tests.
This article focuses primarily on measuring code with short execution times, the parts that make up the whole. I don't go into much detail about how longer processes are broken down for such testing.
Are you actually measuring the time of your code, or something else? The OS may decide to delay your thread in favour of another one. Or it may move your thread to a different processor, incurring switching overhead. If you are using a wall clock (real-world time) you will be measuring everything that happens between your two timer calls. For any code that takes a long time this may affect the results.
This is a difficult problem, as it is virtually impossible to tell when your thread has lost active use of a processor. The simplest solution is to ensure you have enough CPU resources for all threads running at the time. That is, make sure nothing else is running. If you have only a single core, or even two, this becomes quite difficult since some other process is always cropping up.
I recommend against trying to use any measurement of "actual" CPU time used. The resolution of these timers is extremely poor, and their correctness is also in question. Additionally, if your code makes system calls it seems only fair that the overhead of those calls be included in your measurements: they are a part of your code.
Though it's nice to think of our computers as deterministic machines, they don't really behave that way. Even in the most tightly controlled situation you will get variations in your time. Collecting many samples is an absolute must for a valid performance test. If your goal is to optimize the code you will need to combine these results into manageable values. This typically means averaging the results.
But beware the lies of averages. Certain code, especially code that allocates memory, may show huge variation. For this reason my timings always report the minimum, maximum, mean, and standard deviation. Even if the mean time is good, a high variance may cause occasional stalls in your application. For time-sensitive operations this can be a major issue. Often I'll do optimization work not to improve the mean, but to reduce the variance. In some very critical cases you may even wish to allow the mean to increase if it means bringing the variance down.
Finding a timer appropriate for time-based performance testing is vital. The best timer is one capable of measuring a single iteration of the code you wish to test. If your code takes about 50ms you will want a timer with at most 1ms resolution and minimal overhead. On Linux I tend to use clock_gettime. Its performance varies a lot with the kernel version and the actual chipset. If you have a recent kernel and a newer chip it is implemented entirely in user-space and gets nanosecond-level resolution. However, it still has a 20-50 nanosecond overhead (mainly the cost of the chip instruction).
If your operation's time is too close to the overhead you will need to run your code in a loop to get a reasonable reading. A rough way to check this is to start with one iteration of your code and keep doubling it (or multiplying by 10). If you see a significant drop in the per-iteration time there's a chance timer overhead is causing it (unfortunately it is not the only reason this might happen). As you increase the loop count the overhead stays constant and thus spreads over more iterations. Stop once the value stabilizes to get the minimum loop size for this code.
Code Isolation and The Cache
If you are looking to test the performance of a particular algorithm, isolating that bit of code seems to make sense. You set up a nice timer and run the code thousands of times in a row. This gives you a nice number to work with while trying to improve the algorithm. With some changes you manage to get a significant speed improvement. Then you run a higher-level integration test and see absolutely no improvement in time, or perhaps even a regression.
You may have run into a cache issue. The biggest problem with isolating tiny bits of code is that they essentially gain exclusive access to the processor caches, both for data and code. If this code actually runs many times in quick succession in production then your test may be valid. If however the code is not run very often, and rarely more than once or twice in sequence, your minor improvements may show nothing at all. In the latter case your time may be entirely dominated by loading the code and data into the processor cache.
The locality of your data is one of the aspects that is often hard to detect in isolated tests. Under real program conditions you will allocate memory at various times, and the resulting data will be scattered throughout memory, which has a huge negative impact on almost any algorithm. In your test there is a good chance all your allocations are sequential, so your data has good locality and your timings will be oblivious to this issue.
Note on Domain
This type of issue is only relevant for a limited number of scenarios. If you are measuring in the millisecond range then the time to load “code” is not likely relevant. Generally you have to be measuring at the microsecond range to see the impact of loading code — or perhaps just have an excessive amount of code. Your data also has to be small enough to benefit from the lowest cache level. If your data size is more than twice the L1 cache size each iteration will have to load the entire data again. Your timing tests won’t be cheating in this case and cache performance becomes an integral part of your results.
However, the lesson of this result extends to all domains. Consider a database test. If you execute queries in isolation the database gets to dedicate all of its resources (memory, CPU, disk) to performing that query. Any time shared resources come into play (which is most of the time), exclusive access becomes a major influence on timing results.
As this is performance testing you will typically have full optimizations turned on in your compiler. If you’ve never paid much attention before you might be shocked at the number of things the optimizer will do. Of course it will make things go faster, but it might actually render your performance test useless.
The compiler doesn’t see a difference between “real” code and “test” code. When the optimizer looks at what it can do it will simply consider all of your code together. Unlike your production code the test code tends to have a lot of constant values used for test data. There will be tight loops and a limited amount of variation. This type of setup provides the optimizer with a lot of possibilities. If it sees shortcuts to the calculation it will take them. If it sees the results are not used it may not even call the code at all.
With any reasonably sized block of code, layered through several functions, the optimizer is unlikely to invalidate your test. But as you isolate what you are testing and the size of the tested code decreases significantly, optimizer meddling becomes a real possibility. Here it is handy to know some of your compiler-specific extensions, at minimum how to prevent a function from being inlined. It is also vitally important that you use the result of the code you call, lest it simply be omitted. Either print some value or write it to a volatile memory location.
There is no perfect solution here. You have to be aware of what is happening and be attentive to results that just don’t seem realistic. You can’t entirely cripple your code since that would also be unfair. The optimizer does apply to the production build, just with a slightly different setup.
One way to avoid interruptions in your timings is to try to give your thread exclusive access to a particular processor. The first thing to set is processor affinity, which indicates you want a thread to run on a particular processor. This doesn't actually exclude other threads from running on that processor though. You can also give your thread a real-time priority, which the kernel will then treat as top priority on that processor (nothing else runs until it is done).
Unless you are able to control the affinity and priority of a thread in your actual program I don't recommend taking this approach during performance testing. It will likely produce better results, but will they be realistic results? The averaged results from running in a normal thread may be more accurate.
This is actually a good avenue for improving performance in time-critical systems: specifying all affinities and increasing thread priorities. It helps for the same reasons these factors corrupt performance tests: there will be fewer interrupting threads, and cache use will be more exclusive to this smaller set of code and data. Getting this right is not a simple topic however, so stay away from it unless you're willing to invest some serious time into getting it right.
CPU Madness: Scaling and Hyperthreads
Chips are designed with several goals in mind. Not just speed, but power efficiency, cost efficiency, and marketing efficiency also drive what happens in the processor. This means that modern chips do crazy things which will wreak havoc on your performance testing. Two of the big offenders are frequency scaling and hyperthreads.
Frequency scaling is a power-saving technique by which your processor won't run at full speed when it is idle. This means that when you do run something it takes a while to ramp up to full speed. If you run a performance test from a command line you can see this: run the same test several times in a row and the first one takes longer. The first run suffered the overhead of going from an idle state (you at the command prompt) to full speed (in the timing iteration), while the following runs benefitted from already being at top speed. This scaling can also be per-processor, so if the kernel moves your thread around you may suffer this overhead multiple times. While doing performance testing I recommend simply turning this off and running in high-performance mode.
Hyperthreading is primarily a marketing trick used by Intel. The chip lies to the operating system about how many processors are present: instead of just 2 cores it will report 4. So even if you've taken the effort to set the processor affinity for your thread, another thread could still interfere with it. The situations where hyperthreading can improve performance are few, and I admit I've never encountered even one. For a single program it is unlikely to ever help, and certainly not for code working with the same data. I recommend turning hyperthreading off to get usable performance test numbers (turning it off also consistently improves my compilation times).
High Level Inline Sampling
A lot of the above applies to measuring fractions of the full program. Such targeted testing is a good way to improve parts of the code, but it doesn't really tell us much about the big picture. How do our small changes actually translate into real-world performance in the application? If you get too lost in the details you'll end up making lots of improvements which simply aren't visible at the application level.
Inline sampling means you'll select some key processes within your application. You will instrument these with timing code, probably the same code you use for the isolated testing. Unlike isolated testing, your test driver is the program itself: the execution naturally ends up at the timed section. Average the timing over a specific number of calls and then output the results. Here it is almost vital to have a timer which measures variance as well as the average, since you should expect a lot more variation at this high level.
High-level measurements offer another advantage: they are somewhat immune to the problems I talked about earlier. That is, they are certainly impacted by memory caching and thread interruption, but at the high level these are almost certainly a legitimate part of the measurement. If you instrument your actual running code you get far closer to real running conditions than isolated tests do. Indeed, you can often leave such instrumentation in place and collect data from deployed instances.
Timing at the high-level is a great way to measure performance, and check for improvements, but it is not really a great way to reach those improvements. Once you’ve identified the process in the program you’d like to improve, and have the measurements, you will need to resort to lower level performance tests. Each time you step down a layer the pitfalls described here become more and more relevant.
Designing performance tests which give an accurate impression of algorithm speed is quite difficult. The issues presented here can give you a false impression of your code. Low level benchmarks are only a tool used in locating possible bottlenecks. Logical reasoning is also going to play a significant role. Often you can reason that certain changes will be improvements even if you can’t find a solid way to measure it directly.