
Wait-free queueing and ultra-low latency logging

To support a low-latency trading system I needed a logging call that had minimal impact on the calling thread. It was imperative to be responsive to market events at all times. My solution was a wait-free producer-consumer system. It allowed logging in constant time, around 70ns.

The Basic Design

A key requirement for logging is to write statements, from any thread, in order, to a single log file. A multiple-producer, single-consumer model is a good match for this. The pseudo-code for such a system looks something like this:

//Producer
write_log( log_statement item ) 
    lock( buffer )
        buffer.append( item )
    consumer_event.notify()


//Consumer
read()
    while true
        consumer_event.wait()
        lock( buffer )
            data = read( buffer )
            log_file.write( data )

This is a standard approach to concurrency. It uses a shared buffer, a vector or list, and a mutex to serialize access. An event notifies the consumer that new data is available in the buffer.
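
For concreteness, here’s what that design might look like as a minimal sketch in modern C++, using std::mutex and std::condition_variable (the log_statement struct below is just a placeholder, not the real structure described later):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Placeholder entry; the real log_statement is described later in the article.
struct log_statement { std::string text; };

std::mutex buffer_mutex;
std::condition_variable consumer_event;
std::vector<log_statement> buffer;
std::ofstream log_file( "app.log" );

// Producer: called from any thread.
void write_log( log_statement item )
{
    {
        std::lock_guard<std::mutex> lock( buffer_mutex );
        buffer.push_back( std::move( item ) );
    }
    consumer_event.notify_one();
}

// Consumer: runs on its own thread.
void read_loop()
{
    while( true )
    {
        std::vector<log_statement> pending;
        {
            std::unique_lock<std::mutex> lock( buffer_mutex );
            consumer_event.wait( lock, []{ return !buffer.empty(); } );
            pending.swap( buffer );  // take everything and release the lock quickly
        }
        for( auto const & item : pending )
            log_file << item.text << '\n';
    }
}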

It’s Slow

For most projects I’ve seen this is an acceptable solution. For low-latency coding this approach is awful. It is slow and, more importantly, it has a large variance. Please keep in mind the term “slow” here is relative to the nanosecond scale at which I was measuring the system.

There are two parts that make it slow. The first is the lock itself. Acquiring and releasing a mutex is slow if there is any contention on it, and in our project there often was. The locking alone consumed hundreds of nanoseconds. I’ll lump the signalling in with this as well, since it’s essentially the same kind of kernel operation. The second slow part is memory management for the buffer. A linked list is slow since every log entry requires a new memory allocation. A vector was faster, except for the times when it needed to be reallocated.

Refer to how a mutex works for more details.

The same two aspects also have variable performance. Locking can block, so it’s uncertain how long writing a log statement will take. The consumer and producer are optimized to hold the lock as briefly as possible, but contention will still be high during bursts of log activity. The memory allocation, and potential buffer copying, also adds variance. Even after heavy optimization I had variances up to the millisecond level!

Off-loading work

String formatting, which any log system requires, is a slow operation. One of the first improvements was moving all string formatting to the consumer thread. The producer thread just assembled a list of arguments; the log_statement in the pseudo-code is a small container of boost::variant items.

To avoid dynamic allocation of individual log items the number of parameters had to be limited. The lower the number the better, but I still needed enough to be useful. I think we settled on around 8 parameters. This wasn’t chosen lightly; it was based on speed tests weighing the cost-benefit trade-off of additional parameters.
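
As a rough sketch of the shape (the exact fields are illustrative, and std::variant stands in here for the boost::variant we actually used):

#include <array>
#include <cstdint>
#include <variant>

// One log parameter: a small set of cheap-to-copy types.
using log_param = std::variant<std::monostate, std::int64_t, double, char, const char *>;

struct log_statement
{
    const char * format;              // points at a string literal (see the next section)
    std::uint64_t timestamp;          // high-resolution timestamp
    std::uint32_t thread_id;
    std::array<log_param, 8> params;  // hard limit of 8 parameters, no heap allocation
};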

The consumer thread is responsible for formatting these parameters and writing the string to the log file. There was a dangerous problem waiting in this logic. While the consumer is formatting strings it isn’t reading from the buffers. This gives the buffer a longer time to accumulate a high number of log items, resulting in more memory overhead. This large buffer in turn causes the consumer to take even more time formatting, thus the subsequent buffer could be even larger!

The system had lots of idle time; activity came in bursts. It was important for the consumer to clean out the buffers quickly during those bursts, and then use the idle time to format the strings. I made the consumer write only a few log statements before returning to buffer cleanup. This slowed down the consumer with added locking and memory use, but prevented the worst of the burst problem.

Literal Strings

Profiling revealed that copying the format string was a significant part of the overall time. Since a string is a dynamic container it had to be dynamically copied. It seemed wasteful since the format strings are actually static: they are known at compile time.

The solution is to pass pointers to the strings instead. I was a bit concerned about the safety of this approach. If somebody accidentally called this with a dynamic string the consumer would end up reading a dangling pointer. I created a literal_string type; the log system only accepted this type and it couldn’t be created implicitly.

I took advantage of a macro trick to make this easy to use. C++ automatically concatenates adjacent string literals.

#define SAFE_LITERAL( str ) "" str

#define WRITE_LOG( str, ... ) write_log( literal_string( SAFE_LITERAL( str ) ), params( __VA_ARGS__ ) )

The "" str part only works if str is an actual string literal. The second macro can be used without worrying about the string literal aspect. If something other than a literal is passed compilation fails.

This approach improved the performance of logging.

The Bottleneck

Even after excessive optimization the locking overhead remained. The system had very good average performance but still had an awful variance. This is also a good lesson in profiling: always profile the maximum and variance and not just average times.

The locks had to go. The signalling had to go.

A first idea was to give each producer thread its own buffer. This alone doesn’t help. The resource is still being shared and that means locks must still be used. It does reduce the overall memory use during burst times, but the difference is marginal. Some way is needed to decouple the threads.

One important observation was how our logging thread was behaving. During active times that thread was at 100% utilization. When there was no activity we didn’t care how loaded that thread was. This gave us the option of having the logging thread burn a core. That is, it does a hot polling loop and never relinquishes control to the kernel.

Spin-Lock

The polling consumer opens up one new possibility: spin-locks.

I previously misused the term “lock-free” to refer to a spin-lock. I was always concerned that wasn’t correct, and another helpful individual has since clarified it for me. Spin-locks are still definitely locks. The difference is that a spin-lock is managed entirely in user-space and doesn’t make any kernel calls. A thread acquiring a spin-lock runs a small loop that continually checks whether the lock is available. Though slightly confusing, the Wikipedia article on “non-blocking algorithms” is a good starting point for the theory behind “lock-free” and “wait-free”.

I had the right conditions for a spin-lock. Neither the producer nor the consumer needed to hold the lock very long. The consumer thread can be high priority and allowed to spin. This guarantees the consumer acquires the lock for only the briefest of moments, ensuring the producer rarely finds contention and, when it does, is only blocked briefly.

Brief is of course relative. At the scale of nanoseconds these brief times still caused significant variance in the system.

At this point I’m coding at a fairly low level. Using spin-locks requires atomic operations and a very good understanding of memory visibility.
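
For reference, a bare-bones spin-lock is just an atomic flag and a busy loop. A sketch using C++11’s std::atomic_flag (not the exact code we used):

#include <atomic>

class spin_lock
{
public:
    void lock()
    {
        // Busy-wait until this thread is the one that sets the flag.
        while( flag_.test_and_set( std::memory_order_acquire ) )
            ;  // optionally issue a CPU "pause" hint here
    }
    void unlock()
    {
        flag_.clear( std::memory_order_release );
    }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};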

Ring Buffer

The next step was to have a completely static memory size and remove all locking. Instead of competition over access to a vector based buffer, I decided to allow simultaneous access to a ring buffer.

The producer writes to the head of the ring buffer. After each write it updates the position of the head. The consumer reads from the tail of the buffer. After each read it updates the position of the tail. So long as the head never catches up to the tail we’re fine.

This works because both the producer and consumer know they can safely access the memory without any kind of lock. Each log_statement in the buffer is distinct and can safely be accessed in parallel. The information about tail and head position is maintained with atomic operations.

Now, even more than with lock-free code, a strong understanding of memory ordering is required. Having two threads access the ring buffer in this manner, in a low-latency system, leaves absolutely no room for error. Recall the consumer is continually polling, so it immediately sees changes. I didn’t have access to C++11’s atomic types at the time; I was relying on GCC internals.
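
The shape of such a single-producer, single-consumer ring is roughly as follows. This sketch uses C++11 atomics for clarity, the capacity is assumed to be a power of two so indices wrap with a mask, and the near-full warning and fallback are shown separately below:

#include <atomic>
#include <cstddef>

// Single-producer, single-consumer ring buffer.
// Capacity must be a power of two so indices wrap with a simple mask.
template<typename T, std::size_t Capacity>
class spsc_ring
{
public:
    // Producer side: returns false if the buffer is full.
    bool try_push( const T & item )
    {
        const std::size_t head = head_.load( std::memory_order_relaxed );
        const std::size_t next = ( head + 1 ) & ( Capacity - 1 );
        if( next == tail_.load( std::memory_order_acquire ) )
            return false;  // full: the caller takes the slow path
        slots_[head] = item;
        head_.store( next, std::memory_order_release );  // publish to the consumer
        return true;
    }

    // Consumer side: returns false if the buffer is empty.
    bool try_pop( T & out )
    {
        const std::size_t tail = tail_.load( std::memory_order_relaxed );
        if( tail == head_.load( std::memory_order_acquire ) )
            return false;  // empty
        out = slots_[tail];
        tail_.store( ( tail + 1 ) & ( Capacity - 1 ), std::memory_order_release );
        return true;
    }

    // Approximate fill level, good enough for a "getting full" warning.
    std::size_t size() const
    {
        const std::size_t head = head_.load( std::memory_order_relaxed );
        const std::size_t tail = tail_.load( std::memory_order_relaxed );
        return ( head - tail ) & ( Capacity - 1 );
    }

private:
    T slots_[Capacity];
    std::atomic<std::size_t> head_{ 0 };  // only ever written by the producer
    std::atomic<std::size_t> tail_{ 0 };  // only ever written by the consumer
};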

I wasn’t content to leave open the “what happens” scenario should the head catch up to the tail. I built in a warning when it reached 2/3 full. If somehow it caught up it then fell back to an actual mutex lock and waited for the consumer to catch up. The consumer placed a priority on cleaning out these buffers. As often as possible it moved data out of the ring buffer into its own private memory. Thus the warning only sometimes appeared, and the actual lock almost never. I used the occurrence of the warning as an indication to increase buffer size.
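
The producer-side wrapper then looks roughly like this sketch; the capacity of 4096 is arbitrary, and emit_buffer_warning and wait_for_consumer_to_drain are stand-ins for the warning and mutex fallback just described:

void emit_buffer_warning();          // stand-in: report "buffer is 2/3 full, enlarge it"
void wait_for_consumer_to_drain();   // stand-in: slow path that blocks on a real mutex

void push_log( spsc_ring<log_statement, 4096> & ring, const log_statement & item )
{
    if( ring.size() > ( 2 * 4096 ) / 3 )
        emit_buffer_warning();

    while( !ring.try_push( item ) )
    {
        // Almost never reached: wait for the consumer to make room.
        wait_for_consumer_to_drain();
    }
}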

This approach, without any kind of locks, is known as a “wait-free” algorithm. It provides a guarantee that the producer will always make progress. It never waits. For real-time systems this is great since it means the logging code path has a nearly constant execution time (near-zero variance). For us that was around 70ns.

A note on the consumer

The consumer takes the brunt of the work. Its priority is to clean out the ring buffers. This results in a lot of pending log statements from several threads mixed up in memory. Before actually writing to the log it needs to sort these statements. This could of course be done afterwards, but we wanted sane logs we could look at in real time.
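
In outline, and building on the sketches above, the consumer loop was something like this (format_and_write is a stand-in for the formatting and file IO):

#include <algorithm>
#include <cstddef>
#include <vector>

void format_and_write( const log_statement & item );  // stand-in for formatting + file IO

// One ring per producer thread; the consumer polls them all.
void consumer_loop( std::vector<spsc_ring<log_statement, 4096> *> & rings )
{
    std::vector<log_statement> pending;
    while( true )
    {
        // Priority one: drain every ring into private memory.
        for( auto * ring : rings )
        {
            log_statement item;
            while( ring->try_pop( item ) )
                pending.push_back( item );
        }

        // Merge the per-thread entries into global time order.
        std::sort( pending.begin(), pending.end(),
            []( const log_statement & a, const log_statement & b )
            { return a.timestamp < b.timestamp; } );

        // Format and write only a handful before checking the rings again.
        const std::size_t batch = std::min<std::size_t>( 10, pending.size() );
        for( std::size_t i = 0; i < batch; ++i )
            format_and_write( pending[i] );
        pending.erase( pending.begin(), pending.begin() + batch );
    }
}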

Formatting was quite slow, so the consumer actually falls behind the producers during any burst of activity. Fortunately, the behaviour of low-latency trading leaves a lot of essentially idle time. When something happens all cores are utilized to 100% for several milliseconds, but then long periods, even up to 1 second, can pass with nothing happening. This idle time was essential for the consumer thread to catch up with logging.

It is also important to use asynchronous IO, which complicates things further. In order to clear the ring buffers in time the consumer can never be stuck waiting on a write operation to complete. I don’t recall now whether I actually used async IO, or created yet another thread which simply blocked on IO. The actual danger of blocking here is somewhat limited, though: the file buffers on Linux are relatively large, and even with thousands of log items to write, the amount of data is relatively small.

Recommended

I still highly recommend this approach. It is suitable not just for logging, but for any situation where you need to offload processing from a real-time thread. The solution is only practically available in a language like C++. The basic ring buffer can be done in C easily, but a clean logging solution would be hard without C++’s conveniences (in particular variants, tuple types, and parametric functions). The requirement to tightly control memory puts it outside the reach of any language that doesn’t provide pointers and raw memory access.

Note, if you don’t need nanosecond-level optimizations then don’t use this approach. It’s easy to get the atomic operations wrong. It’s also hard to find standard libraries for this type of ring buffer; the performance gains are heavily tied to the exact use case. Most systems can use the normal mutex approach I presented at the beginning of the article.

But of course it’s a fun thing to program. So even if you’ll never actually need it, I do recommend you try it out.

I’m a programmer who has worked on many high performance systems, from trading systems, to responsive web servers, to video games. If your project needs that extra speed boost, or a reduction in resources, then don’t hesitate to enlist my help.

28 replies

    • Each thread had its own ring-buffer. This removed any need to coordinate between producers. The consumer then checked all of the ring buffers.

      The first call to log would create the ring-buffer if the thread didn’t have one. A thread teardown function (using boost’s thread_local_ptr) would mark the ring-buffer as free once the thread was finished. The consumer would then remove unused buffers. However, in our situation we weren’t really creating/destroying threads other than at startup/shutdown.
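
      A rough sketch of that per-thread setup, using the C++11 thread_local keyword as a stand-in for the boost thread-local pointer, and building on the ring buffer sketch in the article (the registry and names are illustrative):

      #include <mutex>
      #include <vector>

      // Registry of per-thread rings, scanned by the consumer.  Registration
      // only happens at thread startup, so a mutex here is fine.
      std::mutex registry_mutex;
      std::vector<spsc_ring<log_statement, 4096> *> ring_registry;

      spsc_ring<log_statement, 4096> & this_thread_ring()
      {
          // Created lazily on the first log call from each thread; the
          // teardown/unregister step is omitted for brevity.
          thread_local spsc_ring<log_statement, 4096> ring;
          thread_local bool registered = false;
          if( !registered )
          {
              std::lock_guard<std::mutex> lock( registry_mutex );
              ring_registry.push_back( &ring );
              registered = true;
          }
          return ring;
      }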

    • How were you able to ensure log was getting written in precise order of events, if each producer had its own ring buffer? Was normal clock sufficient to order the events across producers?

    • The high-precision timestamps (clock_gettime, or the RDTSC op if I recall correctly) are precise enough to ensure ordering in the consumer. It would scan all rings and sort the entries. This clock is guaranteed to be in sync between the cores as well (in earlier CPU generations it was not, I’ve read). Reading the time was actually a significant part of the cost of creating a log entry.
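
      For illustration, reading such a timestamp on Linux might look like this sketch (the original may well have read RDTSC directly):

      #include <cstdint>
      #include <time.h>  // POSIX clock_gettime

      // Nanosecond timestamp used to order entries across the per-thread rings.
      inline std::uint64_t log_timestamp()
      {
          timespec ts;
          clock_gettime( CLOCK_MONOTONIC, &ts );
          return static_cast<std::uint64_t>( ts.tv_sec ) * 1000000000ull
              + static_cast<std::uint64_t>( ts.tv_nsec );
      }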

    • Yes, I created my own. At the time I looked around and couldn’t find anything suitable (I’m not sure if spsc_queue existed at the time).

      By this point I also knew exactly what I wanted out of my ring buffer and was certain how to optimize it for what we needed. The tailored solution probably buys a bit of performance. It’d be hard to know without profiling both approaches.

  1. Very good work. It would be interesting to compare this approach with the Disruptor pattern, which also uses a ring buffer as a queue to reduce contention and optimize CPU caching, but was designed for few producers and many consumers.

    • I’m not familiar with the Disruptor pattern, but from what I read it sounds familiar. Our message passing used more traditional vectors and swapped pointers (contention was higher and we were forced to use mutexes). Though the pattern of reserving space for a message, then committing it, is something we used. It doesn’t seem like much, but avoiding that one copy operation does help.

    • I suspect anybody with a need for efficient message passing between threads, and has cores to spare, will end up at a similar solution.

      I wonder how they did it in Java, or if they’re missing a few tweaks. Some of the stuff I did required operations I wouldn’t even know how to do in Java. One of our big gains was a friendly high-level call, which also used features that Java doesn’t have (okay, in fairness, it has them, but not in a fashion that is efficient enough for what I want).

    • It would be good to have a real point of comparison for the Java vs C++ debate.

      You can find a very high quality implementation of a single-producer single-consumer queue written in Java here, by one of the guys behind the Disruptor.

      https://github.com/mjpt777/examples/tree/master/src/java/uk/co/real_logic/queues

      I have benched the most optimised version at ~92 million messages per second. I would be really curious to know how that compares to the C++ implementation.

      What was the friendly high-level-call you mentioned above?

    • You have to be attentive to what exactly you are passing between threads. Message passing can be faster if you are passing pointers (or Java references) to the messages and letting the other thread clean them up. What you likely aren’t measuring then is the overhead of creating that initial message.

      The friendly high-level call looked just like printf with a log-level:

      log( info, "hello|%d", 13 );
      log( error, "it-failed|%f;%d", floatVar, intVar );

    • It is definitely the case that the benchmark cited above does not measure any kind of message creation or cleanup, so it’s not an apples-to-apples comparison for sure.

      I’ve not looked into fast string formatting in Java. Given that variadic functions in Java are just sugar for creating a new array I can imagine it would be very difficult to match the performance of c++ for this instance.

  2. At 1 GHz, 70ns is 70 cycles, or roughly 70 instructions. Assuming you have a 4 GHz processor, that means your logging is complete in 280 instructions, including copying the data. What am I missing?

    • The machines were around 4 GHz, yes. And you’re correct, the logging was around that number of instructions. This is why reducing the amount of data to copy, and the number of copies, is very helpful.

      Note that CPU instructions vary in duration. You can move multiple bytes in one cycle, some branches are predicted, some things reordered.

      There was still some variance in the speed. I think it went up 10-20ns if you used all of the parameters.

    • I made sure the priority of the consumer was to flush the ring buffers to avoid them ever getting full. This was kind of messy. The consumer could never do much at once, always having to return to checking the queues. When it did find something it’d copy it to a temporary space. Once all the queues were empty it’d sort the entries. Then it’d start writing out log entries. After several entries, like 10, it’d check again. If more was found it’d have to re-sort (just the new entries).

      If the buffer ever got full the producer used a genuine lock to wait on the consumer. It rarely happened due to the consumer flushing quickly and setting the ring buffer size high enough. You can’t just set a huge buffer size though since then you lose memory locality and incur yet another penalty.

  3. You mention how you don’t want to measure just average latencies. What kind of stdDev are you seeing on the log time? How far away does it drift from 70ns?

    Do you deal with cache-line alignment / false sharing explicitly?

    • It was very stable, only several nanoseconds of variance if I recall. There was a slight variation based on the number of parameters used for the log statement. Likely CPU caching was behind the remaining small variance. But it was low enough to be considered none: you could probably say that no statement ever took more than 90-100ns.

      Structures in the ring-buffer were cache-line aligned. The consumer though only ever modified enough data to cover one cache-line, thus false sharing was never really an issue.

    • Is this 70ns entirely encapsulated in one thread, or does it include the time to write to the queue? What is the work actually being done? Is it just a matter of filling in the formatter and writing to yet another buffer?

      I remember reading somewhere that the theoretical minimum for crossing the CPU boundary (non-hyperthreaded) is ~65ns. Have you measured this element?

    • The 70ns measures the entire time for a high-level log call to place data on the queue and return to the caller. For example:

      log( warning, "danger|p1=%d;p2=%f", intVar, floatVar );

      The time is for this call to complete. That includes creating the `log_statement` structure, copying the string literal pointer, copying the parameters, adding standard fields, like time and thread id. This is then placed on the ring buffer (well, actually the data is directly copied to the ring buffer, and at the end the tail pointer is updated).

      How quickly the buffer could be cleared is of course a different matter. I didn’t measure exactly how long that took, but I did know at what rate items could be cleared (in terms of entries per second). I don’t recall that number.

      I attempted to measure the low-level time to cross-cpu boundaries using a ping-pong scheme. I recall it being lower than 65ns, but I don’t know the actual result anymore. I had created many such tests to better explore the CPU and understand where improvements could be made.

      To recap, 70ns was the time experienced by the calling thread to get a log statement into the queue.

  4. > This approach, without any kind of locks, is known as a “wait-free” algorithm. It provides a guarantee that the producer will always make progress. It never waits.

    As a clarification, the use of spinlocks necessarily means that the solution uses locks. Also, it’s not wait-free because it’s possible that a userspace thread will be preempted while holding a lock (userspace does not relinquish the CPU to the kernel. The kernel decides when userspace gets to use the CPU). This would result in the other threads spinning while trying to acquire the lock, i.e., making no progress. This scenario won’t happen when using spinlocks in the kernel, with the exception of interrupt handlers, which can anyway be disabled when acquiring the lock.

    The main benefit I think you are seeing with spinlocks is that the scheduler won’t automatically sleep the consumer thread if there’s contention. But who knows, I could be way off.

    • There are two main reasons why spinlocks are still sometimes useful, and can be a lot faster than normal mutexes. The first is they avoid the overhead of going to the kernel. Pretty much any kernel call, regardless of what it does, has a significant overhead. I measured some of this once, and something like freeing a mutex can take over 100ns, if not closer to 200ns (I don’t remember exactly now). The whole lock/unlock cycle then has several hundred nanoseconds of potential overhead.

      The second reason is losing the CPU. If you don’t call the kernel it doesn’t have a chance to pause your thread and give the CPU to another one. At the time scales we were looking at, preemption wasn’t an issue. The workload tends to be finished within a time slice for most threads. This allows the cache memory to be retained for the whole workload, whereas losing the CPU causes the caches to be overwritten.

      But just to be clear, the final solution does not involve spin locks. It truly is a wait-free algorithm. Assuming the buffer isn’t full, there is nothing that can cause the producer to block, or even pause. It just writes to memory, issues one or two atomic operations, and continues.

  5. What happens if there is a power failure while trade logs are in the circular ring buffer (memory) but the consumer has not drained (or flushed) the buffer to a persistent device? Are there any requirements in trading systems to guarantee 100% persistence of ALL the trade logs?

    • Speed trumps most things in this domain.

      Power loss is unlikely as the data centers usually have battery backup. But still, some other disaster could happen.

      The ring buffer adds only a tiny delay on top of the system. The eventual writing to the actual disk (or network) is the slowest part. Nobody would agree to waiting for a proper disk flush before continuing; it would be way too slow.

      In the case of breakdown it’s a manual job of reconstructing from the local logs and the exchange logs what the positions are.
