Tags

, , , , ,

To support a low-latency trading system I needed a logging call that had minimal impact on the calling thread. It was imperative to be responsive to market events at all times. My solution was a wait-free producer-consumer system. It allowed logging in constant time, around 70ns.

The Basic Design

A key requirement for logging is to write statements, from any thread, in order, to a single log-file. A multiple producer and single consumer model is a good match for this. The pseudo-code for this system may look like below.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
//Producer
write_log( log_statement item ) 
    lock( buffer )
        buffer.append( item )
    consumer_event.notify()


//Consumer
read()
    while true
        consumer_event.wait()
        lock( buffer )
            data = read( buffer )
            log_file.write( data )

This is a standard approach to concurrency. It uses a shared buffer, a vector or list, and a mutex to serialize access. An event notifies the consumer that new data is available in the buffer.

It’s Slow

For most projects I’ve seen this is an acceptable solution. For low-latency coding this approach is awful. It is slow and, more importantly, it has a large variance. Please keep in mind the term “slow” here is relative to the nanosecond scale at which I was measuring the system.

There are two parts that make it slow. The first is the lock itself. Acquiring and releasing a mutex is slow if there is any contention on that mutex. In our project there was often contention. This results in hundreds of nanoseconds used just by the locking alone. I’ll put the signalling in this lot as well since it’s essentially the same kernel operation. The second slow part is memory management for the buffer. A linked list is slow since every log entry requires new memory to be allocated. A vector was faster, except for the time when it needed to be reallocated.

Refer to how a mutex works for more details.

The same two aspects also have variable performance. Locking can block, so it’s uncertain how long writing a log statement will take. The consumer and producer are optimized to hold the lock as briefly as possible, but contention will still be high during bursts of log activity. The memory allocation, and potential buffer copying, also adds variance. Even after heavy optimization I had variances up to the millisecond level!

Off-loading work

Formatting strings, required by a log system, is a slow operation. One of the first improvements was moving all string formatting to the consumer thread. The producer thread assembled a list of arguments; the log_statement in the pseudo-code is a small container of boost::variant items.

To avoid dynamic allocation of individual log items the number of parameters had to be limited. The lower the number the better, but I still needed enough to be useful. I think we settled on around 8 parameters. This wasn’t chosen lightly, it was based on speed tests determining the cost-benefit trade-off of additional parameters.

The consumer thread is responsible for formatting these parameters and writing the string to the log file. There was a dangerous problem waiting in this logic. While the consumer is formatting strings it isn’t reading from the buffers. This gives the buffer a longer time to accumulate a high number of log items, resulting in more memory overhead. This large buffer in turn causes the consumer to take even more time formatting, thus the subsequent buffer could be even larger!

The system had lots of idle time; activity came in bursts. It was important for the consumer to cleanup the buffers quickly during those bursts, and then use the idle time to format the strings. I made the consumer write only a few log statements before returning to buffer cleanup. This slowed down the consumer with added locks and memory use, but prevented the worst of the burst problem.

Literal Strings

Profiling revealed that copying the format string was a significant part of the overall time. Since a string is a dynamic container it had to be dynamically copied. It seemed wasteful since the format strings are actually static: they are known at compile time.

The solution is to pass pointers to the strings instead. I was a bit concerned about the safety of this approach. If somebody accidentally called this with a dynamic string the consumer would end up reading a dangling pointer. I created a literal_string type; the log system only accepted this type and it couldn’t be created implicitly.

I took advantage of a macro trick to make this easy to use. C++ offers automatic string concatenation of string literals.

1
2
3
#define SAFE_LITERAL( str ) "" str

#define WRITE_LOG( str, ... ) write_log( literal_string( SAFE_LITERAL(str) ), params( VA_ARGS ) )

The "" str part only works if str is an actual string literal. The second macro can be used without worrying about the string literal aspect. If something other than a literal is passed compilation fails.

This approach improved the performance of logging.

The Bottleneck

Even after excessive optimization the locking overhead remained. The system had very good average performance but still had an awful variance. This is also a good lesson in profiling: always profile the maximum and variance and not just average times.

The locks had to go. The signalling had to go.

A first idea was to give each producer thread its own buffer. This alone doesn’t help. The resource is still being shared and that means locks must still be used. It does reduce the overall memory use during burst times, but the difference is marginal. Some way is needed to decouple the threads.

One important observation was how our logging thread was behaving. During active times that thread was at 100% utilization. When there was no activity we didn’t care how loaded that thread was. This gave us the option of having the logging thread burn a core. That is, it does a hot polling loop and never relinquishes control to the kernel.

Spin-Lock

The polling consumer opens up one new possibility: spin-locks.

I previously misused the term “lock-free” to refer to a spin-lock. I was always concerned it wasn’t correct, and now another helpful individual has clarified for me. Spin-locks are still definitely locks. The difference is that it is managed entirely in user-space and doesn’t make any kernel calls. The users of a spin-lock use a small loop that continually checks if the lock is available. Though slightly confusing, the Wikipedia article on “non-blocking algorithms” is a good start to get more into the theory of “lock-free” and “wait-free”.

I had sufficient conditions to use a spin-lock. Neither the producer nor consumer needed to hold the lock very long. The consumer thread can be high priority and allowed to spin. This guarantees the consumer acquires the lock for only the briefest of moments, ensuring the producer infrequently finds contention, and then only be blocked briefly.

Brief is of course relative. At the scale of nanoseconds these brief times still caused significant variance in the system.

At this point I’m coding at a fairly low-level. Using spin-locks requires atomic operations and a very good understanding of memory visibility.

Ring Buffer

The next step was to have a completely static memory size and remove all locking. Instead of competition over access to a vector based buffer, I decided to allow simultaneous access to a ring buffer.

The producer writes to the head of the ring buffer. After each write it updates the position of the head. The consumer reads from the tail of the buffer. After each read it updates the tail of the buffer. So long as the head never catches up the tail we’re fine.

This works because both the producer and consumer know they can safely access the memory without any kind of lock. Each log_statement in the buffer is distinct and can safely be accessed in parallel. The information about tail and head position is maintained with atomic operations.

Now, even more than on lock-free, a strong understanding of memory ordering is required. Having two threads access the ring buffer in this manner, in a low-latency system, leaves absolutely no room for error. Recall the consumer is continually polling so it immediately sees changes. I didn’t have access to C++11’s atomic type at the time. I was relying on GCC internals.

I wasn’t content to leave open the “what happens” scenario should the head catch up to the tail. I built in a warning when it reached 2/3 full. If somehow it caught up it then fell back to an actual mutex lock and waited for the consumer to catch up. The consumer placed a priority on cleaning out these buffers. As often as possible it moved data out of the ring buffer into its own private memory. Thus the warning only sometimes appeared, and the actual lock almost never. I used the occurrence of the warning as an indication to increase buffer size.

This approach, without any kind locks, is known as a “wait-free” algorithm. It provides a guarantee that the producer will always make progress. It never waits. For real-time systems this is great since it means the logging code path has a nearly constant execution time (near-zero variance). For us that was around 70ns.

A note on the consumer

The consumer takes the brunt of the work. It’s priority is to clean out the ring buffers. This results in having a lot of pending log statements from several threads mixed up in memory. Before actually writing to the log it needs to sort these statements. This could of course be done afterwards, but we wanted to have sane logs we could look at in real-time.

Formatting was quite slow, so the consumer actually falls behind the producers during any burst of activity. Fortunately, the behaviour of low-latency trading leaves a lot of essentially idle time. When something happens all cores are utilized to 100% for several milliseconds, but then long periods, even up to 1 second, can pass with nothing happening. This idle time was essential for the consumer thread to catch up with logging.

It is also important to use asynchronous IO, which just complicates it more. In order to clear ring buffers in time the consumer can never be waiting on a write operation to complete. I don’t recall now if I actually used async IO, or created yet another thread which simply blocked on IO. Though the actual danger of blocking here is somewhat limited. The file buffers on Linux are relatively large, and even with thousands of log items to write, the amount of data relatively small.

Recommended

I still highly recommend this approach. It is suitable not just for logging, but for any situation where you need to offload processing from a real-time thread. The solution is only practically available to a language like C++. The basic ring buffer can be done in C easily, but to get a clean logging solution would be hard without C++’s conveniences (in particular variants, tuple types, and parametric functions). The requirement to tightly control memory means it outside the reach of any language that doesn’t provide pointers and raw memory access.

Note, if you don’t need nanosecond level optimizations then don’t do this approach. It’s easy to get the atomic operations wrong. It’s also hard to find standard libraries for this type of ring buffer; the performance gains are heavily tied to the exact use case. Most systems can use the normal mutex approach I presented at the beginning of the article.

But of course it’s a fun thing to program. So even if you’ll never actually need it, I do recommend you try it out.

I’m a programmer who has worked on many high performance systems, from trading systems, to responsive web servers, to video games. If your project needs that extra speed boost, or a reduction in resources, then don’t hesitate to enlist my help.