Defective Language

Blocking and asynchronous operations without timeout are broken

Any time we are waiting for something to happen, from reading the disk to locking a mutex, we need to have a timeout. Without a timeout we run the risk of that operation never actually completing and our program completely hanging. It’s unfortunate that so many languages still have APIs lacking such timeouts.

A typical file hang

File code like below is abundant. I’d be surprised if there were any non-trivial server programs that don’t have such code.

handle = open(some_file)
text = handle.read()

It may not be obvious why this type of code can hang. If the file is on the local disk we generally expect a very quick response and never expect the disk to simply not respond. The problem is with abstraction. In the case of a file it’s never certain where that file really is. It doesn’t matter that open doesn’t accept URLs. The drive could be a network mount, it may be a pipe, or it could be a cloud block device. These things can all result in hung ‘read’ calls.
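As a rough illustration of what a bounded read could look like, here is a minimal Python sketch. The helper name read_with_timeout and its 2 second default are my own assumptions, not part of any standard API. It pushes the blocking calls onto a daemon thread and waits on a queue for at most the given time.

```python
import queue
import threading


def read_with_timeout(path, timeout_s=2.0):
    """Hypothetical helper: read a whole file or give up after timeout_s seconds."""
    result = queue.Queue(maxsize=1)

    def worker():
        # Both open() and read() may block, so both run in the worker.
        try:
            with open(path, "rb") as handle:
                result.put((True, handle.read()))
        except Exception as exc:
            result.put((False, exc))

    # A daemon thread will not keep the process alive if it stays stuck.
    threading.Thread(target=worker, daemon=True).start()
    try:
        ok, value = result.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(f"reading {path!r} took longer than {timeout_s}s") from None
    if not ok:
        raise value  # re-raise the error from open() or read()
    return value
```

Note that this only stops the caller from waiting. The stuck worker thread is abandoned rather than cancelled, which is exactly the kind of workaround a timeout built into the API would make unnecessary.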

Everywhere by default

Any IO function has the potential to block. This includes writing functions; anything which writes can also wait endlessly for a buffer to flush, or for a connection to re-establish. Meta-information functions, like stat, also require a timeout.

Any function which “waits” for something should also have a timeout. This goes beyond IO as it includes locking functions, such as mutexes. The assumption must always be that whatever we’re waiting for may never actually happen.
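Some APIs already allow this. As a small sketch, Python's threading.Lock accepts a timeout on acquire and returns False rather than waiting forever. The function name update_shared_state and the 1 second default are placeholders of mine.

```python
import threading

lock = threading.Lock()


def update_shared_state(timeout_s=1.0):
    # acquire() returns False if the lock was not obtained within timeout_s.
    if not lock.acquire(timeout=timeout_s):
        raise TimeoutError(f"lock not acquired within {timeout_s}s")
    try:
        return "updated"  # placeholder for the real critical section
    finally:
        lock.release()
```

The caller now has to decide what a failed acquire means, instead of silently hanging.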

And this timeout must be applied by default; it is not merely an “available” feature. The default timeout should also be a short period of time, short enough so that any unusual delay will trigger it. This forces programmers to think about what happens in these cases, dealing with the timeout or consciously extending it.

Asynchronous

To be clear, the timeout applies to blocking and asynchronous calls equally. While it’s certainly helpful that a “hanging” async operation doesn’t block processing of other activities, it still results in some operation never making any progress. This usually leads to some external process never getting the response it was waiting for.
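For asynchronous code the same idea can be sketched with Python's asyncio.wait_for, which cancels the awaited operation once the timeout expires. The coroutine never_answers below is a made-up stand-in for an operation whose reply never arrives.

```python
import asyncio


async def never_answers():
    # Stand-in for an async operation whose reply never arrives.
    await asyncio.Event().wait()


async def call_with_timeout(timeout_s=0.5):
    try:
        return await asyncio.wait_for(never_answers(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The hung operation has been cancelled; report the failure.
        return "timed out"
```

Running asyncio.run(call_with_timeout()) returns after roughly half a second instead of leaving a task pending forever.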

Bandwidth

We need to consider the definition of timeout as well. A common definition is simply a period of time where nothing happens. In many APIs, especially network sockets, the timeouts are only triggered if no data is exchanged. I don’t think this is valid.

Consider a situation with low bandwidth. There is little practical difference between a hung connection and one sending data at only 1B/s. Having one time out and the other not seems wrong.

Streaming operations should have minimum bandwidth requirements. If a certain speed is not maintained then it should simply fail.
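A minimum rate check could look something like the following Python sketch. All names here (read_with_min_rate, TooSlowError, the grace_s parameter) are my own assumptions. It measures the average rate since the start and fails once that drops below the requirement after a short grace period.

```python
import time


class TooSlowError(Exception):
    """Raised when a stream falls below the required average rate."""


def read_with_min_rate(stream, min_bytes_per_s, grace_s=1.0,
                       chunk_size=4096, clock=time.monotonic):
    """Hypothetical helper: read a stream to the end, enforcing a minimum rate."""
    start = clock()
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return b"".join(chunks)  # end of stream
        chunks.append(chunk)
        total += len(chunk)
        elapsed = clock() - start
        # Only judge the rate once the grace period has passed.
        if elapsed > grace_s and total / elapsed < min_bytes_per_s:
            raise TooSlowError(
                f"{total} bytes in {elapsed:.1f}s is below {min_bytes_per_s} B/s")
```

The check only runs between reads, so a single read that never returns still needs its own timeout. The two mechanisms complement each other.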

Total timeout

Many HTTP libraries are silly when it comes to timeout handling. We can find different parameters for the DNS lookup, the initial connection, the exchange of headers, and the document exchange. The one thing that very few provide is what I actually want: a total timeout.

I want to specify an upper limit for the time from when I start the HTTP request to the time the document is fully retrieved. I really don’t care why the request has failed.

If I expect a large document, or am streaming, I’d prefer to give an upper limit to the “negotiations” phase and a bandwidth requirement for the document phase. I’ve unfortunately not seen either of these options in an HTTP library before.
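One way a total timeout might be layered over per-phase timeouts is a shared deadline, sketched below in Python. The Deadline class and the resolve, connect and read_body callables are hypothetical placeholders, not the API of any real HTTP library.

```python
import time


class Deadline:
    """Hypothetical helper: a single time budget shared by several phases."""

    def __init__(self, total_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + total_s

    def remaining(self):
        left = self._expires_at - self._clock()
        if left <= 0:
            raise TimeoutError("total time budget exhausted")
        return left


def fetch(deadline, resolve, connect, read_body):
    # resolve, connect and read_body are placeholder callables for the
    # request phases. Each one is handed only the time still left.
    address = resolve(timeout=deadline.remaining())
    connection = connect(address, timeout=deadline.remaining())
    return read_body(connection, timeout=deadline.remaining())
```

Provided each phase honours the timeout it is given, the whole request cannot run much past the total budget, and the caller does not need to care which phase was slow.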

No excuses

It’s unfortunate that libraries are not designed around failures by default. Perhaps 20 years ago this could have been forgiven, but now that even the most trivial of devices are multitasking and network enabled, it’s just not acceptable. All it takes is one minor hiccup to render some programs completely inoperable.

The simple rule is: if we are waiting for some event, we have to assume that it may never happen. We don’t always need advanced error handling; just some way to fail gracefully is often enough. Simply hanging there doing nothing is rarely helpful.

Categories: Defective Language, Programming

3 replies »

  1. Very true! Perhaps the biggest difficulty though, once a timeout is offered by the library, is coming up with sensible defaults. It would be quite a burden to expect programmers to know what they are, and no two use cases are the same. To make matters more complicated, program logic often needs to change from linear to reactive/event-based. The user interface should also accommodate this by showing “pending” ops that can perhaps be canceled, etc. So the effort to do this right, given today’s languages, programming styles, libraries and frameworks, is high, while the payback is often underappreciated by end users or impatient project managers.

    • I’m okay with the defaults being really high, like 10s for typical disk reads/writes. I’d just much prefer that things eventually fail and provide an error message rather than hang indefinitely.

      But the main point for a default is to ensure the library actually has a way to provide timeouts. It’s like an API check and should be part of the test suite for the library.

  2. Then I have another wish to attach to the above — effort should be made to make operations idempotent such that once a time-out does occur, retrying (unattended or human-driven) is safe. For example in a RESTful API, most calls can be made idempotent, even those that create new resources (via POST). A client attaching a client-generated transaction-ID, or some other unique token, can signal the backend that “if you see another one of those, just ignore it”. Many developers assume that just because they POST, side-effects are justified. To me, a POST with side-effects is similar to having a global variable or a go-to statement — they are not illegal but better have good justification.
