Any time we are waiting for something to happen, from reading the disk to locking a mutex, we need to have a timeout. Without a timeout we run the risk of that operation never actually completing and our program completely hanging. It’s unfortunate that so many languages still have APIs lacking such timeouts.
A typical file hang
File-handling code like the following is abundant. I’d be surprised if there were any non-trivial server programs that don’t have such code.
    handle = open(some_file)
    text = handle.read()
It may not be obvious why this type of code can hang. If the file is on the local disk we generally expect a very quick response and never expect the disk to simply not respond. The problem is with abstraction. In the case of a file it’s never certain where that file really is. It doesn’t matter that open doesn’t accept URLs. The drive could be a network mount, it may be a pipe, or it could be a cloud block device. These things can all result in hung ‘read’ calls.
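As a rough illustration of what a bounded read could look like, here is a minimal sketch that pushes the read onto a helper thread and gives up waiting after a fixed time. The name read_with_timeout and the two-second default are my own assumptions, not part of any standard API. Note the limitation: the stuck thread is abandoned, not cancelled, so this only stops the caller from hanging.

```python
import threading


def read_with_timeout(path, timeout_s=2.0):
    """Read a whole file, but stop waiting after timeout_s seconds.

    Hypothetical helper: the blocked worker thread is abandoned, not
    cancelled, so the underlying read may still be stuck in the background.
    """
    result = {}

    def worker():
        try:
            with open(path, "rb") as handle:
                result["data"] = handle.read()
        except OSError as exc:
            result["error"] = exc

    # A daemon thread does not keep the interpreter alive if it never returns.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        raise TimeoutError(f"reading {path!r} took longer than {timeout_s}s")
    if "error" in result:
        raise result["error"]
    return result["data"]
```

A caller would then handle TimeoutError the same way it handles any other failed read, instead of waiting forever.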
Everywhere by default
Any IO function has the potential to block. This includes writing functions; anything which writes can also wait endlessly for a buffer to flush, or for a connection to re-establish. Meta-information functions, like stat, also require a timeout.
Any function which “waits” for something should also have a timeout. This goes beyond IO as it includes locking functions, such as mutexes. The assumption must always be that whatever we’re waiting for may never actually happen.
And this timeout must be applied by default; it is not merely an “available” feature. The default timeout should also be a short period of time, short enough so that any unusual delay will trigger it. This forces programmers to think about what happens in these cases, dealing with the timeout or consciously extending it.
To be clear, the timeout applies to blocking and asynchronous calls equally. While it’s certainly helpful that a “hanging” async operation doesn’t block processing of other activities, it still results in some operation never making any progress. This usually leads to some external process never getting the response it was waiting for.
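For the asynchronous case, a small sketch using asyncio.wait_for shows the same principle. The coroutine never_finishes stands in for any operation that hangs, and call_with_deadline is a hypothetical wrapper name of my own.

```python
import asyncio


async def never_finishes():
    """Stand-in for an async operation that hangs: the event is never set."""
    await asyncio.Event().wait()


async def call_with_deadline(coro, timeout_s):
    """Await coro, but cancel it and return None if it exceeds timeout_s."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the inner operation at this point.
        return None
```

Without the wrapper, the event loop would stay responsive but whoever awaits never_finishes() would simply never get an answer.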
We need to consider the definition of timeout as well. A common definition is simply a period of time where nothing happens. In many APIs, especially network sockets, the timeouts are only triggered if no data is exchanged. I don’t think this is valid.
Consider a situation with low bandwidth. There is little practical difference between a hung connection and one sending data at only 1 B/s. Having one time out and the other not seems wrong.
Streaming operations should have minimum bandwidth requirements. If a certain speed is not maintained then it should simply fail.
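A minimum bandwidth check could be sketched roughly as follows. The names read_with_min_rate and TooSlowError, the grace period, and the injectable clock are all assumptions for illustration. The check only runs between reads, so each individual read would still need its own timeout to catch a connection that stalls completely.

```python
import time


class TooSlowError(Exception):
    """Raised when a stream falls below the required average rate."""


def read_with_min_rate(stream, min_bytes_per_s, grace_s=1.0,
                       chunk_size=4096, clock=time.monotonic):
    """Read stream to the end, failing if the average rate is too low.

    The rate is only checked after grace_s seconds, so a slow start
    does not abort the transfer immediately.
    """
    start = clock()
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)
        total += len(chunk)
        elapsed = clock() - start
        if elapsed > grace_s and total / elapsed < min_bytes_per_s:
            raise TooSlowError(
                f"average rate {total / elapsed:.1f} B/s is below "
                f"the required {min_bytes_per_s} B/s"
            )
```

With this in place, the 1 B/s connection and the hung connection both end in a failure the caller can act on.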
Many HTTP libraries are silly when it comes to timeout handling. We can find different parameters for the DNS lookup, the initial connection, the exchange of headers, and the document exchange. The one thing that very few provide is what I actually want: a total timeout.
I want to specify an upper limit for the time from when I start the HTTP request to the time the document is fully retrieved. I really don’t care why the request has failed.
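One way to approximate a total timeout on top of per-operation timeouts is to track a single deadline and hand each step only the time that remains. The sketch below does this with a raw socket; Deadline and fetch_raw are hypothetical names, and the request bytes are left to the caller. One caveat I’m aware of: the name lookup inside socket.create_connection is not bounded by the timeout, so this is still not a true total limit.

```python
import socket
import time


class Deadline:
    """Tracks one overall time budget shared by several operations."""

    def __init__(self, total_s, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + total_s

    def remaining(self):
        """Return the seconds left, or raise TimeoutError if none are left."""
        left = self._expires - self._clock()
        if left <= 0:
            raise TimeoutError("total time budget exhausted")
        return left


def fetch_raw(host, port, request, total_s):
    """Send request and read until the peer closes, within total_s overall."""
    deadline = Deadline(total_s)
    sock = socket.create_connection((host, port), timeout=deadline.remaining())
    with sock:
        sock.settimeout(deadline.remaining())
        sock.sendall(request)
        chunks = []
        while True:
            # Each recv gets only what is left of the overall budget.
            sock.settimeout(deadline.remaining())
            chunk = sock.recv(4096)
            if not chunk:
                return b"".join(chunks)
            chunks.append(chunk)
```

Whichever phase is slow, the caller sees a timeout error once the shared budget runs out, which is all I usually want to know.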
If I expect a large document, or am streaming, I’d prefer to give an upper limit to the “negotiations” phase and a bandwidth requirement for the document phase. I’ve unfortunately not seen either of these options in an HTTP library before.
It’s unfortunate that libraries are not designed around failures by default. Perhaps 20 years ago this could have been forgiven, but now that even the most trivial of devices are multitasking and network enabled, it’s just not acceptable. All it takes is one minor hiccup to render some programs completely inoperable.
The simple rule is: if we are waiting for some event, we have to assume that it may never happen. We don’t always need advanced error handling; just some way to fail gracefully is often enough. Simply hanging there doing nothing is rarely helpful.