Error handling should be simple. It can become mired in complexity, but I don’t believe it has to be that way. If we stay focused on fundamentals, what seems like a difficult problem can become much easier to deal with.
When I look at error handling I try to consider only the immediate consequences of the error. Right in the current function, what can I do to handle it? This is an application of modular design. I also try to keep the logic straightforward without too many conditions: reducing complexity minimizes the chances of defects. Together these two fundamental programming principles reduce error handling to three basic scenarios.
Note: I will assume errors can be properly propagated and handled using either exceptions or return values. Which one to choose is not relevant to this discussion.
Unwind
Checking preconditions and validating input to a function is the easiest scenario. This can prevent a minor input problem from becoming a more serious error. Such pre-checks prevent state changes. It is therefore safe to simply return an error to the caller.
[sourcecode]
function calculate( value )
{
    if( value > 10 )
        raise OutOfRange;
    …
[/sourcecode]
In the example above, the check on the ‘value’ variable sends the error up to a higher level. It is reasonable to call this prevention, since we have stopped the error before it could make any changes to our state. The caller may in turn pass the error further up. This repeated passing of an error up the call chain is known as unwinding.
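The same idea can be sketched in Python. This is only an illustration under assumptions: the `Accumulator` class, its `total` field, and the `handle_request` caller are hypothetical names, and the body of `calculate` after the check is invented because the original elides it.

```python
class OutOfRange(Exception):
    """Raised when an input fails the precondition check."""


class Accumulator:
    """Hypothetical object that owns some state (a running total)."""

    def __init__(self):
        self.total = 0

    def calculate(self, value):
        # Validate first: nothing has been modified yet, so raising is safe.
        if value > 10:
            raise OutOfRange(f"value {value} is greater than 10")
        # Assumed body: only reached once the input is known to be valid.
        self.total += value
        return self.total


def handle_request(acc, value):
    # No handler here: an OutOfRange raised by calculate() simply passes
    # through this function to its own caller. That is the unwinding.
    return acc.calculate(value)
```

After a rejected call such as `handle_request(acc, 11)`, `acc.total` holds whatever it held before the call, which is what makes returning the error safe.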
Undo and Unwind
Perhaps some owned data has already been modified. Here the question to ask is whether the previous state can be restored. Perhaps the function has created a temporary object or started to change an internal data structure. Before the function returns it must be able to undo all such changes. So long as state changes can be undone, we can still consider this to be unwinding. The caller cannot see a difference between prevention and undo: in both cases the caller gets an error and is assured the state is still okay.
Consider the following code. It creates a file; however, should an error occur, it can just delete that file. This returns the system to its state prior to the function call.
[sourcecode]
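function generate( name )
{
    //use a distinct name so the local variable does not shadow the 'file' module
    var output = file.create( name );
    if( !produce_output( output ) )
    {
        //undo: remove the partially written file before reporting the error
        output.delete();
        raise Error;
    }
    //no error indicates okay
}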
[/sourcecode]
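A rough Python equivalent might look like the following. It is a sketch under assumptions: `GenerateError` is a hypothetical exception type, and `produce_output` is passed in as a callable that returns a false value on failure. Opening the file with mode `"x"` is my own addition: it makes creation fail if the file already exists, so deleting the file on error really does restore the prior state.

```python
import os


class GenerateError(Exception):
    """Hypothetical error type reported to the caller."""


def generate(name, produce_output):
    # Mode "x" raises FileExistsError if the file is already there. That is
    # a precondition failure: nothing has changed yet, so nothing to undo.
    out = open(name, "x")
    try:
        with out:
            if not produce_output(out):
                raise GenerateError(f"could not produce output for {name}")
    except Exception:
        # Undo: the file is closed by now; remove it so the system is back
        # in the state it had before the call, then keep unwinding.
        os.remove(name)
        raise
```

Whether `produce_output` returns a failure value or raises an exception of its own, the caller sees an error and finds no leftover file.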
Recover and Continue
As the error unwinds it will pass through a variety of functions, or it may even be encoded in messages and dispatched between threads or processes. At some point, however, the unwinding cannot continue. A function will realize that some resource, be it an object, hardware state, network connection, or any mutable thing, is now in an inconsistent state: it is corrupted or faulty. Now we need to decide what we can do about this corrupt resource.
Corruption doesn’t have to mean there is a major problem. Most often, assuming the error handling is robust, the corruption will be minor, or even expected. Consider the following scenarios in which recovery can be done. In each case the goal is simply to recognize the error, deal with it, and get on with the program as though it didn’t happen.
- While copying a file the OS has reported a write error. We can recover by simply deleting the destination file and displaying an error message to the user.
- An HTML document may have mismatched tags. Instead of giving up and unwinding, the parser records a warning, reorders some tags, and continues on its way.
- The user has entered an invalid value into a form. After unwinding the erroneous processing, we highlight the field in red, alert the user, and wait for new input.
- A connection to the server has been lost. We can try a reconnection, and if possible, we simply post pending messages.
- Each tab in a browser has a variety of scripts running. We detect that one script doesn’t appear to be terminating, stop the script, and alert the user (possibly removing the tab afterward).
Two basic patterns emerge: either you actually fix the state or you delete the offending resource. The goal is to rid the system of any broken components, then continue on as before. After recovery the system should be in a fully functional state again with no lingering influences from the error.
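To make the lost-connection scenario concrete, here is a small Python sketch. Everything in it is hypothetical: `Connection` is an in-memory stand-in for a real network connection, and `Client` with its `pending` queue is invented for illustration. It combines both patterns: the broken connection is discarded, and the client state is fixed by posting the pending messages over a fresh one.

```python
class Connection:
    """Hypothetical in-memory stand-in for a server connection."""

    def __init__(self):
        self.alive = True
        self.sent = []

    def send(self, message):
        if not self.alive:
            raise ConnectionError("connection lost")
        self.sent.append(message)


class Client:
    def __init__(self, connect):
        self._connect = connect  # factory that returns a fresh connection
        self.connection = connect()
        self.pending = []

    def post(self, message):
        self.pending.append(message)
        try:
            self._flush()
        except ConnectionError:
            # Recover: drop the broken connection, open a new one, and
            # post whatever is still pending. If this second attempt also
            # fails, the error continues to the caller with the pending
            # queue intact.
            self.connection = self._connect()
            self._flush()

    def _flush(self):
        # A message leaves the queue only after it was sent successfully.
        while self.pending:
            self.connection.send(self.pending[0])
            self.pending.pop(0)
```

After a successful `post`, nothing in `Client` remembers that a connection was ever lost, which is the test for a complete recovery.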
Let’s take a look at this in a bit more detail. In the code below we are processing a message stream which produces files as output artifacts. Many files are processed in parallel, and the processing of any single file may stay active for an arbitrary length of time. Each incoming message requires some operation to be performed and the results to be written to the file. The resulting file is only valid if all message actions are performed correctly.
[sourcecode]
function process_stream_message( msg )
{
    var object;
    //create new or get existing object
    if( object_map.exists( msg.id ) )
    {
        object = object_map.get( msg.id );
    }
    else
    {
        object = new object_type;
        object_map.insert( msg.id, object );
    }
    var result = object.perform_action( msg.action );
    if( result is error )
    {
        object.file.delete();
        object_map.remove( msg.id );
        delete object;
        report_to_user( result );
    }
    else if( msg.final )
    {
        object.file.close();
        object_map.remove( msg.id );
        delete object;
    }
}
[/sourcecode]
Let’s say that if an error occurs in the ‘object.perform_action’ code, our ‘object’ is now in an inconsistent state. To recover, we remove the generated file and delete the object. Our state is now consistent as memory of the error has been completely erased. This is what recovery actually entails: getting back to a fully functional state and completely ending the life of the error itself. After this point the code behaves as though the error had never occurred. If the error still has any lingering effects we can’t say that we have successfully recovered.
The Difference
At a glance, this example may look similar to the unwinding scenario. However here we have deleted an object that was potentially used across multiple function calls. Thus the state after deleting the object is not the same as before the function call (the object exists before, not after).
[sourcecode]
System State A
call function foo
System State B
[/sourcecode]
In normal operation (no errors) we expect that the call to ‘foo’ will alter the system state: state B will not be the same as state A. What we’ve discussed so far are the two possibilities when an error does occur: if ‘foo’ is able to unwind then state A will be equivalent to state B. Other than an error indicator of some kind, it will be as though ‘foo’ had not been called at all. However, if the error handling involves recovery, something has to change: state B will be different from state A.
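The state difference can be observed directly in a stripped-down Python sketch of the message processor. The names `StreamObject`, `ActionError`, and `reported` are hypothetical, the output file is left out entirely, and the string `"bad"` is just an arbitrary trigger for a failure.

```python
class ActionError(Exception):
    """Hypothetical error raised when an action cannot be performed."""


class StreamObject:
    """Hypothetical stand-in for the per-id object; the file is omitted."""

    def __init__(self):
        self.actions = []

    def perform_action(self, action):
        if action == "bad":
            raise ActionError("cannot perform action 'bad'")
        self.actions.append(action)


object_map = {}
reported = []


def process_stream_message(msg_id, action, final=False):
    obj = object_map.get(msg_id)
    if obj is None:
        obj = StreamObject()
        object_map[msg_id] = obj
    try:
        obj.perform_action(action)
    except ActionError as error:
        # Recovery: drop the offending object and tell the user.
        del object_map[msg_id]
        reported.append(str(error))
        return
    if final:
        del object_map[msg_id]


process_stream_message(7, "write header")
state_a = set(object_map)  # ids known before the failing call
process_stream_message(7, "bad")
state_b = set(object_map)  # ids known after recovery
```

Here `state_a` contains the id 7 and `state_b` does not. The failing call did not leave things as it found them; it moved the system to a different, but still consistent, state.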
Cases where the state changes (recovery) are inherently more difficult to program. A conscious decision has to be made about what recovery means. Indeed it may become an item of contention within the team since there is not always an obvious optimal solution. In contrast, no such decision has to be made during unwinding: either it is done correctly or not. Recovery is subjective, and precisely how it is done affects the value of the system.
Corruption Propagation
What about situations where only non-local recovery is possible? That is, we’ve identified a corrupted state but are unable to correct it in the current function. Consider the previous ‘process_stream_message’ function. It calls the function ‘object.perform_action’, which may result in a stack of functions being called (displayed from the most recently called down):
[sourcecode]
object::modify_state
object::action_type_a
object::perform_action
process_stream_message
[/sourcecode]
In many cases, especially with a bit of attention to error handling design, an error in ‘modify_state’ can unwind all the way to ‘process_stream_message’. This top-most function then does recovery by removing the object from the system. However, what if state corruption happens at one of the intermediate functions and they have no ability to recover?
Say for example that ‘action_type_a’ makes multiple calls to ‘modify_state’. If the second of these calls fails, it cannot unwind and does not know how to recover. In this case we can allow a corruption propagation to take place. Unlike unwinding, every object along the way is left in a broken state. Nothing inside ‘object’ itself knows how to correct the problem, so it just passes along this corruption error. When we reach ‘process_stream_message’ we finally know how to recover. We can do so by deleting ‘object’ along with any resources it was using.
Keep in mind that the details of such corruption propagation are primarily internal. ‘process_stream_message’ doesn’t care about the source of the corruption error. When handling the error it just worries about its immediate situation. Internal to ‘object’, however, we actually need an error system capable of propagating or tracking the corruption. The propagation of this corruption must be quite limited though. The further it travels up the stack, the more objects become tainted and the less likely a successful recovery becomes.
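One possible way to track corruption inside the object is a separate error type plus a flag, sketched here in Python. The `CorruptionError` and `ActionError` types, the `corrupt` flag, and the two-argument form of `action_type_a` are all assumptions made for this illustration, as is the premise that the first change cannot be undone.

```python
class ActionError(Exception):
    """Ordinary failure: the callee left its state untouched."""


class CorruptionError(Exception):
    """The object is in an inconsistent state and must not be reused."""


class StreamObject:
    def __init__(self):
        self.state = []
        self.corrupt = False

    def modify_state(self, item):
        if item is None:
            # Checked before any change, so this failure simply unwinds.
            raise ActionError("nothing to store")
        self.state.append(item)

    def action_type_a(self, first, second):
        self.modify_state(first)  # failure here: nothing changed yet
        try:
            self.modify_state(second)
        except ActionError as error:
            # The first change is applied and (by assumption) cannot be
            # undone, so mark the object and pass the corruption upward.
            self.corrupt = True
            raise CorruptionError("action half applied") from error

    def perform_action(self, first, second):
        if self.corrupt:
            raise CorruptionError("object is already corrupt")
        self.action_type_a(first, second)


def process_stream_message(object_map, msg_id, first, second):
    obj = object_map.get(msg_id)
    if obj is None:
        obj = StreamObject()
        object_map[msg_id] = obj
    try:
        obj.perform_action(first, second)
    except (ActionError, CorruptionError) as error:
        # The top level does not care which kind it was: it recovers by
        # discarding the object and returns the error for reporting.
        del object_map[msg_id]
        return error
    return None
```

The distinction between the two error types matters only inside `StreamObject`; the top-level function treats both the same way, in line with the point that it only worries about its immediate situation.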
Abandon and Shutdown
We can’t assume that recovery is always possible. Though it should be rare, it is possible to get into a corrupted state where there appears to be no way out, or the cost of recovery may be too high. In such cases we can consider abandoning the program. This usually means shutting down but it can refer to any controlled loss of functionality (which is really just shutting down some particular module of a system). In any case, continued normal operations are simply not possible. There will be no recovery.
Here I’m not speaking of an abrupt termination, but rather a controlled shutdown. If the code is actually capable of detecting an advanced error state, it should also know how to properly end the life of the program. This is a situation in which we cannot look at the local function or module anymore. We have no choice but to consider the entire program. To properly clean up and shut down, we must be aware of the global state and have some kind of control mechanism. This control mechanism may either be a global function we call, or a special error propagation that indicates abandonment. The difficulty in coming up with such a mechanism should be a strong hint that unwinding and recovery tend to be the preferred option.
One key aspect of a controlled shutdown is proper error and state reporting. Even if the actual shutdown is trivial, we need to make sure that the correct information is recorded so we can find out what happened and debug the problem. If the program terminates incorrectly, it may become impossible, or at least extremely difficult, to find out how to prevent the problem from occurring again.
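As one possible shape for the "special error propagation" variant, here is a Python sketch. The `AbandonError` type, the `run` function, and the list of cleanup callables are hypothetical; a real program would need to decide for itself what global state has to be released and reported. Only the abandon path is shown.

```python
import logging

logger = logging.getLogger("app")


class AbandonError(Exception):
    """Hypothetical signal that normal operation cannot continue."""


def run(step, cleanups):
    """Run one unit of work; on abandonment, report and shut down in order.

    Returns a process exit code: 0 on success, 1 after a controlled shutdown.
    """
    try:
        step()
        return 0
    except AbandonError:
        # Record what happened first, including the traceback, so the
        # problem can be investigated afterwards.
        logger.critical("abandoning: unrecoverable state", exc_info=True)
        # Release global resources in reverse order of acquisition. A
        # failing cleanup is logged but must not stop the remaining ones.
        for cleanup in reversed(cleanups):
            try:
                cleanup()
            except Exception:
                logger.exception("cleanup step failed during shutdown")
        return 1
```

The single top-level handler is what gives the program a global view: only at this point is it known which resources exist and in what order they should be released.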
Conclusion
As we progress from one step to the next we see an increase in the amount of work involved. Simple unwinding involves the least amount of effort. If one needs to undo, the amount of work increases. For this reason it makes sense to front-load all of the data checks and sub-actions before changing any internal state. The longer such internal modifications are postponed, the easier it is to unwind. At the point where we need actual recovery, the amount of work needed relates to how difficult it is to recover. If all one needs to do is to delete the offending resource, then it’s simple. It makes sense to think about this while designing modules to make recovery as easy as possible.
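Front-loading can be as simple as computing every result into a local variable and only then touching the real state. The following Python sketch assumes a hypothetical inventory stored as a dictionary of counts; the function name and the rule that counts may not go negative are invented for the example.

```python
def apply_updates(inventory, updates):
    """Apply all count changes, or none of them."""
    # Phase 1: validate and compute everything into a local dictionary.
    # Any failure here unwinds trivially because inventory is untouched.
    staged = {}
    for name, delta in updates.items():
        new_count = inventory.get(name, 0) + delta
        if new_count < 0:
            raise ValueError(f"not enough {name}")
        staged[name] = new_count
    # Phase 2: commit. By now every check has already passed.
    inventory.update(staged)
```

Had the function modified `inventory` inside the loop, a failure on the second item would have required undoing the first.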
While abandoning seems like it might be quite cheap to implement, its trade-offs are in system instability and lack of error information. This might make the system difficult to work with in any number of modalities: for the end-users, administrators, and developers. I try to avoid getting into such situations, preferring recovery wherever possible. In the cases where it does happen, I try to conduct a sane termination procedure, providing appropriate logging, and giving decent user feedback.
Those are the three basic error handling methods: unwind/undo, recover, and abandon/shut down. In practice unwinding should be the most commonly used approach, accounting for the vast majority of cases, with only a tiny fraction requiring undo. As all unwinding must end, there will also be a significant amount of recovery. Recovery should be straightforward and should avoid complex restoration schemes if at all possible. Abandoning the program should be a last resort. If it’s necessary, make sure to put serious thought into a robust and correct implementation.