Are our discussions about errors focusing on the right part of the problem? We tend to argue about what exceptions mean, or how return values are messy. But when I look at a lot of code, actually capturing the error isn’t the most difficult part. It’s everything surrounding it.
What comes before
A typical imperative function does several things in sequence. If one of these things goes wrong we have to worry about the side-effects of what came before. For example, with a DB we might want to undo the changes we have already made.
    db.start_transaction()
    db.insert_row(...)
    db.insert_row(...)
    db.update_row(...)

    !error_detected
        db.rollback()
I’ve avoided using either an exception or a return code to highlight the critical piece of this code, the db.rollback() statement. This defines how this piece of code handles errors. What particular error happened, or how we propagate to our caller, are secondary details.
Transactional databases are easy to handle. The start / rollback / commit structure is good at dealing with errors. If something goes wrong we simply abandon the transaction. It’s like we never modified the DB at all. We’ve managed to avoid any side-effects in the face of an error.
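As a rough illustration of that structure, here is a minimal Python sketch using the standard library’s sqlite3 module. The items table, its UNIQUE constraint, and the insert_pair helper are assumptions made up for this example; they do not come from any particular application.

```python
import sqlite3

def insert_pair(conn, first, second):
    """Insert two rows as one unit; roll back both if either insert fails."""
    try:
        conn.execute("INSERT INTO items (name) VALUES (?)", (first,))
        conn.execute("INSERT INTO items (name) VALUES (?)", (second,))
    except sqlite3.Error:
        conn.rollback()  # abandon the transaction: no side-effects remain
        raise
    conn.commit()

# Hypothetical in-memory table, used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT NOT NULL UNIQUE)")
conn.commit()

insert_pair(conn, "a", "b")
try:
    insert_pair(conn, "c", "a")  # the second insert violates UNIQUE
except sqlite3.IntegrityError:
    pass

rows = [r[0] for r in conn.execute("SELECT name FROM items ORDER BY name")]
```

After the failed call only the rows from the first call remain. The row "c" was inserted and then discarded by the rollback, as if the second call had never touched the table.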
Avoiding side-effects is clearly the best approach to handling errors. If we simply have nothing to clean up then we can just return the error to the caller and not worry about it. The idea of side-effect free programming is perhaps the defining feature of several of the “functional” languages. Ultimately though even programs in those languages have side-effects at some point, and must also deal with errors.
Most APIs have no rollback
The problem is that most APIs are not designed with error handling in mind. Or rather, it’s unclear how we even design general purpose APIs that make error handling easy. Consider a very simple scenario where we wish to add the result of two calculations to two different vectors.
    defn store_results = ( x ) -> {
        var a = calculate_a(x)
        vector_a.add(a)

        var b = calculate_b(x)
        vector_b.add(b)
    }
There’s a problem if an error happens in calculate_b: we’ve already added the result to vector_a. We could of course just remove it again. The error-conscious programmer might write the function this way instead:
    defn store_results = ( x ) -> {
        var a = calculate_a(x)
        var b = calculate_b(x)

        vector_a.add(a)
        vector_b.add(b)
    }
It’s still possible that adding to vector_b fails, though that is perhaps less likely. Perhaps just removing the result from vector_a is the correct approach?
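One way to read that last suggestion, sketched in Python rather than the pseudocode above: do both calculations first, then undo the first append if the second one fails. The FullVector class is a hypothetical stand-in used only to force a failure, and the extra parameters are assumptions added to keep the example self-contained.

```python
def store_results(x, vector_a, vector_b, calculate_a, calculate_b):
    # Do all the work that can fail without side-effects first.
    a = calculate_a(x)
    b = calculate_b(x)

    vector_a.append(a)
    try:
        vector_b.append(b)
    except Exception:
        vector_a.pop()  # compensate: remove the value we just added
        raise

class FullVector(list):
    """Hypothetical container whose append always fails."""
    def append(self, item):
        raise MemoryError("vector is full")

vector_a, vector_b = [], FullVector()
try:
    store_results(3, vector_a, vector_b, lambda x: x + 1, lambda x: x * 2)
except MemoryError:
    pass  # both vectors are back in their original state
```

This works only because append has an obvious inverse. Whether popping is truly correct still depends on what else might have observed vector_a in between.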
Irreversible side-effects
Even in my simple vector example I already have questions about how to recover from an error. What happens as the complexity of my code increases? Even worse, what if the operations I’m doing can’t be undone at all?
    defn append_results = ( file, a ) -> {
        file.write(a.x)
        file.write(a.y)
        file.write(a.z)
    }
What happens when something fails in the second or third statement? Assume that it’s the field access to a, and not an underlying file system failure. If we can’t complete the write operation, and can’t undo what we’ve written before, we’ve just corrupted the file.
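If the failure really is in the field access, one partial remedy is to evaluate every field before touching the file. A minimal Python sketch, assuming the fields are strings and using an in-memory io.StringIO in place of a real file; the Partial class is hypothetical and exists only to trigger the failure.

```python
import io

def append_results(file, a):
    # Read every field up front, so a failed access happens
    # before anything has been written.
    record = a.x + a.y + a.z
    file.write(record)

class Partial:
    """Hypothetical record that is missing its z field."""
    x = "1"
    y = "2"

buffer = io.StringIO()
try:
    append_results(buffer, Partial())
except AttributeError:
    pass  # nothing was written, so there is nothing to undo
written = buffer.getvalue()
```

This does nothing for a write that itself fails midway, which is the harder case.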
On the filesystem itself this type of problem tends to be addressed with journalling. We first write a marker saying we’re updating a file, then write the new data, then mark the operation as complete. Any step can fail without corrupting the filesystem. This is how the integrity of our folder structures is maintained even when our computer crashes. Unfortunately this type of design rarely makes it to high-level APIs.
There are many operations that have side-effects that are hard to reverse, or simply irreversible. If it’s critical that our data does not get corrupted we actually have to design the protocol and file formats around errors. It’s completely irrelevant how we’ve detected and caught an error if we have no way to deal with it.
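The journalling idea can be borrowed at the application level. The following Python sketch is an assumption about how one might do it, not a description of any real filesystem: each record is wrapped in hypothetical BEGIN and COMMIT marker lines, and the reader ignores any record that never reached its COMMIT.

```python
import io

BEGIN, COMMIT = "BEGIN", "COMMIT"  # hypothetical marker lines

def append_record(log, fields):
    """Write one record between markers. May stop partway on error."""
    log.write(BEGIN + "\n")
    for field in fields:
        log.write(f"{field}\n")
    log.write(COMMIT + "\n")

def read_committed(log_text):
    """Return only records that were fully written.

    Assumes field values contain no newlines and never equal a marker.
    """
    records, current = [], None
    for line in log_text.splitlines():
        if line == BEGIN:
            current = []  # a new record starts; drop any unfinished one
        elif line == COMMIT:
            if current is not None:
                records.append(current)
            current = None
        elif current is not None:
            current.append(line)
    return records

def failing_fields():
    """Simulates an error after the first field has been written."""
    yield "4"
    raise ValueError("field unavailable")

log = io.StringIO()
append_record(log, ["1", "2", "3"])
try:
    append_record(log, failing_fields())
except ValueError:
    pass  # the log now holds a record with no COMMIT marker
append_record(log, ["7", "8", "9"])

committed = read_committed(log.getvalue())
```

The half-written record is still physically in the log, but the format lets a reader recognise and skip it. The error handling lives in the file format rather than in the code that detected the failure.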
Language failings
Perhaps this is why discussions of error handling are fraught with emotion and uncertainty. It’s not the error detection itself we’re struggling with, but the error recovery. Features like destructors and finally help clean up resources, but do little to help with the irreversible side-effects. Seriously, if I’ve just corrupted my data store it’s of very little consolation that the file handle and memory will be cleaned up.
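A small Python sketch of that gap, using a with block (which plays the role of finally here) and a temporary file. The dictionary record and its missing z key are assumptions chosen to force the failure.

```python
import os
import tempfile

def append_results(path, a):
    # The with block guarantees the handle is closed, even on error.
    with open(path, "a") as f:
        f.write(a["x"])
        f.write(a["y"])
        f.write(a["z"])  # raises KeyError when z is missing

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "results.txt")
    try:
        append_results(path, {"x": "1", "y": "2"})
    except KeyError:
        pass
    with open(path) as f:
        leftover = f.read()  # the half-written record is still there
```

The handle is released correctly, yet the file now holds two of the three fields.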
It’s an uncomfortable truth that we just don’t know the best way to deal with errors. Often we don’t even know a good way to deal with them. Each bit of code ends up requiring a special approach, or even a requirements change. Properly handling errors requires that they be considered throughout the entire design process, not tacked on later.
Even if we’re error conscious we have a lot of libraries to deal with that aren’t error friendly. Our languages are also severely lacking, providing only some baseline facilities. As long as dealing with errors is cumbersome or difficult we’re going to keep writing code as we do now: assuming errors won’t happen.
I don’t know what the solution is, but it has to be a systemic one. We can’t just add some keywords and error facilities to a language and expect it to be enough. Error handling must be woven into the fabric of every feature, every API, and every library.