Antiquated Error Handling: abort/exit

The abort function is a remnant of old programming practices, and it pains me to see it in modern software. While it’s great that a program detects an error, calling abort is equivalent to surrendering. It’s stating the program is unable to figure out how to deal with the error correctly. Calling exit from the middle of a program is the same thing: prematurely aborting the program. My criticisms are thus not strictly about abort, but any related method that abruptly terminates the program.

Loss of information

Abort statements do not carry a lot of useful information with them. Whatever context the program had, such as a stack trace and memory state, is lost once abort is called. The common case of abort, through an assert statement, does produce a single line of information. Most programs are complex enough that single statements like these become meaningless: while such a statement may clearly indicate the immediate error condition, lacking context prevents one from finding the root cause of the problem.

I’m not forgetting that abort can result in a core dump (you need to turn them on). You could use a debugger to load this core and get information about the program: probably a stack trace and sometimes, if you’re lucky, you can investigate bits and pieces of the memory. A core dump is of course better than nothing. However, a structured error handling mechanism can produce more relevant details and will generate a more useful error report. Having to resort to always using the debugger to get such information also slows down the development process considerably.
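As a small aside on "turning them on": on POSIX systems a process can raise its own core-file size limit before anything goes wrong. The sketch below assumes a POSIX platform, and `enable_core_dumps` is a hypothetical helper name. Whether a core file is actually written still depends on system configuration outside the program, and the hard limit may itself be zero.

```cpp
#include <sys/resource.h>

// Raises the soft core-file size limit to the hard limit for this process.
// Returns true if both the query and the update succeeded. POSIX only.
bool enable_core_dumps() {
    struct rlimit limit;
    if (getrlimit(RLIMIT_CORE, &limit) != 0) {
        return false;
    }
    limit.rlim_cur = limit.rlim_max;  // a process may always raise soft to hard
    return setrlimit(RLIMIT_CORE, &limit) == 0;
}
```

Calling this once at startup is roughly the in-process counterpart of raising the limit in the shell before launching the program.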

I have begun using LLVM as the last compilation stage for my Cloverleaf language. It uses a lot of assertions to detect errors. While I appreciate that the errors do get detected, the resulting abort is at the very least a serious annoyance. Were I able to correlate errors back to the source code I was compiling, it would be a lot easier to diagnose and correct problems. Instead I’m presented with generic error messages which reveal almost nothing about what I did wrong.
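To make the point about correlating errors with the input concrete, here is a minimal sketch of an error that carries the location of the offending input along with the message. Everything in it is hypothetical (`SourceLocation`, `CompileError`, `check_operand_types`, the file name `main.leaf`); it is not LLVM's API, only an illustration of what a structured error can carry that a bare assert cannot.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical location in the user's input, not in the compiler's own code.
struct SourceLocation {
    std::string file;
    int line;
    int column;
};

// An error that records which piece of user input triggered the failed check.
class CompileError : public std::runtime_error {
public:
    CompileError(const SourceLocation& where, const std::string& what)
        : std::runtime_error(format(where, what)), where_(where) {}

    const SourceLocation& where() const { return where_; }

private:
    static std::string format(const SourceLocation& w, const std::string& what) {
        std::ostringstream out;
        out << w.file << ':' << w.line << ':' << w.column << ": " << what;
        return out.str();
    }
    SourceLocation where_;
};

// Hypothetical check: operand types of a binary operation must match.
// An assert(lhs_type == rhs_type) here would report only this function's
// file and line; the exception also says where in the input the problem is.
void check_operand_types(const std::string& lhs_type,
                         const std::string& rhs_type,
                         const SourceLocation& where) {
    if (lhs_type != rhs_type) {
        throw CompileError(where, "operand type mismatch: " + lhs_type +
                                      " vs " + rhs_type);
    }
}

// Returns the diagnostic produced for a mismatched pair, or "" if none.
std::string diagnose_example() {
    try {
        check_operand_types("i32", "f64", SourceLocation{"main.leaf", 12, 7});
    } catch (const CompileError& e) {
        return e.what();
    }
    return "";
}
```

The caller that catches `CompileError` can print the message, collect several of them, or attach further context before reporting, none of which is possible once the process has been aborted.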

Unable to handle the error

It may seem obvious, but a very distinct problem with abort is that it does exactly what it says: it aborts the program. This makes it impossible to recover. Error conditions where recovery is not possible are few and far between. There are a handful of such cases, but even fairly extreme scenarios like a system fault, resource exhaustion, or even programmer error can be handled gracefully. Calling abort is a unilateral decision that a given error condition is more important than every other line of code written on the project.

On a previous project I used Google Protobuf for some of our serialization. That library would log to the console and abort when it detected errors. Our program was built to deal with errors, runtime and programmatic alike. We could easily recover from such errors if only given the chance. These calls to abort placed huge wildcards into the stability of our application. I was grateful that the library was eventually modified at our request to throw an exception instead.
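A rough sketch of what "given the chance" looks like in practice follows. The decoder and its length-prefix format are invented for illustration and have nothing to do with Protobuf's actual API; `DecodeError`, `decode_message`, `BatchResult`, and `process_batch` are all hypothetical names. The point is only that a library which throws lets the application decide that one bad message should not take down the whole process.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical library error type; not part of any real serialization library.
class DecodeError : public std::runtime_error {
public:
    using std::runtime_error::runtime_error;
};

// Hypothetical decoder: the first byte is a length prefix for the payload.
// On malformed input it throws instead of logging and calling abort().
std::string decode_message(const std::vector<unsigned char>& bytes) {
    if (bytes.empty()) {
        throw DecodeError("empty buffer");
    }
    const std::size_t length = bytes[0];
    if (bytes.size() - 1 < length) {
        throw DecodeError("truncated payload");
    }
    return std::string(bytes.begin() + 1, bytes.begin() + 1 + length);
}

// Application side: a bad message is counted and skipped; the rest of the
// batch is still processed.
struct BatchResult {
    std::vector<std::string> decoded;
    int rejected = 0;
};

BatchResult process_batch(const std::vector<std::vector<unsigned char>>& batch) {
    BatchResult result;
    for (const auto& bytes : batch) {
        try {
            result.decoded.push_back(decode_message(bytes));
        } catch (const DecodeError&) {
            ++result.rejected;  // recover: drop this message, keep running
        }
    }
    return result;
}
```

Had `decode_message` called abort, the first malformed message in a batch would have ended the program and the remaining messages would never have been looked at.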

Conclusion

It is good to detect errors and prevent them from propagating. The correct approach is to return an error or raise an exception. In the case of LLVM one might argue that their library doesn’t allow error codes at this point, and exceptions aren’t used, so they have no choice but to call abort. However, I see this as a significant flaw in an otherwise very useful suite of tools. Libraries should be designed to deal with errors properly. Aborting a program is simply not a reasonable form of error handling.
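For libraries that do not use exceptions, returning an error is the other option mentioned above. Here is a minimal sketch under that assumption; `ParseError`, `ParseResult`, `parse_percentage`, and `percentage_or_default` are hypothetical names chosen for the example.

```cpp
#include <string>

// Hypothetical error codes for a tiny parsing routine.
enum class ParseError { None, Empty, NotANumber, OutOfRange };

// Value plus error code; the caller must check ok() before using value.
struct ParseResult {
    int value;
    ParseError error;
    bool ok() const { return error == ParseError::None; }
};

// Parses a decimal percentage in the range 0..100.
// Every failure is reported to the caller; nothing terminates the program.
ParseResult parse_percentage(const std::string& text) {
    if (text.empty()) {
        return {0, ParseError::Empty};
    }
    int value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') {
            return {0, ParseError::NotANumber};
        }
        value = value * 10 + (c - '0');
        if (value > 100) {
            return {0, ParseError::OutOfRange};  // also keeps value from overflowing
        }
    }
    return {value, ParseError::None};
}

// The caller decides what a failure means; here it falls back to a default.
int percentage_or_default(const std::string& text, int fallback) {
    const ParseResult result = parse_percentage(text);
    return result.ok() ? result.value : fallback;
}
```

The routine detects the same conditions an assert would, but leaves the policy (retry, default, report, or shut down) to the code that has enough context to choose.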

13 replies

  1. I think the abort paradigm is used primarily (caveat: by decent developers) for unexpected states, where all defensive programming has failed, and somehow I am here … now what do I do? If I carry on, after something odd, and unknown, has happened, I will most likely corrupt my data or the user experience, as the unknown state flows through my app. Log, apologise to user, and abort.

    • If this is how I was seeing it used, I wouldn’t have a problem. I also don’t see where such situations can exist in most programs. You always have the option of returning from a function before you corrupt the data (or throwing an exception). Even though I said there are rare exceptional cases, I was unable to think of any as an example. Can you think of a scenario where “abort” is truly the correct option?

  2. My experience taught me that the possible meanings of the word “error” are too diverse for us to be able to make generic statements about handling *any* error. Different kinds of errors may require different kinds of handling.

    When we run out of resources, I agree, throwing an exception would be preferable (although one should be cautious about whether the exception objects themselves require resources). There may be a way to recover from this situation. I guess error codes would also do for this purpose, although they are not my preference.

    When we encounter unexpected user input, again, I agree an exception or an error code is better. Recovery exists, or at least may exist.

    But when it comes to detecting a programmer’s bug in the code, when we suspect that it is the program itself that is causing the trouble because it does the opposite of what it was designed for (this is what assertions are supposed to detect), abrupt termination of the program may be just the right thing.

    I found it useful in the application I was developing. We work on an application that can be considered life-critical but is not real-time. It is used to perform weight and balance computations for passenger flights. If the program crashes, there is not much harm done: users can either restart or go to manual computations. However, if this program produces incorrect information about an aircraft’s balance, weight, and other safety conditions, the consequences may be disastrous. In many places in the code we do safety checks (for programmer errors). If we find that the program has entered an invalid state, we choose to generate a core dump and abort. We do not want to throw an exception, because we do not want an application that has entered an invalid, uncontrolled state to continue running. We need it to be killed before it causes any harm. We do not believe you can return from an invalid state into a valid state. Also, a core dump is essential, because the application is running on the customer’s machine and this is the only way we can get information about the context of the problem.

    Regards,
    &rzej

    • I don’t think one can draw a clean distinction between a programmer’s bug and a fault. Minor faults (like invalid user input) can easily lead to an error as the programmer has not correctly handled them. These types of transitive errors make it too difficult to separate the two classes to be of any real value.

      Detecting corruption of global state I consider an entirely distinct scenario. It is also one of those scenarios where “abort” may be acceptable. In this scenario you are detecting a present error, not a potential error. That is, you have detected an error after the fact (or have detected a system fault). If the error is unanticipated and has no corrective reaction, then yes, a critical failure is appropriate (though, depending on the error, I would prefer an exception which, while still terminating the process, provides an improved information trail).

    • “I don’t think one can draw a clean distinction between a programmer’s bug and a fault. Minor faults (like invalid user input) can easily lead to an error as the programmer has not correctly handled them. These types of transitive errors make it too difficult to separate the two classes to be of any real value.”

      Perhaps there exist some cases where a clear distinction cannot be made. I would not like to make overly generic statements. But in the case of user input the distinction appears clear, at least to me. Invalid user input is what I understand you call a “fault”, that is, some unusual condition that we do not prefer but should be prepared for. However, not being prepared for invalid user input and letting it pass into the guts of the program unchecked is a programmer’s error (or a ‘bug’). This could be generalized: not being prepared to handle an unusual condition in the environment (bad user input, resource exhaustion) is a programmer’s error. Not clearly separating the two different situations in the code is also a deficiency in the program, although you can hardly call this a bug yet. Things like Contract Programming or simple asserts help make the distinction clear. To some extent it is the decision of the programmers whether they want a clear distinction between a programmer’s error and an unusual condition (a fault).

      “If the error is unanticipated and has no corrective reaction, then yes, a critical failure is appropriate (though, depending on the error, I would prefer an exception which while still terminating the process, provides an improved information trail).”

      It looks like we differ in our opinion on what exceptions can offer. I know that in languages like Java exceptions offer a stack trace that includes info on every function call in the call stack. I do not use Java in practice though. I use C++ exceptions, and they offer only the information that you yourself put inside. And while this is enough information to recover from an exceptional situation, it doesn’t give the programmer as much information as a core dump. A “core dump” (at least the one we are generating) includes information about the state of every single variable in the program. Exceptions cannot offer this: exception objects would be too huge. Also, a core dump shows us the state of the program immediately after we detected the bug. In contrast, unwinding the stack while throwing an exception changes the call stack, so at the catch site we do not have full information about the program state at the moment the bug was observed: some information has already been lost.

      Another thing to consider when deciding whether to throw an exception (or return an error code) or to halt the program is whether the program should be allowed to recover from the situation we are facing. If we suspect recovery may exist (this is usually the case for “faults”), throwing is a better choice. Even if no reasonable recovery exists but we see there is no harm in trying to recover (this is the situation where the program lacks a critical system resource and cannot perform its basic operations), we should prefer to throw. But in situations where we know up front that no recovery exists and letting the program run risks damage (these are the cases where we detect that the program is doing things counter to its specification), termination appears to be the preferable action.

      Regards,
      &rzej

    • You make some good points so I’m going to try and approach some of these issues in a few more articles. I don’t think I can respond adequately in this reply space.

  3. I think the canonical case for abort is kernel panic, albeit even this is something that can vary a lot depending on the particulars of the kernel architecture…

    • The basic principle seems pretty straightforward: the decision to abort due to an error condition should be made at the highest level possible. In particular, if one is writing an API for external consumption, such as LLVM in the example, having it abort your program for you seems unjustified. Even in the case of a life critical application, one can design parts of the functionality as components. If a given component encounters an error, it may make sense to pass that error up to a higher level of the application. That could give the application a chance to gracefully shut down the component and perhaps initialize a new one. Obviously what should happen depends on the specific situation, but that seems like a reasonable convention for most cases.

    • Yes, a modular design requires that only the uppermost level can make the decision about abort. The modules themselves should not do this.

    • Andrzej does have a valid point though. There are always going to be cases where the whole application should be shut down as soon as an invalid internal state is detected, for fear of the higher level code not doing the right thing. However, I think this level of “paranoia” should only ever be considered when an app truly is life critical, as in his example.

    • Be sure to note the difference between “shutdown” and “premature termination”. Without a doubt some systems will encounter conditions where immediate shutdown is required. In virtually all of these cases a defined procedure must be followed, and “abort” is not likely to be a valid procedure.

    • @levinv: Let me add a couple of comments on top of what you said.
      True, the component that spots the state corruption may not be (and usually is not) in the right position to decide what to do. However, this problem is not necessarily solved by passing the information higher and higher. An alternative is for all the components in your system to call a callback function when they detect state corruption. This callback is defined globally by the application in one place and controlled by the person who assembles the whole program from the components and can correctly assess what action to take. This solution is adopted by Boost.Assert: their libraries call the custom function, but you decide what this function does.

      “Even in the case of a life critical application, one can design parts of the functionality as components.”

      You can also consider a C++ program one such component. It detects a failure and calls std::terminate (a better practice than calling std::abort), whose handler can release system-critical resources and possibly request a re-launch of the program. In this sense, calling std::abort or std::terminate is not the end of the program: it can be auto-restarted in a second. And it gives you the opportunity to reset every object (including globals) of the program state. (I realize not every system can afford this.)

      I would be anxious about just throwing an exception when we detect state corruption, because stack unwinding calls a number of functions (destructors, scope guards), and these functions are likely to rely on the application’s valid state.

  4. It seems there is indeed enough discussion here to warrant a follow-up article or two. :) These could include things like: when it is okay or not okay to exit a program due to an error condition; strategies for safely exiting a program due to an error (including responsibility, when an exception is not a safe way to propagate or handle errors, how much state can be preserved, etc., as per Andrzej’s comments); and possibly also an article about such issues when some form of concurrency is involved.
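The callback idea raised in the comments (components report a failed check to a single hook, and the application decides what that hook does) can be sketched roughly as follows. This is a hand-rolled illustration and not Boost.Assert's actual interface; `FailureHandler`, `set_failure_handler`, `CHECK_STATE`, and the other names are hypothetical, and the sketch ignores thread safety.

```cpp
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <string>

// Signature of the application-wide hook; all names here are hypothetical.
using FailureHandler = void (*)(const char* expr, const char* file, int line);

// Default policy: report and abort, much like a plain assert.
inline void abort_on_failure(const char* expr, const char* file, int line) {
    std::cerr << file << ':' << line << ": check failed: " << expr << '\n';
    std::abort();
}

// Storage for the currently installed handler (not thread-safe).
inline FailureHandler& current_handler() {
    static FailureHandler handler = abort_on_failure;
    return handler;
}

// Called once by whoever assembles the program from its components.
// Returns the previously installed handler.
inline FailureHandler set_failure_handler(FailureHandler handler) {
    FailureHandler previous = current_handler();
    current_handler() = handler;
    return previous;
}

// Components use this macro and never choose the policy themselves.
// The installed handler is expected not to return normally.
#define CHECK_STATE(cond) \
    ((cond) ? (void)0 : current_handler()(#cond, __FILE__, __LINE__))

// Alternative policy an application might install: throw instead of abort.
inline void throw_on_failure(const char* expr, const char*, int) {
    throw std::logic_error(std::string("check failed: ") + expr);
}

// Example component code.
inline int checked_divide(int a, int b) {
    CHECK_STATE(b != 0);
    return a / b;
}
```

An application that wants the terminate-immediately behaviour keeps the default handler, while one that is built to recover calls `set_failure_handler(throw_on_failure)` at startup. Either way the component code stays the same.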
