
Of Faults and Errors: Who’s to blame?

Is there a difference between a “fault” and an “error” within a computer program? A number of people commented on my previous article suggesting this to be the case. I will try to demonstrate in this article that this distinction is generally not relevant in practice. First, what is an “error” and what is a “fault”? A fault is deemed to be something that happens in the environment, say a lost network connection or damaged memory. Invalid input by a user is also a fault, though it’s often considered a special class. The conventional definition of an error is a mistake made by the programmer, such as passing the wrong parameters to a function or forgetting to initialize a data structure.

Common Example

Below is a small fragment of code. It represents a very common pattern which can be found thousands of times in any code base. We are using the return from one function as the parameter to another. I’ve intentionally chosen names which should make it somewhat unclear where these functions are defined: our code, third party code, or system code:

data_description foo_desc = get_data_description('foo');
data_set foo_data = load_data( foo_desc );

If ‘get_data_description’ were to fail we’d expect an exception or an error code. That is, we should not get to ‘load_data’ without a valid ‘foo_desc’. Instead of throwing an exception we can achieve the same guarantee with an ‘errno’ style check (I prefer using exceptions, but that’s a separate issue that doesn’t affect the logic of this example):

data_description foo_desc = get_data_description('foo');
if( errno != EOK )
    return; //let errno propagate
data_set foo_data = load_data( foo_desc );

As the programmer of this code we’ve made sure that we pass a valid ‘foo_desc’ to ‘load_data’. The ‘load_data’ code can then use this reference to open a file and load the needed data:

data_set load_data( data_description desc )
{
    file f( desc.get_primary_filename() );
    if( !f.readable() )
        //what now?
    ...

What do we do if the file is not readable as above? Some kind of error condition needs to be generated and propagated at this point. Is this an ‘error’ in the sense of a programming mistake, or is it a ‘fault’ in the sense of something happening in the environment? For now let’s say programming errors can be handled differently (perhaps via an abort). Clearly in the code above we won’t be passing an invalid ‘data_description’, since we did proper error checking.
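
As a concrete (if simplified) sketch, ‘load_data’ could signal this in the same errno style used above, by setting a code and returning an empty result. The specific error code and the ‘read_all’ call here are only illustrative:

data_set load_data( data_description desc )
{
    file f( desc.get_primary_filename() );
    if( !f.readable() )
    {
        errno = ENOENT;     //or whichever code best describes the failure
        return data_set();  //empty result; the caller must check errno
    }
    errno = EOK;
    return f.read_all();    //stands in for the real loading logic
}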

But what if ‘get_data_description’ itself does something wrong? What if it returns an invalid description and fails to trigger an error condition? This would result in passing an invalid description to the ‘load_data’ function after all, which would in turn not be able to read the expected file. This feels more like a fault at this point, as our own code is fine. It is the code we are calling that did something wrong (which might be part of an external library). Or perhaps the external code is fine but the system deleted the file between the two calls. Regardless of what really happened, from the point of view of ‘load_data’ there is no absolute way to know why the data description might be invalid.

Transitivity

Now let’s go one step further and add some more error handling into the original code:

data_set get_data_set( string name )
{
    data_description foo_desc = get_data_description( name );
    if( errno != EOK )
        return null; //let errno propagate
    data_set foo_data = load_data( foo_desc );
    if( errno != EOK )
        return null; //let errno propagate

    errno = EOK;
    return foo_data;
}

If that final check of ‘errno’ fails, what does that tell us? It tells us little more than the fact that our requested data could not be loaded. Does this final check need to know why the loading failed? The point I wish to make is that our code cannot know whether this failure is due to programmer error or to a system fault.

Variables form dependency chains. In this code ‘foo_data’ depends on ‘foo_desc’ which in turn depends on ‘name’. If we trace back to the caller the chain may continue backward quite far. It can also continue forward, as the caller of ‘get_data_set’ depends on the ‘foo_data’ value returned here. At any point in this chain a minor fault, or error, can occur and will carry forward in the chain. Not all functions along the way will necessarily detect the problem, nor will explicit detection code always identify one. This minor problem may mutate at each link of the dependency chain. Ultimately the fault arrives at code that depends in a tangible way on the integrity of the data. Inside the ‘load_data’ function, the code discovers it cannot load a file. The function cannot know, however, who is responsible for this problem.
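
To make the chain concrete, here is a hypothetical caller of ‘get_data_set’ (the ‘build_report’ and ‘render_report’ names are invented purely for illustration) that simply extends it one more link:

report build_report( string name )
{
    data_set d = get_data_set( name );  //depends on name via foo_desc and foo_data
    if( errno != EOK )
        return null;                    //whatever went wrong below, it keeps propagating
    return render_report( d );          //render_report now depends on d in turn
}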

Conclusion

Our code does not live in isolation. We link it with a variety of libraries and execute the program in many different environments. The sources of error are wide-ranging: the documentation for a library may be incorrect or misleading, or the library itself may not always function correctly. Occasionally there is a genuine hardware fault. Users have a way of finding clever and unexpected things to do which programmers may not have anticipated. In turn such defects may become a part of libraries that will be used by other programmers.

From the local point of view within a given function the notion of “fault” versus “programmer error” is somewhat arbitrary. A function has no way of knowing with absolute certainty which event triggered the circumstances of its internal state corruption. Thus the resulting action should not depend on how we define a given condition. In general we can’t easily say that if something is a “programming error” the program should be aborted whereas if it’s a “fault” we  can recover.

Even if programming error could be clearly identified, why would this be of value? Who would benefit from treating such an error differently? Surely our goal as good programmers is to produce quality code which others can successfully use. This includes detecting errors, providing relevant information that helps identify and correct the source, and recovering when possible.

12 replies

  1. Hi Edaqua (is this how you prefer to be called?),
    Allow me to disagree with the main point of this post. I understand that its purpose is to show that distinguishing between a “fault” (an unusual or undesired occurrence in the environment) and an “error” is irrelevant.

    You show one example with opening a file. I am not sure if one example is enough to draw conclusions, especially since reading files is a very particular case. For instance, two subsequent calls to function file_exists(name) may render different results.

    In general, at any given point, whenever a new value of any type is computed, we never can be sure if the value is the result of a bug or correct program behavior. For instance:

    double fun(double x)
    {
      double y = AbsoluteValue(x);
      return y;
    }
    

    Can we inspect value y and based on this determine whether we have a bug or not? Perhaps we got this value and not another because we have a bug? What if we inspect the value of errno and its value is 0? Can we be sure there is no bug? Perhaps the bug is in not setting errno to the appropriate error code. Any correct code can be accused of containing a bug, but there is no way to prove it. However, there are situations where we can determine with 100% certainty that we have a bug. You do this by planting assertions and preconditions in your code. (See here for assertion and here for precondition examples.) When you use them correctly you are sure that when an assert or a precondition or an invariant fires, you have a bug. To show what “using correctly” means I can demonstrate what using assertions incorrectly means. For instance, calling file_exists(name) inside an assertion is incorrect because the function is not pure: it may give different results on subsequent calls.

    The difference between a bug and a non-bug “fault” (in this case, when you cannot tell which one it is you have to assume a non-bug) is that in the former case you know you have a program that is in an invalid state. In the latter you have a program in a valid state working in a hostile environment. The difference is in valid versus invalid program state. If you do not pay attention to the error vs fault difference from the start (of your project), then you are right that later on you may find the difference somewhat fuzzy. However, planting asserts, preconditions and similar from the outset makes the distinction clear and well defined: it is your choice (to some extent) whether you want your program to make this distinction or not. You may say that you do not have control over third party libraries: whether they distinguish between errors and faults. This is true, and part of a more general statement: “a software component is robust (or correct) only if it is built of robust (correct) components”. You cannot make reliable software from non-reliable components.

    It is worth drawing a line between a bug and a fault because some people may want to treat each differently. For example, I prefer to restart my program when a bug is detected. Someone may want to inform the programmer (by sending email) about each detected bug. Even if you make a conscious call to treat both bugs and faults in the same way you may still want to draw the distinction, if only to be aware that in this case it is a hostile environment and in that case it is your bug. If you are writing a component library, do it (distinguish the two) for your users: they may want to treat bugs and faults differently. Drawing the line also helps you maintain the code. If you see code like:

    template<class T, size_t N, class V>
    std::array<T, N> to_array(const V& v)
    {
      std::array<T, N> d;
      if (v.size() == d.size())
        std::copy(v.begin(), v.end(), d.data());
      else
        d.fill(0);
      return d;
    }
    

    You do not know if any of the users relies on the “else” path or not (in the latter case you can remove the “else” part during maintenance). If the precondition is stated explicitly, there is no confusion:

    template<class T, size_t N, class V>
    std::array<T, N> to_array(const V& v)
    PRECONDITION(v.size() == N);
    

    Even if someone relies on the “else” part it is his error.

    Regards,
    &rzej

    • Thank you for the reply Andrzej.

      I’m quite okay with distinguishing precondition checks and invalid state checks (that is what my followup article is about). In that case the distinction is very useful. It is my contention that when a precondition check fails it is irrelevant whether it is a bug or a fault; the error handling can safely be the same. The same goes for recovery. (I fear our disagreement may be more about the terminology rather than the actual method to handle errors.)

      My problem with detecting bugs is that I don’t think you wish to detect just any bug, but rather bugs that you have introduced. That is, it is something that you have clearly and unambiguously coded incorrectly which leads to the error. This is as opposed to some third party library, or the OS having a bug, which would just be seen as a fault by you. Am I correct in this definition?

      My argument is based on the transitivity of code. It need not be a file, but basically any place where you call code which is not yours, it taints your variables/state such that you can no longer be certain if you’re doing the right thing. Basically code is complex enough that I don’t honestly believe you can isolate “your bugs” from “other people’s bugs” at the point where the error is detected. This last point is important: on global inspection I have no doubt you can determine which are your bugs; my followup article basically implies that a sane approach to handling errors would not be able to distinguish them.

      “My problem with detecting bugs is that I don’t think you wish to detect just any bug, but rather bugs that you have introduced. That is, it is something that you have clearly and unambiguously coded incorrectly which leads to the error. This is as opposed to some third party library, or the OS having a bug, which would just be seen as a fault by you. Am I correct in this definition?”

      The way I see it, you ship your code along with the “harness” that keeps checking for your own bugs. For instance, imagine that I am writing a library that you will be using. I define the following function:

      double compute(double x)
      POSTCONDITION(return >= 0)
      {
        double t = std::abs(x);
        ASSERT(t >= 0);
        return std::sqrt(t);
      }
      

      It does two different things: it computes the desired value (sqrt(abs(x))) but at the same time it keeps performing correctness checks to see if I did not plant a bug myself. When you use my library, and call my function, you do not check my function for having a bug, but my code still does check for my bugs. Thus in the final program, you check for your bugs, I check for my bugs. In the end, every piece of code is checked for its bugs.

      In this sense, when I write a program, I expect that the components written by vendor X are checked for bugs inside by vendor X, and I am interested in these being reported as bugs (differently than faults). I expect that a third party library should be capable of telling two different things: (1) when it entered an invalid state or (2) when it must refuse to execute a request because of system resource deficiency, or other fault. If the library cannot offer this, I consider it an issue of code quality.
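
      A minimal sketch of what such a two-channel report could look like (the function and names here are purely illustrative, not taken from any real library): a resource problem surfaces as an exception the caller may handle, while a broken internal invariant trips an assertion.

      #include <cassert>
      #include <cstddef>
      #include <new>
      #include <stdexcept>
      #include <vector>

      std::vector<char> load_block(std::size_t n)
      {
          std::vector<char> buf;
          try {
              buf.resize(n);  // may legitimately fail at run time
          } catch (const std::bad_alloc&) {
              // channel (2): a fault / resource deficiency, reported to the caller
              throw std::runtime_error("cannot allocate block");
          }
          // channel (1): if this ever fires, the library itself has a bug
          assert(buf.size() == n);
          return buf;
      }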

      —-

      “any place where you call code which is not yours it taints your variables/state such that you can no longer be certain if you’re doing the right thing.”

      That is a very pessimistic view. If it is really the case that you feel someone’s code “taints” your state, there is something wrong with this other code or yours: you are no longer in control of the code. In such a case, indeed, there is no way of telling if something is a bug or not, but I do not think it is normal or healthy. It is a deficiency of the program that responsibilities or borders are not clearly set.

      —-

      “Basically code is complex enough that I don’t honestly believe you can isolate ‘your bugs’ from ‘other peoples bugs’ at the point where the error is detected.”

      Things like preconditions or postconditions help you set this border. For instance, if a function’s precondition is violated it is definitely a bug in the caller rather than in the function. But I am not much interested in isolating ‘my bugs’ from ‘other people’s bugs’; I am more interested in isolating any bugs from non-bugs. And then dealing with all the bugs in a uniform way.

    • We are in agreement that distinguishing preconditions and entering invalid state is a valuable distinction to make. In my followup “How to handle an error” I say how this influences error handling, either unwinding or recovery. I also agree about trying to find and identify all bugs in the system.

      I am saying however that when you identify a bug it is not relevant whether that bug is from a programmer error or a genuine fault. Partly this is because I think genuine faults are exceedingly rare (these are limited to actual physical faults of a hardware device, otherwise it’d be a programmer error). Partly this is because I don’t see how handling it differently improves software quality.

      Consider your example, which I know is just an example, but it illustrates how easy it is to introduce ambiguity. What happens when I pass ‘NAN’ as a value to your function? Your ASSERT statement will fail, but does this indicate a bug in your code? No, indeed std::abs is working correctly, as is your function: the abs of NaN is NaN.
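
      A minimal standalone sketch of that situation (nothing here misbehaves, yet the check fires, because any ordered comparison with NaN is false):

      #include <cassert>
      #include <cmath>

      int main()
      {
          double t = std::abs(NAN);  // the absolute value of NaN is still NaN
          assert(t >= 0);            // fires: NaN >= 0 evaluates to false
      }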

      When you make a check you can definitely determine that something is wrong, that is not in question. My core point is that isolated in that function you cannot identify the true source of the bug: your coding error, caller error, third party library error, user input, OS issue, or an actual hardware fault. Trying to modify the error handling based on this classification will lead to difficulties in using the library, problems with reporting errors, and overall a reduction in quality, not an improvement.

    • I also believe that we agree in many aspects (perhaps all). I skip these, because, well, there is no point in discussing something we agree about :)

      Now I am starting to wonder if I understand what you mean by “fault”, and especially by “genuine fault”. Quoting your post:

      “A fault is deemed to be something that happens in the environment, say a lost network connection or damaged memory. Invalid input by a user is also a fault, but it’s often considered as a special class.”

      Ignoring invalid user input for now… “damaged memory” — do you refer to the case where you want your memory chip to store value “1” under memory address A, but due to some short-circuit or radiation from the outer space the value stored is actually “0”?

      And another question (to help me understand). How do you call the situation where your program tries to allocate memory, but the OS replies that it cannot offer more memory at the moment? Or similarly, when you request launching a new thread, but the OS replies that at present it cannot allow you to launch any more threads? Do you call it a “fault”, or is it something else (that is neither a fault nor a bug)?

    • A genuine hardware fault is the class where I believe one could make a clear distinction, if one wanted to. A physical component of the device has failed to operate within its expected norms (like my damaged memory example). For absolutely everything above this point I don’t consider there to be a difference between a bug and a fault: they are all errors.

      For your example, the failure to allocate memory. This could happen for a myriad of reasons: you’ve exceeded the overcommit allowance, there is no block large enough, a policy limit prevents it, or in some rare cases a spin-loop has timed out and the kernel simply decides not to allocate it. When “malloc” returns 0 all you know is that you have an “error”. It may have been your own code which led to this situation (by using too much memory) or some other process (a DB committing every page). Or perhaps your user has just provided a file which is too big to be loaded.

      Same with a thread. At the point where the thread creation fails I have an “error”. Sure, on some global inspection of the system I can probably figure out why it happened, but in my code, in the function which creates the thread, I have only success or error. To make it more interesting, consider that your function takes inputs which affect how the thread is created. Perhaps the thread creation is failing due to a particular combination of inputs. It’s quite possible you cannot check their correctness in advance, and also won’t know for sure afterwards. You’re just left with this “error” condition.

      Even go back to the hardware faults. It is rare that you will use hardware directly, even if you are doing very low-level programming. Usually you’ll still be going through some micro-code on the hardware itself. So if your drive write fails, you can’t really know if it is due to a hardware failure, an error in the micro-code, a failure of the driver to handle an expected hardware fault, or you misreading the docs. Your code has detected an “error”; that is all you really know.

      In the same way I call all invalid inputs an “error”. Your function, at the point it detects the problem, cannot be certain why you were given the invalid input. Perhaps the other coder read the docs wrong, forgot to check a condition, or you have a latent defect causing this transitive failure.

  2. So, I do believe we disagree on something essential. I would like to explore what it is.

    You are saying that in the following code

    	
    void * memory = malloc(10);
    if (memory) {
      // (1)
    }
    else {
      // (2)
    }
    

    If I get to place (2) I cannot be sure if I got there because of some obvious bug, or a “genuine fault”, or a “normal” run-time situation where the OS refuses to give me more memory because it would exhaust the limits allowed for me. I agree. Therefore wherever I cannot be sure if this is a bug or not, I assume it is not a bug. Note that I have a similar dilemma in place (1) in the code above. If malloc returns a non-zero value I also cannot be sure if it did it because it successfully allocated my memory, or because it has a bug. Perhaps it failed to allocate memory but (buggily) reported a success? But thinking this way would definitely be paranoia. I have to assume a non-zero result is not a bug.

    For similar reasons, I have to assume that (2) is not a bug. On the other hand, there are plenty of situations where I can be sure that I have detected a “genuine bug” and nothing else. Apparently, we disagree in this perception of “plenty” vs “hardly any”.

    Each time I put a precondition or an assertion, I make the call that I want the violation of this condition to be treated and reported as a “genuine bug” (or genuine fault) — but never as an “expected exceptional situation”.

    For instance, I could not use assert in the above example with malloc because I admit the possibility of this being a valid program behavior. So let me give you an example of what I consider a proper use of a precondition:

    double Sqrt(double x)
    PRECONDITION(x >= 0)
    {
      // ...
    }
    

    Whether to also filter out NaN, I consider a secondary problem right now. With this I am 100% sure that if some programmer decides to call Sqrt and gives me a negative argument, he is directly responsible for the bug, and it is his bug and nothing else:

    void fun1()
    {
      Sqrt(-1);
    }
    
    void fun2()
    {
      double d;
      std::cin >> d;
      Sqrt(d);
    }
    
    void fun3()
    {
      double d;
      std::cin >> d;
      Sqrt(std::abs(d));
    }
    
    void fun4()
    {
      constexpr double d = 4.0;
      Sqrt(d);
    }
    

    fun1 has a genuine bug: it should have never called Sqrt with a negative argument. fun2 has a genuine bug: since the user has the freedom to enter any number, including a negative one, it should never have forwarded the input to Sqrt unchecked. fun3 does not have a bug: it made sure the user input is processed before passing it to Sqrt. If in fun4 for some magical reason d happens to be negative (perhaps due to a compiler bug), admittedly, this is not the programmer’s fault, but I still want it to be categorized in the broader class of “bug”: either a programmer bug, a bug in the tool, or a “genuine fault”, but never as a run-time exception.

    Note that if I hadn’t expressed the precondition, I would never be able to make this distinction between a “100% bug” and a “probable bug”. It is only due to my decision that I am able to tell them apart. And I could have made a bug when forming the predicate in the precondition. But then a failing precondition would still be a bug: this time — mine.

    The key distinction (using the nomenclature of ‘Mike’ in one of the replies to my post: http://akrzemi1.wordpress.com/2013/01/04/preconditions-part-i/#comment-917) to be made is between “undesired” things that you expect and “undesired” things that you do not expect. For instance, I do expect that malloc might return 0. But I do not expect function Sqrt to be passed a negative argument. I pass this responsibility on to the callers, and I state this explicitly with the precondition. Similarly here:

    int i = 0;
    for (; i < 10; ++i) {
      std::cout << v[i];
    }
    assert(i == 10); // (A)
    

    I do not expect that i would ever be different than 10 at point (A).

  3. The problem I have with this line of argument is that it requires all code to eventually look like the code below. Here the caller is forced to do excessive input checking all the time.

    
    Object a = get_a_from_somewhere();
    
    if( not valid_for_foo( a ) )
      error;
    else
      foo( a )
    
    ...
    function foo( Object a )
    PRECONDITION( valid_for_foo( a ) )
    {
      ...
    }
    

    You have put the onus on the caller to do the exact same checking that you will be doing in your function. First, this requires a lot of wasted duplication. Secondly, it makes the assumption that the caller can know with absolute certainty what the valid inputs to the function are — which I do argue is virtually never the case. This also assumes that preconditions can be fully checked, but a lot of the time it is only during the actual processing that a problem is noticed.

    I do make the assumption that libraries will generally work as promised. This is exactly where my transitive argument comes from. If I make a long chain of assumptions, once I do detect an error, I cannot presume to know which part of the chain caused the error. Because I cannot presume to know this I have to assume that such errors will arise at runtime and that a reasonable programmer could not have avoided all of them.

    • I believe that to some extent the concern about the duplication is unjustified. I tried to describe the philosophy of dealing with preconditions in this post.

      The caller of foo does not have to check for meeting foo’s precondition himself: he only needs to guarantee that the precondition will hold. There are other ways of doing this. The post I am referring to gives such examples. Function foo itself also does not need to check for the precondition: sometimes it is not even allowed to (examples provided in the post). In languages with native support for contract programming it is the run-time that checks the preconditions (if you want it to) before function foo is called. Also, advanced tools could perform static program analysis based on contracts.

      My experience is different than your impression. In most of the cases, for well designed interfaces you can specify a precondition. It is possible that it will not cover all invalid inputs, but the goal for the bug checks is not to detect all bugs: the goal is only to detect as many bugs as possible.

      One of the points I am trying to make is that it is possible to detect situations where we caught a bug with 100% certainty. You may have concerns regarding preconditions, but do you have any concerns regarding assertions like the following?

      int i = 0;
      for (; i < 10; ++i) {
        std::cout << v[i];
      }
      assert(i == 10); // (A)
      

      If an assertion fires I am sure it is a bug (in a wide sense: my bug, a bug in the compiler, or a “genuine fault”). I am not concerned whose responsibility it is: mine, yours, the compiler vendor’s — I do not care; however I insist that these should be reported separately from “expected undesired situations”, because some users may want to handle bugs differently.

    • Your “assert” in this sense is a post-condition check, and I agree these are distinct from pre-condition checks. They need to be handled differently as they imply some state is invalid. I do cover this in my next article. I think this distinction is very important at the point where the error is discovered, and yes I believe this distinction can usually be made.

      My problem largely stems from what “assert” will do. And again, this goes into my followup article. Assuming this is a programmatic error allows you to do drastic error handling, whereas assuming a runtime error forces cleaner error handling here. You simply cannot know the situation in the caller and thus your best option is to propagate the error to the caller. Just as you assume your function will be called correctly, you will assume the caller handles errors correctly (or passes them on in turn).

      Even from your own post it is easy to see that somebody may not do validation and simply call checkIfUserExists. They expect the preconditions to catch any invalid-input errors. The caller need not waste any effort checking since it knows the function will return an error if the input is invalid. The function has to do the precondition checks anyway, as it’s the only safe option, thus I allow the caller to make the assumption that such precondition checks are indeed performed.

    • “My problem largely stems from what «assert» will do.”

      — I agree that the guy who detects the bug at run-time is not in a good position to decide how the program should respond to it. However, the word “propagate” makes me uneasy because it implies (or am I wrong?) stack unwinding.

      There are situations where I would like to avoid stack unwinding in my program when a bug is detected. A more general solution that would satisfy different expectations, ranging from terminating immediately to throwing an exception, would be to call a callback named, say, ON_DETECTING_A_BUG, and have the person who assembles the entire program from components/libraries decide what this callback will do.
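
      A rough sketch of such a hook, with purely illustrative names (nothing here comes from an existing library), might be:

      #include <cstdlib>
      #include <functional>

      // Default policy: terminate immediately. Whoever assembles the final program
      // may replace it, e.g. with a handler that reports the bug and then aborts,
      // or one that throws an exception.
      std::function<void(const char* cond, const char* file, int line)>
          ON_DETECTING_A_BUG = [](const char*, const char*, int) { std::abort(); };

      #define CHECK_BUG(cond) \
          do { if (!(cond)) ON_DETECTING_A_BUG(#cond, __FILE__, __LINE__); } while (0)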

      Indeed, this comment should probably belong to your other post. On the other hand, here it is in the context of our comment exchange.

    • In the spirit of this post I think the callback should be called ON_PRECONDITION_FAILED, or ON_STATE_INVALID. This does belong to the other post, but all these posts are to be considered together.

      If you generally agree the code that detects the error is not the best to decide what happens I think we can ignore any lingering disagreement about the “source” of that error (be it fault, bug, or otherwise).

      I am implying stack unwinding. Code which can’t unwind should be a rare exception. If you have cases where you disagree about that, please put them on the other article (I don’t think anybody will read this far down on one post anymore).
