
Is there a difference between a “fault” and an “error” within a computer program? A number of people commented on my previous article suggesting this to be the case. I will try to demonstrate in this article that this distinction is generally not relevant in practice.

First, what is an “error” and what is a “fault”? A fault is deemed to be something that happens in the environment, say a lost network connection or damaged memory. Invalid input by a user is also a fault, though it’s often treated as a special class. An error, by the conventional definition, is a mistake made by the programmer, such as passing the wrong parameters to a function or forgetting to initialize a data structure.

Common Example

Below is a small fragment of code. It represents a very common pattern which can be found thousands of times in any code base. We are using the return from one function as the parameter to another. I’ve intentionally chosen names which should make it somewhat unclear where these functions are defined: our code, third party code, or system code:

data_description foo_desc = get_data_description("foo");
data_set foo_data = load_data( foo_desc );

If ‘get_data_description’ were to fail we’d expect an exception or an error code. That is, we should not get to ‘load_data’ without a valid ‘foo_desc’. Instead of throwing an exception we can achieve the same guarantee with an ‘errno’ style check (I prefer using exceptions, but that’s a separate issue that doesn’t affect the logic of this example):

data_description foo_desc = get_data_description("foo");
if( errno != EOK )
    return; //let errno propagate
data_set foo_data = load_data( foo_desc );
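For comparison, the exception-based form of the same guarantee might look something like the sketch below. The type definitions, the function bodies, and the wrapper name ‘get_foo_data’ are hypothetical stand-ins, since this article never defines them; only the control flow matters here.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the article's types; the real ones are not shown.
struct data_description { std::string primary_filename; };
struct data_set { std::string source; };

// Assumed behaviour: throws when no description exists for the name.
data_description get_data_description(const std::string& name) {
    if (name.empty())
        throw std::runtime_error("no data description for an empty name");
    return data_description{name + ".dat"};
}

// Placeholder body: just records which file would have been loaded.
data_set load_data(const data_description& desc) {
    return data_set{desc.primary_filename};
}

// If get_data_description throws, load_data is never reached.
// That is the same guarantee the errno check provides.
data_set get_foo_data(const std::string& name) {
    data_description foo_desc = get_data_description(name);
    return load_data(foo_desc);
}
```

Calling ‘get_foo_data’ with an empty name never reaches ‘load_data’; the exception simply propagates to the caller, just as ‘errno’ does in the version above.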

As the programmer of this code we’ve made sure that we pass a valid ‘foo_desc’ to ‘load_data’. The ‘load_data’ code can then use this reference to open a file and load the needed data:

data_set load_data( data_description desc )
{
    file f( desc.get_primary_filename() );
    if( !f.readable() )
        //what now?
    ...

What do we do if the file is not readable as above? Some kind of error condition needs to be generated and propagated at this point. Is this an ‘error’ in the sense of a programming mistake, or is it a ‘fault’ in the sense of something happening in the environment? For now let’s say programming errors can be handled differently (perhaps via an abort). Clearly in the code above we won’t be passing an invalid ‘data_description’, since we did proper error checking.

But what if ‘get_data_description’ itself does something wrong? What if it returns an invalid description and fails to trigger an error condition? We would then pass an invalid description to the ‘load_data’ function after all, which in turn would not be able to read the expected file. This feels more like a fault at this point, as our own code is fine. It is the code we are calling that did something wrong (which might be part of an external library). Or perhaps the external code is fine but the system deleted the file between the two calls. Regardless of what really happened, from the point of view of ‘load_data’ there is no way to know for certain why the data description is invalid.
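One way to answer the “what now?” question is to report only what ‘load_data’ actually observed. The sketch below is a minimal illustration under assumptions: the ‘load_error’ type and the struct definitions are hypothetical, and ‘std::ifstream’ stands in for the ‘file’ class used earlier.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the article's types.
struct data_description { std::string primary_filename; };
struct data_set { std::string contents; };

// Hypothetical error type: it records what was observed, not who is to blame.
struct load_error : std::runtime_error {
    std::string filename;
    explicit load_error(const std::string& file)
        : std::runtime_error("cannot read data file: '" + file + "'"),
          filename(file) {}
};

data_set load_data(const data_description& desc) {
    std::ifstream f(desc.primary_filename);
    if (!f.is_open()) {
        // A bad description, a deleted file, or a failing disk all
        // end up here, and nothing at this point tells them apart.
        throw load_error(desc.primary_filename);
    }
    std::string contents((std::istreambuf_iterator<char>(f)),
                         std::istreambuf_iterator<char>());
    return data_set{contents};
}
```

An empty filename (arguably a programming error upstream) and a missing file (arguably a fault) raise exactly the same ‘load_error’. The function can state which file it failed to read, but not why.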

Transitivity

Now let’s go one step further and add some more error handling into the original code:

data_set get_data_set( string name )
{
    data_description foo_desc = get_data_description( name );
    if( errno != EOK )
        return null; //let errno propagate
    data_set foo_data = load_data( foo_desc );
    if( errno != EOK )
        return null; //let errno propagate

    errno = EOK;
    return foo_data;
}

If that final check of ‘errno’ fails, what does that tell us? It tells us little more than the fact that our requested data could not be loaded. Does this final check need to know why the loading failed? The point I wish to make is that our code cannot know whether this failure is due to programmer error or to a system fault.

Variables form dependency chains. In this code ‘foo_data’ depends on ‘foo_desc’, which in turn depends on ‘name’. If we trace back to the caller the chain may continue backward quite far. It can also continue forward, as the caller of ‘get_data_set’ depends on the ‘foo_data’ value returned here. At any point in this chain a minor fault, or error, can occur and will be carried forward. Not all functions along the way will detect the problem, and even explicit detection code will not always catch it. The problem may also mutate at each link of the dependency chain. Ultimately it arrives at code that depends in a tangible way on the integrity of the data: the ‘load_data’ function discovers that it cannot load a file. That function cannot know, however, who is responsible for the problem.
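The chain can be sketched in a few lines. Everything below is a hypothetical illustration: the in-memory ‘files’ map replaces a real file system so the example is self-contained, and the function bodies are assumptions rather than the real implementations.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the article's type.
struct data_description { std::string primary_filename; };

// Hypothetical in-memory "file system" holding a single file.
const std::map<std::string, std::string> files = {
    {"foo.dat", "foo contents"}
};

// This link does not validate its input: a mangled name still
// produces a plausible-looking description.
data_description get_data_description(const std::string& name) {
    return data_description{name + ".dat"};
}

// The first link that depends on the data in a tangible way.
// Returns nullptr when the file cannot be found.
const std::string* load_data(const data_description& desc) {
    auto it = files.find(desc.primary_filename);
    return it == files.end() ? nullptr : &it->second;
}

const std::string* get_data_set(const std::string& name) {
    return load_data(get_data_description(name));
}
```

Here ‘get_data_set("foo ")’ (a caller's typo, with a trailing space) and ‘get_data_set("bar")’ (a file that is simply absent) both come back as a null pointer from ‘load_data’. The typo passes through ‘get_data_description’ undetected and only surfaces at the end of the chain, where it looks identical to a missing file.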

Conclusion

Our code does not live in isolation. We link it with a variety of libraries and execute the program in many different environments. The sources of error are wide-ranging. The documentation for a library may be incorrect or misleading, or the library itself may not always function correctly. Occasionally there is a genuine hardware fault. Users have a way of finding clever and unexpected things to do, which programmers may not have anticipated. In turn, defects in our own code may become part of libraries that are used by other programmers.

From the local point of view within a given function the notion of “fault” versus “programmer error” is somewhat arbitrary. A function has no way of knowing with certainty which event led to its invalid internal state. Thus the resulting action should not depend on how we classify a given condition. In general we can’t simply say that a “programming error” means the program should be aborted, whereas a “fault” is something we can recover from.

Even if a programming error could be clearly identified, why would this be of value? Who would benefit from treating such an error differently? Surely our goal as good programmers is to produce quality code which others can successfully use. This includes detecting errors, providing relevant information that helps identify and correct the source, and recovering when possible.