Some languages hand us sharp knives and encourage us to play with them. Other languages put us in padded rooms and discourage us from doing anything at all. Though it may sound quite negative, this is often how the debate over language safety ends up being portrayed. But we can’t make sense of either of those views until we address the fundamental question: What does it even mean for a language to be safe?

It is a tricky question to answer. Different camps have different opinions. Often the topic is restricted solely to type safety, which is only one aspect of language safety. How the program interacts with the machine and how it combines with other code is also important. Here I will discuss some of the major aspects of what could be considered language safety.

While this could be used to check the level of safety a language offers, do not confuse this with the quality of the language. Safety is an ideal, but not the ultimate driving factor in language design. I’m not even sure it is possible to create a language which is completely safe and useful at the same time.

Implementation

It isn’t enough to have a language which is safe in theory; it must also be safe in practice. In this sense what must be safe is not just the language, but the language implementation (also known as its environment). While the specification may not have to give concrete details of how compilation, interpretation, or deployment is done, it must provide a framework where this is possible without loss of safety.

This is extremely relevant since it is usually practical considerations which drive language design away from being safe. If an implementation lacks required functionality to be useful, or it ends up being too slow, then nobody will actually care if the language itself is safe. Our goal is to create usable computer programs. Thus the safety features of a language should not be regarded separately from the actual implementation of that language.

Protected Abstract Machine

Languages are specified in terms of functionality on an abstract machine. This is the virtual hardware on which the program is run. It defines the fundamental mechanics of the language such as its memory model and execution flow. Your program should not be able to break these mechanisms. This is much like a protected operating system where the user space programs cannot break the kernel or other processes.

What this means concretely depends on the language, but there are a few common aspects. You should not be able to corrupt the stack or the heap. Things like buffer overflows or double frees must either be impossible, or be detected and blocked. All the internal bookkeeping needed by the abstract machine must always be safe from invalid modification. Corrupt internals naturally lead to undefined behaviour, which is of course not safe.
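As a small illustration of “detected and blocked”, here is a minimal sketch in Python (chosen only for illustration; the helper name `write_at` is hypothetical). An out-of-range write to a fixed-size buffer raises an error instead of silently overwriting whatever lies next to the buffer.

```python
def write_at(buffer: bytearray, index: int, value: int) -> bool:
    """Try to store value at index; report whether the write was allowed."""
    try:
        buffer[index] = value
        return True
    except IndexError:
        # The runtime detected the out-of-range access and blocked it;
        # nothing outside the buffer was touched.
        return False


buffer = bytearray(4)
in_range = write_at(buffer, 2, 0xFF)       # inside the 4-byte buffer
out_of_range = write_at(buffer, 10, 0xFF)  # past the end of the buffer
```

The second call fails cleanly and the buffer keeps only the one valid write. The price is a bounds check on every access, which is the kind of practical cost that pushes some languages away from this guarantee.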

Your execution should never diverge from your program code. Again here we have stack corruption, but attempting to invoke member functions on partially initialized classes can also do this. Calling a function through an invalid pointer is another common defect. Ideally a safe language would not even allow such an error to happen, but at the least it should somehow be detected and blocked.

Type Safety

Perhaps the most discussed aspect of languages is that of type safety. Every language presents its own view on what it means and how it is implemented. Many discussions of type safety actually leave that realm and enter the one we’re discussing now (general language safety). Let’s just summarize type safety as simply not being able to do something with a type that doesn’t logically make sense. For example, assigning a string to an integer variable, using a Service object as a File object, or passing a complex number as a double parameter.

Whether the type safety is provided statically (compile-time) or dynamically (run-time) is a relevant aspect. In terms of safety, static checking has the advantage of removing surprises. If the code compiles you know you don’t have type errors. Note again that this article is about safety, so I’m not saying that static typing is better, only that it is probably safer.
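To make the “surprise” concrete, here is a minimal sketch of dynamic checking in Python (the function name `total_length` is hypothetical). The type error is still caught, but only at the moment the offending line runs with offending data.

```python
def total_length(items):
    # No declared types: nothing is checked until this line actually runs.
    return sum(len(item) for item in items)


ok = total_length(["ab", "cde"])

try:
    total_length(["ab", 42])  # an int has no length
    surprise = None
except TypeError as error:
    # Detected and blocked, but only at run time.
    surprise = type(error).__name__
```

A statically checked language, given a parameter declared as a list of strings, would reject the second call before the program ever ran. Here the defect stays hidden until some input happens to reach it.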

Type safety helps protect you from modifying your data in ways you don’t intend. It is also one of the techniques used to protect the abstract machine, though in itself it is insufficient. It is always a matter of degree: type safety is generally a trade-off between safety and performance (either running time or compilation speed).

Clarity

Generally when we program we assume what we write will be executed as we intend. If the syntax of a language is too complicated, or does not well match our domain, it can often be difficult to ascertain what exactly is happening. You can’t exactly rely on the program executing correctly if you can’t be certain of what you’ve written. A safe program must remain safe through years of iterative development.

Code which is hard to decipher is often misunderstood and can lead to defects. If subtle variations can produce greatly different effects, I would argue it is not a safe language. Also, if the syntax-to-semantics ratio is too high (a large number of symbols to accomplish little) the intent of the code can easily get lost. Advanced C++ templates are an example of this. Mismatched or inconsistent syntax is also an issue. For example in Haskell the curried function signatures don’t actually match the non-curried function definition.

Obviously nothing can prevent a careless programmer from making mistakes, but a language should avoid hiding things which are relevant. One syntactic counterexample is Python, where visually undetectable changes in whitespace can alter the scope. One semantic counterexample is C++, where without seeing a function signature it is impossible to know if a parameter is passed by value or by reference (the call-site syntax is identical).
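The Python case can be sketched as follows. The two functions (hypothetical names) differ only in the indentation of the final line, yet they compute different things.

```python
def count_positive_a(values):
    count = 0
    for value in values:
        if value > 0:
            count += 1
    return count  # after the loop: counts every positive value


def count_positive_b(values):
    count = 0
    for value in values:
        if value > 0:
            count += 1
        return count  # one level deeper: returns during the first iteration


result_a = count_positive_a([1, 2, 3])
result_b = count_positive_b([1, 2, 3])
```

The first returns 3 and the second returns 1. Both are valid programs, so nothing warns you about the difference.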

There’s a bit of subjectivity here and it also relates a lot to programmer experience. Clarity is nonetheless an important aspect of safety: it’s very hard to know if the code is correct if you don’t know what it is doing.

Resources

Computers still have limited resources, so efficient management must be possible. The choice to include this under safety is a bit tenuous, but I include it since resource exhaustion often has catastrophic effects on both your program and other programs on the system. It is akin to breaking the abstract machine, which we do consider a violation of safety.

A language needs reasonable ways to ensure resources are not being overused or leaked. Obviously it can’t prevent you from intentionally doing this, but it should attempt to eliminate the cases where you don’t mean to. For example, if you use a lot of shared pointers in C++ you could end up with cycles in your object references. Though you think you’ve freed the objects, those internal pointers will keep them around, along with any resources they use. While a scanning garbage collector could deal with the cycles, it has another problem: it doesn’t support an RAII pattern. Thus if you run a loop that acquires limited system resources, like file handles, you can quickly run out since the collector hasn’t yet cleaned up the previous ones.
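The same effect can be sketched in Python, assuming the CPython implementation, which combines reference counting with a scanning collector. The `Node` class is hypothetical, and the automatic collector is switched off only to make the timing visible.

```python
import gc
import weakref


class Node:
    def __init__(self, name):
        self.name = name
        self.other = None


gc.disable()  # keep the scanning collector from running on its own

a = Node("a")
b = Node("b")
a.other = b
b.other = a  # a and b now reference each other
probe = weakref.ref(a)  # lets us observe a without keeping it alive

del a, b  # we "freed" both objects...
alive_after_del = probe() is not None  # ...but the cycle keeps them alive

gc.collect()  # an explicit scan finds and frees the cycle
alive_after_collect = probe() is not None

gc.enable()
```

The objects survive the `del` and disappear only once the collector scans. For something like a file handle, Python’s `with` statement is the usual way to release it at a known point rather than whenever the collector gets around to it.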

This is still an area where all languages rely heavily on programmer responsibility.

Error Handling

Errors are inevitable and a program must be ready to deal with them. Any system function can fail for a large number of reasons. The input to the program can be invalid. The operating system can generate events, exceptions, or signals which need to be handled by your program. Internal errors and defects can also be detected by your code itself. In a multitude of ways your program can find itself in an error situation.

Somehow you will need to deal with these errors. First you need to detect the error. Once detected you can either correct the error, consume the error, or pass it along. The complexity of the syntax for doing this relates a lot to the point about clarity. You don’t want the intent of your code to get lost within the error handling. As the complexity of error handling increases, so does the desire to just avoid it altogether. Not handling errors will of course lead to very unsafe programs.

This is one area where I don’t think any language has it right. C++, Java, and Lisp all use exceptions, though with significant differences. Go’s designers have primarily returned to C style error codes. Haskell combines error checking with its type system via error carrying monads. So we have a multitude of attempts. Something like exceptions is needed, but nothing yet feels correct or natural.
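To show the difference in flavour, here is a minimal sketch of the two main styles side by side, written in Python for brevity. The function names and the port-parsing task are hypothetical.

```python
def parse_port_code(text):
    """Error-code style: return (value, error); the caller must check error."""
    if not (text.isascii() and text.isdigit()):
        return 0, "not a number"
    port = int(text)
    if not 0 < port < 65536:
        return 0, "out of range"
    return port, None


def parse_port_exc(text):
    """Exception style: return the value or raise ValueError."""
    port = int(text)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError("out of range")
    return port


# Error-code style: if the caller forgets to look at `error`,
# the bogus port 0 flows onward unnoticed.
port, error = parse_port_code("http")

# Exception style: the failure interrupts the normal flow until handled.
try:
    parse_port_exc("http")
    caught = None
except ValueError:
    caught = "ValueError"
```

Neither is obviously right. The error-code version keeps the control flow visible but is easy to ignore, while the exception version is hard to ignore but hides which calls can fail.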

Packaging and Libraries

Programs tend not to be monolithic, rather employing a variety of shared code, either as an included file or a shared library. Some aspects are resolved at compile time, some at load time, and some even later at run time. This is one of the key areas where practical considerations must play a role in the language design.

Consider a simple scenario I experienced in Java. It’s possible, through using a variety of libraries, to have multiple definitions of a class; in my case it was the “Node” class. The loading mechanism was insufficient to detect this problem at runtime, allowing the code to run, but producing strange results when the wrong Node version was passed to a function. A similar problem results from loading the wrong shared library in the classic DLL Hell in Windows. During development this also comes up with Automake not recompiling certain outdated libraries.

Simply put, it is irrelevant how safe two individual modules are if they are incompatible. A language should provide a proper versioning system and prevent incompatible modules from being used, both at compile time and at run-time. Mismatched modules are a frequent source of defects in development. The worst kind are the minor differences which still allow the program to run, but produce incorrect results. This is a huge blow to the safety of a language.
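What such a check might look like at load time can be sketched as below. This is purely illustrative: the compatibility rule (same major version, equal or newer minor version), the function names, and the version table are all assumptions of mine, not the mechanism of any particular language.

```python
def is_compatible(required, provided):
    """Assumed rule: same major version, and provided minor >= required minor."""
    required_major, required_minor = required
    provided_major, provided_minor = provided
    return provided_major == required_major and provided_minor >= required_minor


def load_module(name, required, available):
    """Refuse to hand out a module whose version does not satisfy the caller."""
    provided = available[name]
    if not is_compatible(required, provided):
        raise ImportError(f"{name}: need {required}, found {provided}")
    return name, provided


# Hypothetical table of installed module versions as (major, minor).
available = {"node": (2, 1)}

loaded = load_module("node", (2, 0), available)  # newer minor: accepted

try:
    load_module("node", (1, 4), available)  # different major: refused
    refused = False
except ImportError:
    refused = True
```

The point is only that the mismatch is reported when the module is loaded, instead of surfacing later as strange results.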

Portability

Often left out of this discussion is the issue of portability. It must be included, however, as it is the source of a lot of defects. Despite languages having an abstract machine, there will often be differences between the various platforms on which it runs. A perfectly safe language would generally not alter behaviour, and would provide detection mechanisms for when behaviour would be altered. In practice, however, this is not possible.

The exact nature of fundamental types is often not prescribed by a language. If it were, the emulation of these exact types would become a huge performance hit on platforms which don’t natively support them. So things like real numbers become floating point, and which exact floating point standard is used is left to the chip designers. The size of integral types also varies from machine to machine, which changes the limits on those values. Part of this relates to type safety: the language must obviously use these machine dependent types, but it should be able to detect if they are insufficient for how they are used. The layout of variables in memory also varies to accommodate differing alignment requirements.

Concurrency complicates this a great deal. Even within a single chipset the timings can change enough to exhibit different race conditions. A lot of people also implicitly rely on the x86 memory ordering guarantees only to find their code completely fails on an ARM chip. Your algorithms could inadvertently rely on a specific cache coherency strategy, possibly suffering huge performance losses on other chips.

Detection is a big problem here. Ideally a compiler, or interpreter, would alert you to any platform specific behaviour. Yet in most cases it can’t. For simple issues like type sizes, static assertions or constrained types (as in Ada) are an option. Many languages include portable concurrency constructs (like mutexes and atomics) but they still can’t detect when you should be using them. I’ve yet to see a satisfactory answer to the portability problem.
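Both options for type sizes can be approximated in a short sketch. Python has no compile-time assertions, so the size check here runs at start-up instead. The helper names and the 32-bit requirement are assumptions for illustration.

```python
import struct


def signed_limits(n_bytes):
    """Smallest and largest value of a two's-complement integer of n_bytes."""
    bits = 8 * n_bytes
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def checked(value, n_bytes):
    """Constrained-type style check: reject values the type cannot hold."""
    low, high = signed_limits(n_bytes)
    if not low <= value <= high:
        raise OverflowError(f"{value} does not fit in {n_bytes} bytes")
    return value


# Width in bytes of the platform's native C long; this varies by platform.
native_long = struct.calcsize("l")

# Assertion-style guard: fail early if the platform type is too narrow
# for the (assumed) requirement of at least 32 bits.
assert native_long >= 4, "C long is narrower than 32 bits on this platform"

try:
    checked(2**31, 4)  # one past the largest 32-bit signed value
    rejected = False
except OverflowError:
    rejected = True
```

This catches a type that is too small, or a value that is too large, at a known point. It does nothing for the harder cases above, such as memory ordering, which is exactly the gap I’m complaining about.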

Is the language safe?

This discussion naturally leads to one of programmer responsibility. Will a reasonable programmer be able to write safe programs? Can a single programmer on the team inadvertently break the entire system? I don’t think we can rely entirely on competent programmers. We all have lapses and make mistakes at times. If simple mistakes can ruin the safety, then the language probably isn’t that safe. A language can, for practical reasons, offer unsafe constructs, but those should definitely not be the default, and must be clearly indicated. This is part of why clarity is so important.

Application domain is also a strong consideration. Different languages serve different purposes, and have different strengths and weaknesses. It is entirely reasonable that while a language may be safe in one domain, it is completely unsafe in another. Perhaps the type of safety required is just one more consideration in the choice of a programming language.

Ultimately I don’t think there is such a thing as a “safe” language. All languages make trade-offs between safety and flexibility and performance. If we wish to compare languages we have to include all the considerations I’ve listed above. But note that safety is not the goal; preventing defects is the goal. It probably doesn’t even make sense to consider the safety of a language as some kind of isolated feature.
