The Life of a Programmer

Abstract Machines, Interpreters and Compilers

 

Abstract machines are the core of what defines a programming language. A wide variety of operating systems and a plethora of hardware ensure a bewildering number of things we call computers. When we write a program, we don’t want to think about the differences between all these platforms. We’d like to abstract away from all that and worry only about our problem domain. To do so, we choose a portable programming language: one that doesn’t care about what hardware it is running on.

If we don’t say what machine our program is running on, how do we know what the code is doing? A program is nothing more than a series of instructions; they have to control something. To avoid dependence on a specific piece of hardware, we invent a new machine. We define the characteristics of this machine and then describe the language as a way to manipulate it.

This machine is called the abstract machine. The C++ specification explicitly mentions this machine, as does the Java standard. Python introduces it under the term “core semantics”. In academia, both Turing machines and the lambda calculus are examples of abstract machines. Not all of these are described explicitly as machines, but rather as a series of behaviours and rules the language follows.

Simple Example

Let’s say we have a language called “Simple” which supports the following construct.

a = 5
b = 3

The specification simply says this stores the value 5 in a variable called a and 3 in b. Later we have the following code.

c = a + b

The specification for Simple defines the above code to: load the values of a and b; add those values together; and store the result in c. To finish it up we write this code.

print c

This prints the value stored in c. In our case this should print 8.

These rules describe the basic mechanisms of our abstract machine. The machine can store a value with a name and retrieve that value at a later time. It can perform simple addition and store the result. It must also be able to display values to the user. It doesn’t say how to do any of this. There is no mention of a stack, a CPU, operating system, nor any notion of compilation. It doesn’t care so long as 8 is printed.

An Interpreter

If we wished to use the Simple language, we’d have to create an implementation of its abstract machine. The direct approach is an interpreter: a program that behaves like the abstract machine. That is, in some other language we’ll write a program which parses Simple code and processes it. Let’s take a quick look at what this program would have to do.

We can implement our memory model with a map to store variables. The name of the variable is the key, and we’ll assume the values are integers. Assigning to a variable means storing that value to a given key in the map. Loading a variable means looking up a value in the map.

To process the code, we read the source file, parse each line, and call a handler function. An assignment statement writes a value into the memory map. An addition reads two values from the map, adds them, and stores the result back in the map. The print statement delegates to a console write on the operating system.

Once we’ve written our interpreter, we can give it the above code, and it will display the result 8.
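As a sketch, here is what such an interpreter might look like in Python. The parsing is deliberately naive, handles only the three constructs of Simple, and the function name is illustrative.

```python
# A minimal interpreter for the Simple language: a dict serves as the
# abstract machine's memory, mapping variable names to integer values.

def run(source):
    memory = {}
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("print "):
            # The observable effect: delegate to the host's console.
            print(memory[line[len("print "):].strip()])
        else:
            target, expr = [part.strip() for part in line.split("=")]
            if "+" in expr:
                left, right = [part.strip() for part in expr.split("+")]
                memory[target] = memory[left] + memory[right]
            else:
                memory[target] = int(expr)

run("a = 5\nb = 3\nc = a + b\nprint c")   # prints 8
```

Note that nothing here resembles the eventual hardware: the “memory” is a hash map and “print” is whatever the host language provides.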

There, we’ve created an interpreter for Simple’s abstract machine. We’ve taken the rules of the Simple language and created a program which understands them. This interpreter is capable of running any Simple program. Languages like Perl or Python kind of take this approach — we’ll get back to why only “kind of”.

The terminology can get a bit fuzzy here. Our interpreter could also be called a virtual machine. It’s a guest program, running on a real host machine, which runs according to the rules of the abstract machine.

A Compiler

Suppose we want to run directly on an actual chip, without an interpreter. We can’t change how the chip works; thus we can’t create our abstract machine directly. Instead, we’ll translate the source code into new code for the machine. This is what a compiler does.

We need to translate the memory model, as the chip refers to memory by index, not by name. To work on the chip, we need to convert all of our names into indexes — enumerate all the names used in the program and give them a sequence number. Anywhere a name is used we substitute this index instead. The assignment statement is translated into a machine “move” instruction, assigning a value to memory.

Possibly our chip doesn’t have an add instruction that works directly with memory indexes. We will have to first move the data into a temporary location, call the add instruction, and move the result back to the target index.
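Continuing the sketch in Python, a compiler for Simple might enumerate the names and emit instructions for this hypothetical chip. The instruction names below are invented for illustration, and only assignment and addition are handled.

```python
# A sketch of compiling Simple to an invented instruction set. Variable
# names are enumerated into memory indexes; addition goes through a
# temporary register since this chip can't add two memory cells directly.

def compile_simple(source):
    indexes = {}
    def slot(name):
        # Assign each name the next free memory index on first use.
        return indexes.setdefault(name, len(indexes))

    code = []
    for line in source.strip().splitlines():
        target, expr = [part.strip() for part in line.split("=")]
        if "+" in expr:
            left, right = [part.strip() for part in expr.split("+")]
            code.append(("load", slot(left)))     # temp <- mem[left]
            code.append(("add", slot(right)))     # temp <- temp + mem[right]
            code.append(("store", slot(target)))  # mem[target] <- temp
        else:
            code.append(("move", slot(target), int(expr)))
    return code

print(compile_simple("a = 5\nb = 3\nc = a + b"))
# [('move', 0, 5), ('move', 1, 3), ('load', 0), ('add', 1), ('store', 2)]
```

The names a, b, and c are gone from the output; only indexes 0, 1, and 2 remain.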

I’ll gloss over the print function a little bit; it isn’t so important how it works. In brief: our compiler will have to emit code which takes a value, formats it as a string, and then somehow copies it to the display.

Our compiler translates source programs into something that runs on the target machine. Unlike with the interpreter, our source code is essentially gone at this point. The Simple language and its abstract machine are no longer relevant: we now have a program coded for the target machine. Despite this translation, the program still prints 8, exactly as the rules of the abstract machine said it would.

As-If

While the language specification has to dictate a lot of rules about how the abstract machine works, not all of these rules are directly visible to the user. Only a handful of the behaviours will lead to something being written on the screen, to the disk, or some other peripheral. These are what are known as “observable effects”.

Something curious happens here. If there is no way to determine how the program was processed other than these observable effects, these effects themselves become the only relevant result. So long as these results are correct, it doesn’t matter how they were obtained.

This is often called the “as if” rule. A compiler, or interpreter, can do whatever it wants provided the observable effects are “as if” the program were run on the abstract machine. This is very similar to any library you might be using. You expect the functions to produce a particular result, but don’t care too much about how they get to that result.
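CPython itself gives a small demonstration of the as-if rule: it folds constant expressions at compile time, because no observable effect can distinguish folding from doing the addition at run time.

```python
# CPython constant-folds "5 + 3" while compiling: the 8 lands directly
# in the code object's constants table, and no addition runs later.
code = compile("x = 5 + 3", "<example>", "exec")
print(8 in code.co_consts)   # True on CPython
```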

The “as-if” rule is part of the reason why abstract machines are “abstract”, or lacking in fine details. The more concrete a specification is, the harder it becomes to provide the appropriate behaviour. The mechanisms for memory and execution are left flexible so that an efficient implementation can be provided.

The standards of languages like C and C++ often mention “undefined behaviour”. This behaviour results from source code that, while syntactically correct, does something not supported by the abstract machine, such as accessing unallocated memory. Instead of requiring this to result in an error, these languages say the behaviour is undefined. A compiler can then focus on efficiently implementing the things that should work. In the interest of safety and security, many newer languages avoid any type of undefined behaviour, requiring runtime errors in such scenarios.

The Intermediate Language

The compiler and the interpreter presented above are ideal situations and don’t quite apply to most general-purpose languages anymore. Trying to convert C++ code directly to a target machine is rather difficult. Trying to interpret Python source directly at run time would be too slow. Language tools introduce a new intermediate language to address these problems.

The intermediate language is a proper language on its own. It has its own abstract machine and instructions which control it. Unlike a normal language it is usually more generic and lower level. The instructions are used as fundamental building blocks to express the higher level language.

The intermediate language doesn’t have a syntax in the traditional sense. The language exists either as a series of in-memory structures or in a serial binary form (otherwise known as byte-code). This makes it compact and easy to use directly without further parsing. It’s why we often call this an intermediate “representation” (IR) instead of a “language”.

For debugging and practical purposes, many IRs have a text form. It won’t look much like the source language, and may resemble assembly language at times. Though possible, you generally wouldn’t want to write code in this form.

Intermediate languages vary in how far removed they are from the underlying hardware. In some cases the IR is really nothing more than a pre-parsed binary form of the source language. In other cases it may already be converted into a generic register model with a heap and a stack, expecting to be run on a machine that offers these features. Even in the latter case it still has to be compiled, or interpreted, before it can run on the final chip. The intermediate language is still defined in terms of an abstract machine.

This is why a language like Python is only “kind of” interpreted. Its CPython implementation is definitely interpreting your program, but not exactly as you originally wrote it. It first compiles the source into bytecode and then runs that bytecode. The “.pyc” files are cached copies of that bytecode, so the source doesn’t have to be compiled each time. This bytecode is an intermediate language.
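You can peek at this intermediate language with the standard dis module. The exact instruction names vary between CPython versions, so don’t expect identical output everywhere.

```python
import dis

def add(a, b):
    return a + b

# Print the bytecode CPython will actually interpret for this function.
# (Instruction names differ by version, e.g. BINARY_ADD vs. BINARY_OP.)
dis.dis(add)
```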

Some abstract machines have become full-fledged platforms in their own right. .NET’s Common Language Infrastructure is a standardized definition of a generic machine suitable for use by a variety of languages. While the primary users of this machine are the Microsoft languages like C# and F#, a multitude of other languages can be compiled for it, including C++.

Conclusion

Abstract machines were created as a way to define what programs do independent of hardware. Specifications then describe their language in terms of this machine. This allows a program to run on a variety of platforms. For many languages, like C, an implementation of the abstract machine usually doesn’t exist at run time: programs are compiled directly for a target system. For other languages, like Java, code is compiled to bytecode to be run on a virtual machine (an implementation of the abstract machine).

Languages themselves usually don’t say how they should be executed. You can compile C to Java bytecode, or compile Java to a target machine. An abstract machine provides portability and flexibility.

From the source language through the abstract machines to the actual hardware, there is a lot of variety in what happens. The next time you are writing some code, take a moment to think about what your code means at the level of the abstract machine. Which intermediate language does your code get represented as? Is the final form run by an interpreter, or is it compiled directly for the target platform?

Addendum: C Example

Even so-called low-level languages like C provide a significant abstraction from the underlying hardware. Consider a simple case of a function that adds two numbers together.

int add( int a, int b )
{
    return a + b;
}

You as the programmer can think strictly in terms of the abstract machine and just add the two numbers together. In this particular code you don’t have to worry about where these values are stored, or what the specific chip instructions are for adding values. The C specification says that a function can receive parameters and return values, and that expressions can be evaluated. It doesn’t say exactly how this is done.

The loose definition of the abstract machine is what allows compilers to produce optimized code. Consider the below disassembly of the above function as optimized by GCC for x86_64. As much as you might expect a function to use the stack for parameters, this one doesn’t: GCC is using registers to pass the parameters. You might also note that it doesn’t appear to use an add instruction: the lea instruction is actually a memory-indexing operation.

lea    (%rsi,%rdi,1),%eax
retq

If you were to call this function in actual code, the function call itself would probably not be made. The call would just be inlined and the function would no longer exist at runtime. If you were to print the result of this function, however, the results would always be the same. That is what the abstract machine guarantees: the value printed has to be the result “as if” the function were called with two parameters.

The GCC compiler doesn’t convert the C code directly into the assembly code. It uses a series of passes and intermediate languages. Its first intermediate form is called GENERIC. From there it is lowered to GIMPLE, at which point a lot of the optimization is done. Only in one of the last steps does it leave the realm of the abstract and become assembly for the target platform.

While GCC does all this using an in-memory form, Clang’s backend, LLVM, defines an actual byte-code (its “bitcode”). Typically this byte-code is, as happens in GCC, compiled to a program for the target platform. But LLVM also offers a way to store this binary form and finish compilation at run time via a JIT compiler. The LLVM intermediate language is promoted as a generic system, encouraging people to write their own languages using this abstract machine.

Please join me on Discord to discuss, or ping me on Mastodon.
