Abstract machines are the core of what defines a programming language. A wide variety of operating systems and a plethora of hardware ensure a bewildering number of things we call computers. When we write a program we generally don’t want to think about the differences between all these platforms. We’d like to be abstracted from all that and simply worry about our problem domain. To do so we choose a portable programming language: something which doesn’t care about what hardware it is running on.

But if we don’t say what machine our program is running on, how do we know what the program is actually doing? A program is nothing more than a series of instructions; they have to control something. To give them something to control, yet avoid dependence on a specific piece of hardware, we invent a new machine. We define the characteristics of this machine and then describe the language as a way to manipulate that machine.

This machine is called the abstract machine, or sometimes the virtual machine. The C++ specification explicitly mentions this machine, as does the Java standard. Python introduces this machine under the term “core semantics”. In academia both Turing machines and the lambda calculus are examples of abstract machines. Not all of these are described explicitly as machines, but rather as a series of behaviours which together define that machine.

Simple Example

Let’s say we have a language called “Simple” which supports the following construct.

a = 5
b = 3

The specification simply says this stores the value 5 in a variable called “a” and the value 3 in “b”. Later we have the following code.

c = a + b

The specification for Simple defines the above code to: load the values of “a” and “b”; add those values together; and store the result in “c”. To finish it up we write this code.

print c

This prints the value stored in “c”. In our case this should print “8”.

These rules describe the basic mechanisms of our abstract machine. The machine has a way to store a value with a name and retrieve that value at a later time. It can perform simple addition and store the result. It must also be able to display values to the user. It doesn’t say how any of this is done. There is no mention of a stack, a CPU, an operating system, nor any notion of compilation. None of that matters so long as the value “8” ends up being printed.

An Interpreter

If we wished to use the Simple language we’d have to create an implementation of its abstract machine. If we take the approach of an interpreter we will be creating a program which essentially becomes the abstract machine. That is, in some other language we’ll write a program which parses Simple code and processes it. Let’s take a quick look at what this program would have to do.

Storing variables can be done with a map. The name of the variable is the key, and we’ll assume the values are integers. Assigning to a variable means storing a value under a given key in the map. Loading a variable means reading the value for a particular key from the map. This is the Simple memory model.

To process the code we read the source file and parse each line. Depending on what we parse we call one of our handling functions. An assignment can be handled simply by writing to the memory map. If we encounter an addition we get the two values from the map, add them, and store the result. For the print statement we’ll call a print function from our implementation language.
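Here is a minimal sketch of such an interpreter, written in C++. This is purely illustrative: the run_line function and the line-at-a-time parsing are my own invention, and it handles only the exact statement forms shown above.

#include <iostream>
#include <map>
#include <sstream>
#include <string>

std::map<std::string, int> memory; // the Simple memory model: name -> value

// Handle one line of Simple code: "x = 5", "c = a + b", or "print c".
void run_line(const std::string& line)
{
	std::istringstream in(line);
	std::string first, second;
	in >> first >> second;
	if (first == "print") {
		std::cout << memory[second] << "\n"; // an observable effect
	} else if (second == "=") {
		std::string left, op, right;
		in >> left >> op >> right;
		if (op == "+") // addition: load both values, add, store the result
			memory[first] = memory[left] + memory[right];
		else // plain assignment of a constant
			memory[first] = std::stoi(left);
	}
}

int main()
{
	run_line("a = 5");
	run_line("b = 3");
	run_line("c = a + b");
	run_line("print c"); // prints 8
}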

Once we’ve written our interpreter we should be able to feed the above code to it and it will display the result “8”.

What we’ve actually done is create the abstract machine. That is, we’ve taken the rules of the Simple language and created a program which understands them. This interpreter is capable of running any Simple program. Languages like Perl or Python kind of take this approach — we’ll get back to why only “kind of”.

A Compiler

Suppose we want to run our program directly on an actual chip, without an interpreter. Obviously we can’t change how the chip works, thus we can’t create our abstract machine directly. Instead we’ll translate our code into new code for that machine. This is what a compiler does.

The first thing we need to translate is the memory model. The only way the chip refers to memory is by index. Thus to work on the chip we need to convert all of our names into indexes: simply enumerate all the names used in the program and give them a sequence number. Anywhere a name is used we can use its index instead. Assignment can then be translated into a machine “move” instruction which assigns a value to memory.

Chances are our chip doesn’t have an “add” instruction which works directly with memory indexes. Instead we will have to further “move” the data into a temporary location, issue the “add” instruction, and “move” the result back to the target index.
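Here is a rough sketch of that translation step, again in C++. The MOVE and ADD mnemonics, and the slot_for helper, are invented for illustration; a real compiler would emit actual chip instructions. The interesting parts are the enumeration of names into indexes and the extra moves through a temporary location.

#include <cstdio>
#include <map>
#include <string>

std::map<std::string, int> slot; // name -> memory index

int slot_for(const std::string& name)
{
	auto it = slot.find(name);
	if (it != slot.end())
		return it->second;
	int index = (int)slot.size(); // enumerate names in order of appearance
	slot[name] = index;
	return index;
}

// Emit code for "target = left + right" on a chip whose add
// only works on a temporary location, never directly on memory.
void emit_add(const std::string& target, const std::string& left, const std::string& right)
{
	std::printf("MOVE mem[%d] -> tmp\n", slot_for(left));
	std::printf("ADD  mem[%d] -> tmp\n", slot_for(right));
	std::printf("MOVE tmp -> mem[%d]\n", slot_for(target));
}

int main()
{
	std::printf("MOVE 5 -> mem[%d]\n", slot_for("a")); // a = 5
	std::printf("MOVE 3 -> mem[%d]\n", slot_for("b")); // b = 3
	emit_add("c", "a", "b"); // c = a + b
}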

I’ll gloss over the print function a little, since it isn’t so important how it works. In brief: our compiler will have to emit code which takes a value, formats it as a string, and then somehow copies it to the display.

We can now use our compiler to translate our source program into something that runs on the target machine. You should notice something very different here compared to the interpreter. The result of compilation is not a generic abstract machine, but only the specific program we gave the compiler. Instead of running the Simple code, we’ve translated it into code which should produce the same result.

As-If

While the language specification has to dictate a lot of rules about how the abstract machine works, not all of these rules are directly visible to the user. Only a handful of the behaviours will actually lead to something being written on the screen, to the disk, or some other peripheral. These are what are known as “observable effects”.

Something very curious happens here. If there is no way to determine how the program was processed other than these observable effects, these effects themselves become the only relevant result. So long as these results are correct it doesn’t really matter how they were obtained.

This is often called the “as if” rule. A compiler, or interpreter, can do whatever it wants provided the observable effects are “as if” the program were run on the abstract machine. This is very similar to any library you might be using. You expect the functions to produce a particular result, but don’t care too much about how they get to that result.
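For example, nothing in the rules stops a Simple implementation from reducing our entire program to its single observable effect. A translation that behaves like this little C++ program would be perfectly conforming:

#include <iostream>

int main()
{
	// the only observable effect of: a = 5, b = 3, c = a + b, print c
	std::cout << 8 << "\n";
}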

This is part of the reason why abstract machines are left so incomplete in their definition. The more concrete a definition is, the harder it becomes to provide the appropriate behaviour. Therefore the mechanisms for memory and execution are left very flexible so that an efficient implementation can be provided.

The Intermediate Language

The compiler and the interpreter presented above are somewhat idealized situations, ones that never quite apply to most general purpose languages anymore. Trying to convert C++ code directly to a target machine is a bit difficult. Trying to interpret Python source directly at run time would be too slow. So most language tools introduce a new intermediate language to address this problem.

This intermediate language is usually a proper language on its own. It has its own abstract machine and instructions which control it. It differs in two key aspects. First, it is usually syntax-free in the traditional sense: the language exists either as a series of in-memory structures or as a serial binary form (otherwise known as byte-code). This makes it compact and easy to use directly without further parsing. Second, it is usually more generic and possibly lower level: the instructions are tailored to the high-level language, but many constructs will require conversion into more fundamental building blocks.
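To make that concrete, a byte-code for Simple might encode our program as something like the following. This listing is purely invented: a stack-style form where the names have already become numeric slots and each operation is one of those fundamental building blocks.

push  5
store 0      ; a = 5
push  3
store 1      ; b = 3
load  0
load  1
add          ; the building block behind "a + b"
store 2      ; c
load  2
print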

Intermediate languages vary in how far removed they are from the underlying hardware. In some cases it is really nothing more than a preprocessed binary form of the source language. In other cases it may already be converted into a generic register model with a heap and stack, expecting to be run on a machine that offers these features. Even in the latter case it still has to be compiled, or interpreted, before it can run on the final chip. That is, the intermediate language is still defined in terms of an abstract machine.

This is why a language like Python is only “kind of” interpreted. Its CPython implementation is definitely interpreting your program, but not exactly as you originally wrote it. It first compiles the source into byte-code and then runs that byte-code. The “.pyc” files are simply cached copies of that byte-code so it doesn’t have to compile the source each time. This byte-code is an intermediate language.

Some abstract machines have become full-fledged platforms in their own right. .Net’s Common Language Infrastructure is a standardized definition of a generic machine suitable for use by a variety of languages. While the primary users of this machine are the Microsoft languages like C# and F#, a multitude of other languages can be compiled for it, including C++.

Conclusion

Abstract machines were created as a way to define what programs do independent of hardware. Specifications then describe their language in terms of this virtual machine. This allows a program to run on a variety of platforms. For many languages, like C, this abstract machine usually doesn’t exist at run time: programs are compiled directly for a target system. For other languages, like Java, code is compiled to byte-code to be run directly by an implementation of the abstract machine. The language itself doesn’t say how it will run, though. You can actually compile C to Java byte-code, and compile Java for a target machine.

From the source language through the abstract machines to the actual hardware, there is a lot of variety in what actually happens. The next time you are writing some code, take the time to think about what you are actually writing. What does your code mean at the level of the abstract machine? Which intermediate language does your code get represented as? Is the final form run by an interpreter, or is it compiled directly for the target platform?

Addendum: C Example

Even so-called low-level languages like C provide a significant abstraction from the underlying hardware. Consider a very simple case of a function that adds two numbers together.

int add( int a, int b )
{
	return a + b;
}

You as the programmer can think strictly in terms of the abstract machine and just add the two numbers together. In this particular code you don’t have to worry about where these values are stored, or what the specific chip instructions are for adding values. The C specification says that a function can receive parameters and return values, and that expressions can be evaluated. It doesn’t say exactly how this is done.

This loose definition of the abstract machine is what allows compilers to produce optimized code. Consider the below disassembly of the above function as optimized by GCC for x86_64. As much as you might expect a function to use the stack for parameters, this one doesn’t: GCC is using registers to pass the parameters. You might also note that it doesn’t appear to use an “add” instruction: the “lea” instruction is actually a memory-address calculation being used to do the addition.

lea    (%rsi,%rdi,1),%eax
retq

If you were to call this function in actual code, the function call itself would probably not be made. It would just be inlined and the function would no longer exist at runtime. If you were to print the result of this function, however, the result would always be the same. That is what the abstract machine guarantees: the value printed has to be “as if” the function were called with the two parameters.
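To see this in action, consider a complete program using the function, compiled with optimization enabled (say “gcc -O2”; the exact output varies by version and flags). GCC will typically inline the call and fold the arithmetic, leaving nothing but the code that prints “5”.

#include <stdio.h>

/* static, so the compiler is free to drop the standalone function entirely */
static int add( int a, int b )
{
	return a + b;
}

int main( void )
{
	/* likely compiled to the equivalent of printf( "%d\n", 5 ) */
	printf( "%d\n", add( 2, 3 ) );
	return 0;
}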

The GCC compiler doesn’t convert the C code directly into the assembly code. It uses a series of passes and intermediate languages. Its first intermediate form is called Generic. From there it is reduced to Gimple, at which point a lot of the optimization is done. Only in one of the last steps does it leave the realm of the abstract and become assembly for the target platform.

While GCC does all this using an in-memory form, the Clang compiler uses LLVM, which defines an actual byte-code. Typically this byte-code is, as in GCC, compiled to a program for the target platform, but LLVM also offers a way to store this binary form and finish compilation at run time via a JIT compiler. The LLVM intermediate language is promoted as a generic system, encouraging people to write their own languages using this abstract machine.
