Why do we use intermediate representations / languages?


Compiling doesn’t go directly from source code to final machine code. After doing semantic analysis and type checking, the compiler emits an intermediate representation (IR), also known as an intermediate language (IL). This in-between form of the code provides flexibility: it allows a compiler to support multiple source languages as well as multiple target machines.

I’ll start by talking about IRs as they’re used in system compilers, but the concept is much broader than that. You might be familiar with Microsoft’s CIL, the Common Intermediate Language, which is what C# compiles to. Or perhaps you’ve used a game console emulator that runs old video games. We’ll see later how the lines get blurry.

I have found no sources that identify a clear difference between an intermediate “language” and an intermediate “representation”. It seems entirely at the whims of whoever wrote the toolchain for a particular platform. Most references use the terms interchangeably, or describe them in other terms. As we’ll see later, the lines between high-level code, machine code, and IRs are blurry as well. There are definitely different classes, or purposes, but the terms “language” and “representation” don’t distinguish them.

Flexibility to machine code

When compiling to machine code, intermediate representations make it easier to support multiple backends, allow different language frontends, and support universal optimization.

Backend

The backend case is perhaps the easiest to understand. Say we want to compile some Fortran code for an x86-64 machine. Roughly, the steps we’d take are:

  • Parse the Fortran code into an abstract syntax tree (AST)
  • Optimize the AST
  • Walk the AST and emit x86-64 instructions

Now say we want to compile for ARMv8. We’d notice that most of the work is the same, so we replace only the last step: we walk the AST and emit ARMv8 instructions.
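
To make the duplication concrete, here’s a toy sketch in Python. The AST is just nested tuples of additions, and all the names are hypothetical; what matters is how similar the two emitters are:

```python
# Two hand-written emitters for the same toy AST of additions.
# Note how similar the recursion is; only the instruction names differ.

def emit_x86_64(node):
    if isinstance(node, int):                 # a literal leaf
        return [f"mov rax, {node}"]
    left, right = node                        # an addition: (left, right)
    code = emit_x86_64(left)
    code.append("push rax")                   # save the left result
    code += emit_x86_64(right)
    code += ["pop rbx", "add rax, rbx"]
    return code

def emit_armv8(node):
    if isinstance(node, int):
        return [f"mov x0, #{node}"]
    left, right = node
    code = emit_armv8(left)
    code.append("str x0, [sp, #-16]!")        # push
    code += emit_armv8(right)
    code += ["ldr x1, [sp], #16",             # pop
             "add x0, x0, x1"]
    return code

ast = ((1, 2), 3)                             # (1 + 2) + 3
print("\n".join(emit_x86_64(ast)))
print("\n".join(emit_armv8(ast)))
```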

There are two things we notice:

  • This is hard: it’s overly complex to convert from an AST to machine code, and a lot of the code to emit x86-64 and ARMv8 instructions is quite similar.
  • The optimizations are wrong: that “Optimize the AST” step assumed the target was x86-64.

For the hard part, we notice that most CPU instruction sets are quite low-level and nowhere near the level of expression the AST allows. If our language were only as complex as 1970s-era C, then perhaps it wouldn’t be too bad. But languages have added complex features, and the gap between the AST and machine code has grown significantly. We need a simplification step: something that takes our AST, produces a new form, and then emits the machine code from that.
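
A rough sketch of what that simplification might look like, flattening the nested AST from before into a linear, three-address-style IR (the opcodes here are hypothetical):

```python
# Lower the nested AST into a flat list of simple IR instructions,
# each operating on named temporaries: much closer to machine code.
from itertools import count

_temps = count(1)

def lower(node, out):
    """Lower a node into `out`; return the temporary holding its value."""
    if isinstance(node, int):
        tmp = f"t{next(_temps)}"
        out.append(("const", tmp, node))
        return tmp
    left, right = node
    a = lower(left, out)
    b = lower(right, out)
    tmp = f"t{next(_temps)}"
    out.append(("add", tmp, a, b))
    return tmp

ir = []
lower(((1, 2), 3), ir)
for instr in ir:
    print(instr)
# ('const', 't1', 1)
# ('const', 't2', 2)
# ('add', 't3', 't1', 't2')
# ('const', 't4', 3)
# ('add', 't5', 't3', 't4')
```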

On the optimization side, we see that we have tied our AST optimization too closely to the target architecture. That prevents us from reusing those optimizations, even if they are similar between platforms. Also, the AST is still quite high level and some optimizations are difficult.

So, combining these two bits of knowledge, we produce a new set of steps to compile the code:

  • Parse to an AST
  • Restructure the AST
  • Simplify the AST to an intermediate representation (IR)
  • Optimize the IR
  • Process the IR into machine instructions

The last processing step here is much simpler than the emission step we had before. This makes it a lot easier to plug in either our x86-64 or ARMv8 emitter.
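
With the IR in that flat form, each emitter reduces to a near-mechanical, instruction-by-instruction translation, and the backends become pluggable. A minimal sketch, continuing the hypothetical toy IR (register allocation is ignored; temporaries stand in for registers):

```python
# Each backend is now just a table of instruction templates plus one
# shared loop, so plugging in a new target means adding a table.

X86_64 = {
    "const": "mov {0}, {1}",
    "add":   "lea {0}, [{1} + {2}]",   # x86 can add via an address calculation
}
ARMV8 = {
    "const": "mov {0}, #{1}",
    "add":   "add {0}, {1}, {2}",
}

def emit(ir, templates):
    return [templates[op].format(*args) for op, *args in ir]

ir = [("const", "t1", 1), ("const", "t2", 2), ("add", "t3", "t1", "t2")]
for target in (X86_64, ARMV8):
    print("\n".join(emit(ir, target)), "\n")
```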

For this discussion, I’m going to ignore the assembly stage, which pulls together all the bits of machine code with libraries to produce an executable file. Some toolchains may split this into separate assembly and linking stages. If you were to implement an emitter, it would matter to you, but it doesn’t alter the discussion here.

Application Binary Interfaces

The IR and the AST might still have machine-specific things in them, as many system languages allow constructs that are tied to a specific machine. While parsing and processing the AST, the compiler may already use machine-specific information. This is effectively akin to defining new constants or adding compiler flags. It doesn’t affect the overall compiler architecture much, but the generated IR is not platform independent.

We also have to consider the operating system. Kernels and system libraries have specific requirements for the code that works with them. These include calling conventions, exception setup, memory management, and a host of other things. All of this makes up the application binary interface (ABI) of that platform.

Code emitted for x86-64 running Linux looks different from code emitted for Windows running on the same machine. This is another reason to want an IR, where the final emitter takes care of the details. These emitters also tend to be modular: we can mix and match operating systems and machines in a pluggable manner (at least, that’s the goal).
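
As one concrete ABI difference: on the same x86-64 hardware, the System V ABI used by Linux passes the first integer arguments of a call in different registers than the Microsoft x64 convention used by Windows. The register lists below are real; the emitter wrapped around them is a hypothetical sketch:

```python
# The same machine, two ABIs: argument registers differ between
# Linux (System V AMD64) and Windows (Microsoft x64).

ARG_REGISTERS = {
    "linux":   ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],  # System V AMD64
    "windows": ["rcx", "rdx", "r8", "r9"],                # Microsoft x64
}

def emit_call(func, args, os):
    regs = ARG_REGISTERS[os]
    code = [f"mov {reg}, {arg}" for reg, arg in zip(regs, args)]
    code.append(f"call {func}")
    return code

print(emit_call("some_func", [1, 2, 3], "linux"))    # rdi, rsi, rdx
print(emit_call("some_func", [1, 2, 3], "windows"))  # rcx, rdx, r8
```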

Frontend

Looking at our compilation steps, we make an interesting realization. If we want to support a new language, say C++, all we need to do is produce the IR. We only need to rewrite the initial steps, from code -> AST -> IR.

We’re going to hit a problem. We will find our IR can’t represent some feature of our language, such as C++’s exception mechanism. Or we come across a feature, like virtual table dispatch, which we can’t efficiently express, and as a result, we emit suboptimal machine code.

Architecturally, this isn’t a major hurdle. We can adjust the IR to accommodate our newly added language. The IR becomes a superset of features needed by all the languages it supports. As we have only a few major paradigms for languages, they share most of the same requirements. Thus, one IR can support numerous languages without too many specialty functions.
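
To illustrate, here are two toy “frontends” for different surface syntaxes, both lowering to the same hypothetical IR used in the earlier sketches:

```python
# Two toy frontends, one per "language", producing identical IR.

def frontend_infix(src):
    """Parse e.g. '1 + 2 + 3' (left to right, additions only)."""
    tokens = src.split()
    ir, tmp = [("const", "t1", int(tokens[0]))], 1
    for num in tokens[2::2]:                  # skip the '+' tokens
        ir.append(("const", f"t{tmp + 1}", int(num)))
        ir.append(("add", f"t{tmp + 2}", f"t{tmp}", f"t{tmp + 1}"))
        tmp += 2
    return ir

def frontend_postfix(src):
    """Parse e.g. '1 2 + 3 +' using a stack of temporaries."""
    ir, stack, tmp = [], [], 0
    for token in src.split():
        tmp += 1
        if token == "+":
            b, a = stack.pop(), stack.pop()
            ir.append(("add", f"t{tmp}", a, b))
        else:
            ir.append(("const", f"t{tmp}", int(token)))
        stack.append(f"t{tmp}")
    return ir

# Different languages, same IR:
print(frontend_infix("1 + 2 + 3") == frontend_postfix("1 2 + 3 +"))  # True
```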

I fear a bit that these universal IRs also influence language design such that we shy away from features that are hard to implement in existing IRs.

We also have to adjust all the machine code emitters to support these new constructs. They don’t need to be universal though: not all instruction sets need to support all IR features. This is fine for compilers that emit machine code: coders will know the limitations, and they can configure the compiler to reject such code, making it easy to spot. The limitations may involve specific data types, trapping mechanisms, or simply constraints on registers and memory.

Note that reducing a language to an IR is hard. When I worked on Leaf, a direct conversion was causing issues, so I added an extra IR layer: I first emitted a Leaf IR, then converted that into LLVM IR.

Optimization

The IR provides a good place for universal optimizations: we don’t need to write a new optimizer for every language. It also allows optimizers to be written by coders who aren’t involved in the frontend parsers. The whole segmented architecture allows more coders to be involved in the various stages.
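
For instance, a constant-folding pass can be written once against the IR and benefit every language that targets it. A sketch over the same toy IR (real optimizers work on much richer structures):

```python
# Constant folding over the toy IR: any 'add' whose inputs are known
# constants is replaced by a constant. No source language in sight.

def fold_constants(ir):
    known, out = {}, []
    for op, dest, *args in ir:
        if op == "const":
            known[dest] = args[0]
            out.append((op, dest, args[0]))
        elif op == "add" and all(a in known for a in args):
            known[dest] = known[args[0]] + known[args[1]]
            out.append(("const", dest, known[dest]))
        else:
            out.append((op, dest, *args))
    return out

ir = [("const", "t1", 1), ("const", "t2", 2), ("add", "t3", "t1", "t2")]
print(fold_constants(ir))
# [('const', 't1', 1), ('const', 't2', 2), ('const', 't3', 3)]
# A later dead-code pass would remove the now-unused t1 and t2.
```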

One issue with this generalized optimization is that we can miss out on language-specific optimizations. The IR is never truly a superset of all the required features of all the supported languages. Some of those languages will have features the IR can’t express well, yet which could have efficient machine code output. This creates a constant tension between accepting these limitations and adding new features to the IR — which in mature compilers can be costly, since it may involve changing many optimizers and all the emitters.

The large number of shared optimizations can help make up for this: while you lose optimizations in some places, you gain many in others. It’s also somewhat subjective what counts as optimal.

Virtual Machines

Virtual machines (VMs), used by languages such as Java and C#, also use an intermediate language. The compilers work much the same as described before, but instead of emitting machine code, they emit the intermediate representation, and the virtual machine then works with this form. When you download a Java or C# program, you’re downloading the IR.

There is a critical difference between the IR produced for virtual machines and the IR produced as a step to emitting machine code: the VM-IR needs to be platform independent. The purpose of Java’s VM, or C#’s CLR, is to take one program and let it run on multiple platforms. Therefore, the IR has to be compatible with all the platforms on which it runs.

This is like picking one particular OS and machine combination, then ensuring all IR is compatible with that platform. The platform is the one the virtual machine provides: it becomes the abstract machine.

Because this target machine is more concrete, we often find higher-level functionality in the IR. Java’s IR, for example, has a way to allocate full objects and arrays. In a systems language, by contrast, the IR might know nothing about the heap, leaving you to call C library functions on your own.
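
A sketch of that difference: a toy stack-machine VM whose instruction set includes a high-level array-allocation operation, managed entirely by the VM’s own heap. The machine and its opcodes are hypothetical:

```python
# A toy stack-based VM. Note the high-level "new_array" instruction: the
# VM owns the heap, so the IR never sees raw memory or a malloc call.

def run(program):
    stack, heap = [], []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "dup":
            stack.append(stack[-1])
        elif op == "new_array":                 # pop size, push a reference
            heap.append([0] * stack.pop())
            stack.append(len(heap) - 1)
        elif op == "store_elem":                # pop value, index, reference
            value, index, ref = stack.pop(), stack.pop(), stack.pop()
            heap[ref][index] = value
        elif op == "print":
            print(heap[stack.pop()])

run([
    ("push", 3), ("new_array",),                # allocate a 3-element array
    ("dup",), ("push", 0), ("push", 42), ("store_elem",),
    ("print",),                                 # [42, 0, 0]
])
```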

This is why the C library is a key part of most operating systems: not because programs are written in C, but because compilers use the standard set of features it provides to fill in the functionality that the IR doesn’t.

Just-in-time compilation

Even in an optimized form, running IR directly is not as fast as running machine code. For this reason, VMs convert parts of the IR into machine code. The VM itself is fulfilling that last step of machine code compilation. As this happens at runtime, the VM knows the operating system and host and can emit the correct machine code.

We call this just-in-time (JIT) compilation, a name that is only partially accurate. The just-in-time part refers to emitting machine code only on the end-user’s machine. It also means that we don’t need to convert all the code; less-used code can still run on the VM. It’s just-in-time and only-as-needed.

The compilation aspect is a bit, let’s say, nuanced. Most of the compilation work has already been done to get the IR. Emitting to machine code is a smaller part of the overall structure, and in the VM it’s both emitting machine code and doing the assembly and linking work — connecting it directly to the libraries it needs. Perhaps just-in-time assembling would be more accurate.
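
A toy sketch of that shape, with hypothetical IR and threshold values. The “compiled” form here is Python source generated at runtime rather than machine code, but the hot-counting structure is the same:

```python
# Interpret IR on a slow path, count calls, and swap in a "compiled"
# version once a function gets hot. A real VM would emit host machine
# code; here the compiled form is just generated Python.

HOT_THRESHOLD = 3

def interpret(ir, value):
    for op, operand in ir:                    # the slow path
        if op == "add":
            value += operand
        elif op == "mul":
            value *= operand
    return value

def compile_function(name, ir):
    lines = [f"    value {'+' if op == 'add' else '*'}= {operand}"
             for op, operand in ir]
    source = f"def {name}(value):\n" + "\n".join(lines) + "\n    return value\n"
    namespace = {}
    exec(source, namespace)                   # "assemble and link" at runtime
    return namespace[name]

class Function:
    def __init__(self, name, ir):
        self.name, self.ir = name, ir
        self.calls, self.compiled = 0, None

    def __call__(self, value):
        self.calls += 1
        if self.compiled is None and self.calls >= HOT_THRESHOLD:
            self.compiled = compile_function(self.name, self.ir)
        if self.compiled is not None:
            return self.compiled(value)
        return interpret(self.ir, value)

f = Function("scale_and_shift", [("mul", 3), ("add", 1)])
print([f(n) for n in range(5)])  # early calls interpreted, later ones compiled
```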

A blurry soup of intermediates

C as an intermediate language

Technically, you can use any language as an intermediate representation. For example, it wasn’t uncommon for compilers to emit C code and let a C compiler do the rest. C compilers were, and still are, nearly universal across operating systems and hardware. If we already have a C compiler for a platform, we can use that instead of writing our own machine code emitter.

This became less common, and less necessary, as compiler toolchains matured. Developers still use C compilers for bootstrapping: compiling the initial version of a new language’s compiler. Even that isn’t necessary on many systems now that cross-compilation is well supported.

It can still be useful if you are writing your own domain-specific language (DSL). Fully processing an AST and converting it to an IR is hard; if your language looks close enough to an existing one, emitting that language instead can save you a lot of work.
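
A sketch of what that looks like for a tiny toy DSL (the nested additions again): rather than writing an emitter and worrying about ABIs, print C source and hand it to any C compiler:

```python
# A toy DSL that compiles by emitting C source, letting an existing
# C compiler handle optimization and machine code generation.

def to_c_expr(node):
    if isinstance(node, int):
        return str(node)
    left, right = node
    return f"({to_c_expr(left)} + {to_c_expr(right)})"

def emit_c_program(node):
    return (
        "#include <stdio.h>\n"
        "int main(void) {\n"
        f'    printf("%d\\n", {to_c_expr(node)});\n'
        "    return 0;\n"
        "}\n"
    )

print(emit_c_program(((1, 2), 3)))
# Write the output to prog.c, then: cc prog.c && ./a.out
```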

Machine code as an intermediate language

Machine emulators, such as a Nintendo emulator, take code that was actually compiled to machine code and treat it as an IR. Games written for the Nintendo were compiled to its machine code and could run only on that hardware. An emulator implements an abstract machine that matches the Nintendo and executes the game’s machine code in a virtual machine.
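
At its core, an emulator is a fetch-decode-execute loop over the original machine code. A heavily simplified sketch with a made-up two-register CPU and byte format; a real emulator models an actual processor, its memory map, and its devices:

```python
# The heart of an emulator: fetch an opcode, decode it, execute it.

def emulate(memory):
    registers = [0, 0]                         # a pretend two-register CPU
    pc = 0                                     # program counter
    while pc < len(memory):
        opcode = memory[pc]                    # fetch
        if opcode == 0x01:                     # decode: LOAD reg, value
            registers[memory[pc + 1]] = memory[pc + 2]
            pc += 3
        elif opcode == 0x02:                   # ADD: r0 += r1
            registers[0] += registers[1]
            pc += 1
        elif opcode == 0xFF:                   # HALT
            break
    return registers

# LOAD r0, 5; LOAD r1, 7; ADD; HALT
print(emulate(bytes([0x01, 0x00, 0x05, 0x01, 0x01, 0x07, 0x02, 0xFF])))
```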

Once there was a company called Transmeta that treated x86 machine code as IR on their physical hardware. Their CPU converted x86 code into a simpler native instruction set.

Though machine code fundamentally works as an IR, using it for this purpose can be problematic. Subtle differences in machine architectures can cause hefty performance penalties. Machine code also lacks a lot of the information that an IR contains, making it difficult to properly optimize for the new target architecture. Emulators need to worry about the specific behaviour of the old machine, rather than the intended behaviour of the application code.

Python and more

I want to point out Python, as its model sits somewhere between full compilation and VM languages. Python is distributed as source code, not as IR. But the first thing Python does with that source code is produce its own IR, stored as “.pyc” files. It doesn’t have a just-in-time compiler, instead executing everything on its VM. At least, the standard distribution works that way; the PyPy implementation uses a JIT.
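
You can see this form directly with the standard library: the dis module disassembles a function into the bytecode instructions CPython’s VM executes, and `python -m py_compile` writes that same form out as a `.pyc` file:

```python
# Show the IR the CPython VM actually runs for a small function.
import dis

def scale_and_shift(x):
    return x * 3 + 1

dis.dis(scale_and_shift)
# Prints stack-machine instructions such as LOAD_FAST and RETURN_VALUE.
# `python -m py_compile module.py` caches this form in __pycache__/*.pyc.
```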

On that note, there are also compilers that take CLR and Java IRs and emit machine code. Some of them take the IR and emit WebAssembly IR instead, then distribute that to the browser. IRs are truly intermediate: the moment you make a successful one, somebody will take it and convert it to another one, or finish the compilation process.

As I mentioned before, though, IRs can’t always capture the nuances of the source languages, and thus each step can lose information, limiting what optimizations can be done. Some IRs are built for virtual machines that aren’t compatible with others, for example by assuming a tracing garbage collector. Converting from one to the other, or emitting machine code, can be a daunting task.

Please join me on Discord to discuss, or ping me on Mastodon.
