What a compiler does: internal types

If we scratch the surface of a language a bit, we find a secret, rich world of type information. Our compilers know more about our code than we do ourselves. In my previous article I looked at how conversion of basic visible types is done. Here I look at a few of the myriad internal types that play an important role in conversion and optimization.

const, readonly and literal

There are significant differences between variables that cannot be modified, will not be modified, or have a permanent compile-time value. While some languages have a const or final keyword, it is insufficient to distinguish the true nature of the data. Consider the following C code:

int const a = 14;

void foo( int const b ) {
}

The global value a, of type int const, is known at compile-time and can never change. In foo the type of b is also int const, yet it describes something different. Clearly b cannot be known at compile-time. For a compiler this difference is important and thus it must somehow label these types as more than just const.

Internally we might find that a is of type int literal and b is of type int readonly, where literal is a value known at compile-time and readonly is a dynamic value, but one that can only be read in this context. These types are used by the compiler to check the validity of expressions using these variables. They are also used by the optimizer. A literal value, one which is known at compile-time, can be directly inlined where it is used.
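As a rough illustration of the distinction, the sketch below pairs the global a from the earlier example with two hypothetical helpers, use_literal and use_readonly (neither name comes from the original code). An optimizing compiler is free to reduce the first to a constant, while the second has to wait for whatever value the caller passes. Whether the folding actually happens depends on the compiler and its flags.

```c
#include <assert.h>

/* Known at compile time: the "literal" case. */
static int const a = 14;

/* Hypothetical helper: because a can never change, an optimizer may
   reduce this body to `return 28;`. */
int use_literal(void) {
    return a * 2;
}

/* Hypothetical helper: b is read-only inside the function, but its value
   is only known once a caller supplies it at run time. */
int use_readonly(int const b) {
    /* b = 0;   rejected by the compiler: b cannot be modified here */
    return b * 2;
}

/* Small usage check for the two helpers. */
void check_literal_and_readonly(void) {
    assert(use_literal() == 28);
    assert(use_readonly(5) == 10);
}
```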

immutable

Closely related are immutable types, such as Java’s BigDecimal. The keyword immutable is not exposed in the Java language, but I suspect the compiler, or VM, does actually track this status. Knowing that a parameter is immutable gives the optimizer a lot more flexibility in dealing with the value. Consider the following pseudo-code:

foo( type1 a, type2 b ) {
    d = costly expression using a and b
    ...
    e = same costly expression using a and b
}

If the compiler knows that a and b are immutable, it doesn’t need to evaluate the costly expression twice: it can reuse the value of d for e. Thus it makes sense for the compiler to track the immutable type status.
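The reuse described here can be written out by hand in C. This is only a sketch: costly is a hypothetical stand-in for the costly expression, and the calls counter exists purely to make the saved work visible. The counter is itself a side effect, so a real compiler would not be allowed to merge these particular calls on its own.

```c
#include <assert.h>

/* Instrumentation for the demonstration only. */
static int calls = 0;

/* Hypothetical stand-in for the "costly expression using a and b". */
static long costly(long a, long b) {
    ++calls;
    return a * a + b * b;
}

/* What the source code says: evaluate the expression twice. */
long foo_naive(long a, long b) {
    long d = costly(a, b);
    long e = costly(a, b);
    return d + e;
}

/* What an optimizer may do when it knows a and b cannot change
   between the two evaluations: reuse d for e. */
long foo_reused(long a, long b) {
    long d = costly(a, b);
    long e = d;
    return d + e;
}

/* Both versions give the same result, with half the work in the second. */
void check_reuse(void) {
    calls = 0;
    assert(foo_naive(3, 4) == 50);
    assert(calls == 2);

    calls = 0;
    assert(foo_reused(3, 4) == 50);
    assert(calls == 1);
}
```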

It’s too bad that immutable isn’t exposed at the source code level. Being able to mark arbitrary parameters and data as immutable would eliminate a large class of common programming errors.

lvalue and rvalue

Consider the following code in C:

int a;
a = 10;

a has type int. The constant 10 also has type int. Something seems wrong. If both of those truly have the same type the following assignment should also be possible:

10 = a;

But clearly that is wrong. To allow assignment the compiler gives a the type int lvalue. An lvalue represents a location of actual storage. The name comes from “left value” as it’s the part that appears to the left of the assignment operator. For contrast there is also an rvalue type, the part that appears on the right.
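A small sketch of which expressions count as lvalues in C. The function name lvalue_demo and its local variables are made up for illustration. The rejected forms are left in a comment, since including them would stop the code from compiling.

```c
#include <assert.h>

/* Each assignment target below designates storage, so it is an lvalue. */
int lvalue_demo(void) {
    int a;
    int arr[3] = {0, 0, 0};
    int *p = &a;

    a = 10;         /* a names a variable */
    *p = a + 1;     /* *p designates the storage p points to, which is a */
    assert(a == 11);
    arr[1] = a;     /* an array element is storage too */

    /* These do not designate storage and are rejected by the compiler:
         10 = a;
         a + 1 = 5;
    */
    return a + arr[1];
}
```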

Conversion between lvalue and rvalue is perhaps one of the most common conversions done by a compiler, and it’s one that we generally ignore.

Consider what happens in the following code:

int * get_ptr() { ... }

int b = 10;

get_ptr()[4] = b;

get_ptr returns a normal int *, and we want to assign to the value at offset 4 from it. To do this we need an lvalue expression, but don’t yet have one. The index operator [] can resolve to an int lvalue type since it’s pointing to a region of memory. This works even though the pointer itself is not an lvalue.

When the compiler first encounters b it assigns it the type int lvalue. It gets lvalue since it is a variable that can be assigned to. However, since it appears on the right of the assignment it needs to be converted to an rvalue. The conversion from lvalue to rvalue is implicit in virtually all languages.
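To make the snippet above concrete, here is one way it could be filled in. The body of get_ptr is elided in the example, so the static array storage and the wrapper function store_example are assumptions added only so the code can run.

```c
#include <assert.h>

/* Hypothetical backing storage; the example does not say where the
   pointer comes from. */
static int storage[8];

int *get_ptr(void) {
    return storage;
}

void store_example(void) {
    int b = 10;

    /* get_ptr() yields a plain pointer value, not an lvalue, yet indexing
       it produces an int lvalue, so the assignment is valid. On the right,
       b is an lvalue that is implicitly converted to an rvalue: its value
       is read and then stored. */
    get_ptr()[4] = b;

    assert(storage[4] == 10);
}
```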

The conversion from lvalue to rvalue may or may not represent an actual instruction in the compiled code. In some cases the lvalue will be tracked as a pointer and must be dereferenced to get the rvalue. In other cases it may just be a virtual reference to a register, and thus not actually need dereferencing. It’s very hard to make generalizations since the optimizers do a lot of weird and wonderful things.
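The two cases can be contrasted with a pair of hypothetical functions. The comments describe what a typical optimizing compiler is likely to do. The generated code depends on the compiler, target and flags, so treat this as a sketch, not a guarantee.

```c
#include <assert.h>

/* The lvalue *p lives in memory this function does not own, so reading
   it typically compiles to a load through the pointer. */
int read_through_pointer(const int *p) {
    return *p + 1;
}

/* x is a local whose address is never taken. An optimizer may keep it in
   a register, or fold it away entirely, so no dereference is needed. */
int read_local(void) {
    int x = 41;
    return x + 1;
}

/* Both conversions produce the same value; only the cost may differ. */
void check_reads(void) {
    int v = 41;
    assert(read_through_pointer(&v) == 42);
    assert(read_local() == 42);
}
```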

For the purpose of optimization C++11 standardized several additional value types: xvalue, glvalue and prvalue. These are used to support the concept of rvalue references, and allow move operations to take place. A normal user won’t need to worry about this typing much as generally it works in a logical way: the code behaves like it looks.

pure functions

Functions are also subject to internal types. The basic rules of languages don’t usually give enough attention to precisely what a function is doing. Does it modify global state, or does it only read from it? Perhaps it uses only the parameters to calculate a value.

In GCC there is a plethora of attributes that can be given to functions. Not all of them could reasonably be considered part of a function’s type, but “pure” and “const” certainly can be. Unfortunately, since neither C nor C++ allows extended function types, much of this information is likely lost the moment you assign a function to a pointer variable.
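A minimal sketch of the two attributes, using the GCC `__attribute__` syntax (Clang accepts it as well). The functions square, sum and call_through_pointer are made up for illustration. The last one shows the problem with pointers: the type int (*)(int) has nowhere to record that its target is const.

```c
#include <assert.h>
#include <stddef.h>

/* const: the result depends only on the argument values; the function
   reads no memory that could change between calls. */
__attribute__((const)) int square(int x) {
    return x * x;
}

/* pure: no side effects, but the result may depend on memory that is
   read, here the array behind v. */
__attribute__((pure)) int sum(const int *v, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}

/* The parameter type carries no attribute, so inside this function the
   compiler knows nothing special about fn unless it can see the caller. */
int call_through_pointer(int (*fn)(int), int x) {
    return fn(x);
}

void check_attributes(void) {
    int v[] = {1, 2, 3};
    assert(square(7) == 49);
    assert(sum(v, 3) == 6);
    assert(call_through_pointer(square, 5) == 25);
}
```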

Other Types

That’s just a brief look at some of the common internal types used by the compiler. Many are mandated by the language standards, but others are there to improve optimization. What appear to be simple statements in source code are actually rich in type information and undergo many implicit conversions.

If you’d like to learn more about compilers, follow me on Twitter. I have many more things to write on this topic. If there’s something special you’d like to hear about, or you want to arrange a presentation, feel free to contact me.
