One way to convert source code to machine code is via “basic blocks”. After parsing and semantic processing, and just prior to machine code generation, each of a program’s functions is split into a series of discrete blocks. Each block is a sequence of assembly-like instructions, usually in a “three address code” form such as LLVM IR. The special thing about a basic block is that it ends with a terminator: either a branching instruction or a function return.
This article is a primer on a variety of situations in which a code tree is converted to a basic block flow. I start with the basics and then move on to more advanced flow control. For this discussion the fine details of what is inside a block are not very relevant: I’ll focus primarily on the connections between blocks rather than on the instructions inside them. I’ll use the current Leaf syntax for the examples.
Basic Conditional
Let’s start with a very simple example. I’ll show both the code and one possible block construction. The Leaf conditional construct may look odd at first, but you can safely think of this as a typical ‘if(cond)’ structure.
[sourcecode]
()->() {
    do cond ? {
        [true-block]
    } else {
        [false-block]
    }
    [post-block]
}
[/sourcecode]
The conditional statement can be reduced to a very simple block flow. It is easy to see how the condition maps to two different paths which then flow back together.
To verify these really are basic blocks we’d have to look at the exact instructions inside each one. Each needs to end in a so-called terminator, and each of these blocks does: the condition block ends with a conditional branch; both the true and false blocks have an unconditional branch to the post-block; the post-block in turn has an unconditional branch to the end. The end block would then contain a return instruction, which is also a terminator.
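To make those terminators concrete, here is a rough sketch of how such a diamond could be assembled with LLVM’s C++ IRBuilder API. The block and function names are placeholders of my own; Leaf’s actual code generation will differ in the details.
[sourcecode]
// Sketch only: building the conditional's block diamond with LLVM's IRBuilder.
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

int main() {
    llvm::LLVMContext ctx;
    llvm::Module mod("example", ctx);
    llvm::IRBuilder<> b(ctx);

    // A ()->() function to hold the blocks.
    auto *fnType = llvm::FunctionType::get(b.getVoidTy(), false);
    auto *fn = llvm::Function::Create(
        fnType, llvm::Function::ExternalLinkage, "conditional", &mod);

    auto *entry   = llvm::BasicBlock::Create(ctx, "entry", fn);
    auto *trueBB  = llvm::BasicBlock::Create(ctx, "true-block", fn);
    auto *falseBB = llvm::BasicBlock::Create(ctx, "false-block", fn);
    auto *postBB  = llvm::BasicBlock::Create(ctx, "post-block", fn);
    auto *endBB   = llvm::BasicBlock::Create(ctx, "end", fn);

    // Condition block: ends with a conditional branch (a terminator).
    b.SetInsertPoint(entry);
    llvm::Value *cond = b.getInt1(true);  // stand-in for the real condition
    b.CreateCondBr(cond, trueBB, falseBB);

    // True and false blocks: unconditional branch to the post-block.
    b.SetInsertPoint(trueBB);
    b.CreateBr(postBB);
    b.SetInsertPoint(falseBB);
    b.CreateBr(postBB);

    // Post-block: unconditional branch to the end block.
    b.SetInsertPoint(postBB);
    b.CreateBr(endBB);

    // End block: the return instruction is also a terminator.
    b.SetInsertPoint(endBB);
    b.CreateRetVoid();

    mod.print(llvm::outs(), nullptr);
}
[/sourcecode]
Printing the module shows the five blocks, each ending in exactly one terminator.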
Add a defer
Defer statements are now a popular feature of programming languages. In C++ you have destructors; in Java/Python you have a finally clause; and in other languages you have an explicit “defer” statement. They all serve to execute some code at the end of the current scope.
[sourcecode]
()->() {
    defer { [later-block] }
    do cond ? {
        [true-block]
    } else {
        [false-block]
    }
    [post-block]
}
[/sourcecode]
Compiling this defer statement seems relatively simple: insert the defer block just before the end block. In the next examples I’ll show how it gets complicated when combined with other flow control statements.
Add return
In the presence of defer statements, return statements cannot immediately end the current function. Instead they have to process the defer blocks and then return.
[sourcecode]
()->() {
    defer { [later-block] }
    do cond ? {
        [true-block]
        return;
    } else {
        [false-block]
    }
    [post-block]
}
[/sourcecode]
In this diagram the “return flow” is marked with the green line. By “return flow” I mean the path which is followed from the point a return statement is executed. Notice that the “later-block” is entered by both the normal flow and the return flow. These two distinct flows exist until the end of the function.
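As a loose C++ analogy, with gotos standing in for block branches and placeholder functions standing in for the blocks, the two flows might be lowered like this:
[sourcecode]
// Sketch: both the normal flow and the return flow pass through the
// defer body ("later-block") before the function can end.
#include <cstdio>

void true_block()  { std::puts("true-block"); }
void false_block() { std::puts("false-block"); }
void post_block()  { std::puts("post-block"); }
void later_block() { std::puts("later-block"); }  // the defer body

void with_return(bool cond) {
    if (cond) {
        true_block();
        goto later;    // "return flow": skips the post-block
    } else {
        false_block();
    }
    post_block();      // normal flow
later:
    later_block();     // every way out of the function runs the defer body
}

int main() {
    with_return(true);
    with_return(false);
}
[/sourcecode]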
Add another defer and early return
This example builds upon the previous one by adding another conditional and another defer statement. No fundamentally new structure is required for these additions.
[sourcecode]
()->() {
    defer { [later-block-a] }
    do cond-a ? {
        return;
    }
    defer { [later-block-b] }
    do cond-b ? {
        [true-block]
        return;
    } else {
        [false-block]
    }
    [post-block]
}
[/sourcecode]
Alternate approaches to the same flow
I have mentioned that the basic block diagrams are “possible” structures. There are always other ways to compile the same flow; so long as the meaning of the original code is not altered, a compiler can basically do whatever it wants. Indeed, the basic block structure is chosen by compilers because it is a good form for an optimizer. Since the logic at the basic block level is so simple, the optimizer can make good decisions about what is happening and how to change it.
For example, start with this simple conditional which returns a value.
[sourcecode]
()->() {
    var a = 5;
    do cond ? {
        a += 2;
    } else {
        a += 1;
    }
    return a;
}
[/sourcecode]
In the diagram below we can see how an optimizer might translate this structure. The compiler at first emits the typical diamond-shaped flow of a conditional. The optimizer then sees that it would be better to put the return directly in the true/false blocks: it is both less code and faster than branching to a common return. From there it sees that ‘a’ now has a constant value on each path, and replaces the variable with those constants. This again produces smaller code by eliminating a local variable and the addition instructions.
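A rough source-level analogy of that rewrite, in C++ purely for illustration, since the optimizer actually works on the block form rather than on source:
[sourcecode]
// What the compiler first emits, conceptually: a diamond that merges back
// into a single return.
int before(bool cond) {
    int a = 5;
    if (cond) a += 2; else a += 1;
    return a;
}

// After the optimizer sinks the return into each branch and folds the
// constants, the local variable and the additions disappear.
int after(bool cond) {
    if (cond) return 7;
    return 6;
}

int main() {
    // Both versions compute the same values: 7 when cond is true, 6 otherwise.
    return (before(true) == after(true) && before(false) == after(false)) ? 0 : 1;
}
[/sourcecode]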
Many cases aren’t as clear as this one. Combining blocks can sometimes improve speed at the cost of increased code size; splitting blocks can do the opposite, reducing code size at the cost of slower execution.
A simple loop
A loop creates a cycle in the block structure. If you’ve ever worked with graphs before, you know how such a change in topology greatly alters operations on the graph. For the compiler this loop doesn’t pose much of a problem, but for the optimizer it becomes a critical aspect of its block analysis.
[sourcecode]
() -> () {
    loop cond ? {
        [loop-block]
    }
}
[/sourcecode]
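In the same goto-as-branch style as before, a hedged sketch of that cycle looks like this:
[sourcecode]
// Sketch: the loop condition block, the loop body, and the back edge that
// creates a cycle in the block graph.
#include <cstdio>

bool cond()       { static int n = 3; return n-- > 0; }  // stand-in condition
void loop_block() { std::puts("loop-block"); }

void simple_loop() {
check:                        // condition block
    if (!cond()) goto done;   // conditional branch: into the body or out
    loop_block();
    goto check;               // back edge: this branch forms the cycle
done:
    return;
}

int main() { simple_loop(); }
[/sourcecode]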
Add a defer and break
Now add a defer, a return and a break to the loop and it starts to get interesting.
[sourcecode]
() -> () {
    loop cond-a ? {
        defer { [later-block] }
        [loop-block-a]
        do cond-b ? {
            break;
        }
        do cond-c ? {
            return;
        }
        [loop-block-b]
    }
    [post-block]
}
[/sourcecode]
Along the top of the blocks, the flow is quite simple to follow: each condition introduces a conditional branch. The tricky part happens at ‘later-block’, where there are three incoming flows and three outgoing flows. The purple “break flow” coming into ‘later-block’ must exit via the same purple “break flow” path, the green “return flow” follows the green path, and the black “sequential flow” follows the black path.
At a high level this is easy, but now we have to consider the instruction level. That single ‘later-block’ requires a way to switch between three different output paths. For this we add a ‘switch’ branching instruction (in LLVM this is an actual SwitchInst). Since this is a basic block, the switch must be the last instruction, and the only branching instruction, in the block.
The ‘switch’ instruction needs a condition value: the value which decides which branch to take. This value obviously can’t be set inside ‘later-block’ itself, which means the preceding blocks must set it. This is where the diagram is hiding something: the preceding blocks can’t simply append that store, since after their terminator (the branch) they can’t issue any more instructions. The true graph must therefore have small extra blocks in between the conditionals and ‘later-block’.
The ‘flow’ in the diagram is an actual local variable of the function being compiled. ‘loop-block-b’ doesn’t need to set it since it is already set in the first block of the loop body. I mentioned before that an optimizer can restructure the blocks, and here it has another opportunity to do so. Perhaps this is a highly critical loop, or ‘later-block’ does something quite trivial; the optimizer could decide this ‘flow’ variable isn’t worth it and restructure the blocks by duplicating ‘later-block’.
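Here is a hedged sketch of that lowering in the same goto-as-branch style, with the ‘flow’ variable, the small blocks that only set it, and the switch at the end of ‘later-block’. The enum values and helper names are placeholders of my own:
[sourcecode]
// Sketch: lowering the loop with a 'flow' variable that tells the single
// 'later-block' (the defer body) which way to leave. The small blocks that
// only set 'flow' are the ones the diagram hides.
#include <cstdio>

bool cond_a() { static int n = 4; return n-- > 0; }  // stand-in conditions
bool cond_b() { return false; }
bool cond_c() { return false; }
void loop_block_a() { std::puts("loop-block-a"); }
void loop_block_b() { std::puts("loop-block-b"); }
void later_block()  { std::puts("later-block"); }   // the defer body
void post_block()   { std::puts("post-block"); }

enum Flow { sequential = 0, break_flow = 1, return_flow = 2 };

void lowered_loop() {
    Flow flow;
check:
    if (!cond_a()) goto post;
    flow = sequential;            // set in the first block of the body, so
                                  // loop-block-b never needs to touch it
    loop_block_a();
    if (cond_b()) goto set_break;
    if (cond_c()) goto set_return;
    loop_block_b();
    goto later;

set_break:                        // extra block: records the break flow
    flow = break_flow;
    goto later;
set_return:                       // extra block: records the return flow
    flow = return_flow;
    goto later;

later:
    later_block();                // the defer body runs once per exit
    switch (flow) {               // the SwitchInst-style terminator
    case sequential:  goto check; // back to the loop condition
    case break_flow:  goto post;  // continue after the loop
    case return_flow: goto end;   // leave the function
    }

post:
    post_block();
end:
    return;
}

int main() { lowered_loop(); }
[/sourcecode]
Duplicating ‘later-block’ along each path, as described above, would remove the ‘flow’ variable and the switch entirely, at the cost of emitting the defer body three times.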
Exception handling
Throwing an exception just adds another path, like “break” or “return”, and is not interesting on its own. The moment an exception is caught, however, the block structure bloats. Each function call which could throw an exception effectively becomes a branching instruction in the eyes of the basic block model, which means that every call which might need to be handled is represented as a block branch.
In the current Leaf syntax “on any” is a generic catch-all error handler (Leaf errors aren’t strictly exceptions, but the same block flow applies).
[sourcecode]
() -> () {
    do {
        call_a();
        call_b();
    } on any {
        call_c();
        call_d();
    }
    [post-block]
}
[/sourcecode]
Any exception occurring in the first two calls will be caught and flow passes to our handler. Errors from the calls inside the handler, however, are not caught here, so if they throw, the exception propagates out of this function. One path of particular interest is within the exception handler: notice how the path from ‘call_c’ and ‘call_d’ to ‘post-block’ is a normal flow arrow. The handler has effectively handled the exception, thus flow reverts to normal.
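One way to picture “every call becomes a branch” is the sketch below. It assumes an error-code style lowering, which is only an illustration; a real exception implementation (LLVM’s invoke and landingpad, for instance) uses a different mechanism but produces the same flow shape:
[sourcecode]
// Sketch: each call that might raise an error is followed by a branch to the
// handler block. Error-code style lowering assumed, names are placeholders.
#include <cstdio>

bool call_a() { std::puts("call_a"); return true; }  // true = success
bool call_b() { std::puts("call_b"); return true; }
void call_c() { std::puts("call_c"); }   // handler calls: errors here would
void call_d() { std::puts("call_d"); }   // propagate out of the function
void post_block() { std::puts("post-block"); }

void handled() {
    if (!call_a()) goto handler;   // every fallible call ends its block
    if (!call_b()) goto handler;   // with a conditional branch
    goto post;

handler:
    call_c();                      // the error is now handled...
    call_d();
                                   // ...so flow falls through normally
post:
    post_block();
}

int main() { handled(); }
[/sourcecode]
That reverting to normal flow leads us to the final topic.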
Diverted Flow
The last example showed the flow changing back to normal sequential flow in the handler. A similar, though temporary, change can also happen in any ‘defer’ block. Again, these are blocks that are executed as scopes are exited. During the execution of a defer the flow must be sequential, but once it is done it must revert to whatever flow existed prior to entry.
For an example consider this code with a ‘break’ statement and a ‘defer’. For simplicity no exceptions will be thrown.
[sourcecode]
()->() {
    loop cond-a ? {
        defer {
            call_a
            call_b
        };
        do cond-b ? {
            break;
        };
    };
}
[/sourcecode]
Every exit from the loop body here goes through the ‘defer’ code. Between ‘call_a’ and ‘call_b’ the flow is always sequential; however, how we entered ‘call_a’ determines how we leave the ‘call_b’ block. As with the multiple exits earlier, this actually requires a flow variable, and ‘break’ would need another block which sets that variable. I’ve simplified here to show the essence of what happens rather than the lowest-level details.
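A compact sketch of that, reusing the flow-variable lowering from earlier (again with placeholder names, and an inlined flow-setting step where a real compiler would add a separate block):
[sourcecode]
// Sketch: inside the defer body the flow is plain sequential (call_a then
// call_b); only afterwards does the saved 'flow' value divert it again.
#include <cstdio>

bool cond_a() { static int n = 2; return n-- > 0; }  // stand-in conditions
bool cond_b() { return false; }
void call_a() { std::puts("call_a"); }
void call_b() { std::puts("call_b"); }

enum Flow { sequential = 0, break_flow = 1 };

void loop_with_defer() {
    Flow flow;
check:
    if (!cond_a()) goto done;
    flow = sequential;
    if (cond_b()) flow = break_flow;   // the extra block 'break' would need

    call_a();                          // the defer body: always sequential
    call_b();                          // from call_a through call_b

    switch (flow) {                    // revert to whatever flow entered
    case sequential: goto check;
    case break_flow: goto done;
    }
done:
    return;
}

int main() { loop_with_defer(); }
[/sourcecode]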
Blocking Up
This was just meant as a general introduction to blocks. They are the means by which high-level code is converted to a lower level suitable for optimization. The scenarios covered are quite basic but account for the vast majority of flows. The notable flows not covered are ‘continue’ and ‘goto’. ‘continue’ is handled much like ‘break’, except that it branches back to the loop condition instead of exiting the loop. ‘goto’ is a nasty flow, and I’ll just leave it at that for now.
Not many functions are ever as simple as these diagrams. Functions which combine a variety of flow constructs, especially ‘defer’, ‘throw’ and ‘break’ together, and worse yet require diverted flow, result in confounding, spaghetti-like diagrams. I’m currently in the process of debugging such flows in my Leaf compiler. Writing this article was an escape, at least temporarily, from the madness.