Tags

, , , ,

While working on Cloverleaf I came upon a problem of parsing custom operators. I would like to say that a language would allow custom infix operators, to be perfectly flexible and adaptable. Yet it poses a particular challenge for parsing: you cannot produce a proper syntax tree without tracking semantic information.

The Problem

Consider this simple arithmetic expression involving a few variables and a few common infix operators.

a + b * c

The syntax tree for this expression is:

[+[a][*[b][c]]]

Now suppose we allow custom operators to appear in a language. It doesn’t matter whether these are denoted with symbols or words, the result is the same.

a ⊕ b ⊗ c

This has two possible syntax trees:

[⊕[a][⊗[b][c]]] [⊗[⊕[a][b]][c]]

Without knowing the precedence and associativity of these operators we cannot tell which operation should be performed first.

Solutions

No Precedence

Smalltalk takes the approach of simply ignoring all common precedence rules and just evaluating left to right. Lisp has no notion of precdence since basically you have to build your syntax trees on your own. Both of these I consider bad options. While they simplify the life of the parser they put more work on the programmer and obscure the meaning of the code.

Keep in mind that precedence rules apply to all operators, not just the math ones. Things like function call syntax, member variable access, comparison operators, derefencing, and even assignment are all just operators with are handled according to some fixed precedence rules. C laid the basic groundwork for these rules, and most imperative languages hold to this ordering to some degree.

Not using precedence rules is simply not an option. Technically you can never be free of it since all language symbols have some kind of precedence. Yes, even in Smalltalk and LISP.

Semantic Aware Parsing

If you knew the precedence rules of all of these operators it wouldn’t be a problem to parse them correctly. This however means your parser has semantic knowledge — it has to understand previous pieces of the code in order to parse future pieces. If you’ve written parsers before, especially with a parser generator, you know how annoying and troublesome this can be.

Annoyance aside, a structural problem with semantic aware parsing is that it implies an ordering requirement on the parsing. Given blocks of code A and B it becomes important in which order those blocks are parsed. This basically forces the header file model on a language, carrying all the baggage of such a model. It also makes it difficult to write tools which are strictly syntax aware. Something simple like reformatting a file now requires a parser which can fully understand the language rather than just its syntax.

Though it does allow a nice operator syntax, the negative consequences don’t sound very attractive. Perhaps there is an option where only a limited amount ordering is required; perhaps a pre-parsing phase which defines operators.

No Custom Infix Operators

This essentially becomes the compromise option. The infix operators are hard-coded. While you can override their definitions, the precendence is fixed and cannot be adjusted. No new infix operators can be created. It is not a very flexible model, but it keeps the parsing simple.

There is a potential clarity benefit to this approach as well. You don’t want to have to be guessing at what order an expression is evaluated. I already get lost with C++ syntax at times, sticking in more parenthesis than are actually necessary. If we were to allow arbitrary infix operators the situation could get worse.

I like flexibility, but I also like clarity, so I’m not sure of my position on this option. Though it is an easy option to implement at first and will likely be the approach I take in Cloverleaf.

Hybrid Solution

There is a possibility that only custom operators need follow the rule of zero precedence. Other operators have normal precedence and all custom operators are zero precedence with left-associativity. There is merit in this approach since a lot of custom operators may not participate in complex expressions. Sometimes an infix operator is easier to read than a function call. Consider a few set comparisons:

if( data contains x ) ...
 if( first precedes last ) ...

I find this reads better than the functional approach as it is clearer which role each variable plays in the comparison.

if( contains( data, x ) ) ...
 if( precedes( first, last ) ) ...

Traditionally OOP has said the approach to this situation is a member function.

if( data.contains( x ) ) ...
 if( first.precedes( last ) ) ...

There are serious limitations here. C++11 has acknowledged this by introducing a series of free functions to replace member functions. Member functions requires you to modify the definition of a particular object, even if the behaviour is not fundamental to that object. It also prevents you from writing parameterized functions which work on multiple data types. So while the member function approach does read well, it isn’t sufficient in practice.

Ideal

I’m uncertain as to what the ideal solution is. Semantic aware parsing certainly offers the best final syntax, but at a significant cost. No precedence results in poor readability, as does no custom operators, but to a lesser degree. The hybrid solution does seem workable, but with a very limited allowance of the behaviour of custom operators.

Perhaps there is no ideal here, only compromise.