Can the Ideal language allow custom operators?

While working on Cloverleaf I came upon the problem of parsing custom operators. Ideally a language would allow custom infix operators, making it perfectly flexible and adaptable. Yet this poses a particular challenge for parsing: you cannot produce a proper syntax tree without tracking semantic information.

The Problem

Consider this simple arithmetic expression involving a few variables and a few common infix operators.

a + b * c

The syntax tree for this expression is:

[+[a][*[b][c]]]

Now suppose we allow custom operators to appear in a language. It doesn’t matter whether these are denoted with symbols or words; the result is the same.

a ⊕ b ⊗ c

This has two possible syntax trees:

[⊕[a][⊗[b][c]]]
[⊗[⊕[a][b]][c]]

Without knowing the precedence and associativity of these operators we cannot tell which operation should be performed first.
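
To make this concrete, here is a minimal precedence-climbing sketch (hypothetical C++ illustration code, not Cloverleaf’s parser): the same token stream produces either tree, decided solely by the precedence table handed to the parser.

// Minimal precedence-climbing sketch: the parse of "a OP b OP c" is decided
// entirely by the precedence table supplied alongside the tokens.
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Returns the tree in the same bracket notation used above, e.g. [+[a][b]].
std::string parse_expr(const std::vector<std::string>& toks, size_t& pos,
                       const std::map<std::string, int>& prec, int min_prec)
{
    std::string lhs = "[" + toks[pos++] + "]";                // operand
    while (pos < toks.size()) {
        auto it = prec.find(toks[pos]);                       // next infix operator
        if (it == prec.end() || it->second < min_prec) break;
        std::string op = toks[pos++];
        // left-associative: the right-hand side must bind strictly tighter
        std::string rhs = parse_expr(toks, pos, prec, it->second + 1);
        lhs = "[" + op + lhs + rhs + "]";
    }
    return lhs;
}

int main() {
    std::vector<std::string> toks = {"a", "⊕", "b", "⊗", "c"};

    std::map<std::string, int> t1 = {{"⊕", 1}, {"⊗", 2}};     // ⊗ binds tighter
    std::map<std::string, int> t2 = {{"⊕", 2}, {"⊗", 1}};     // ⊕ binds tighter

    size_t pos = 0;
    std::cout << parse_expr(toks, pos, t1, 0) << "\n";        // [⊕[a][⊗[b][c]]]
    pos = 0;
    std::cout << parse_expr(toks, pos, t2, 0) << "\n";        // [⊗[⊕[a][b]][c]]
}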

Solutions

No Precedence

Smalltalk takes the approach of simply ignoring all common precedence rules and evaluating left to right. Lisp has no notion of precedence since you essentially build the syntax trees yourself. I consider both of these bad options. While they simplify the life of the parser, they put more work on the programmer and obscure the meaning of the code.
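
As a concrete illustration (arbitrary values, C++ used only for demonstration), strict left-to-right evaluation groups the earlier expression the other way around:

// With conventional precedence a + b * c means a + (b * c);
// under Smalltalk-style left-to-right evaluation it means (a + b) * c.
#include <iostream>

int main() {
    int a = 2, b = 3, c = 4;
    std::cout << a + (b * c) << "\n";   // 14 -- conventional precedence
    std::cout << (a + b) * c << "\n";   // 20 -- strict left-to-right grouping
}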

Keep in mind that precedence rules apply to all operators, not just the math ones. Things like function call syntax, member variable access, comparison operators, dereferencing, and even assignment are all just operators which are handled according to some fixed precedence rules. C laid the basic groundwork for these rules, and most imperative languages hold to this ordering to some degree.

Not using precedence rules is simply not an option. Technically you can never be free of it since all language symbols have some kind of precedence. Yes, even in Smalltalk and LISP.

Semantic Aware Parsing

If you knew the precedence rules of all of these operators it wouldn’t be a problem to parse them correctly. This however means your parser has semantic knowledge — it has to understand previous pieces of the code in order to parse future pieces. If you’ve written parsers before, especially with a parser generator, you know how annoying and troublesome this can be.

Annoyance aside, a structural problem with semantic aware parsing is that it imposes an ordering requirement on the parsing. Given blocks of code A and B, the order in which those blocks are parsed becomes important. This basically forces the header file model on a language, carrying all the baggage of such a model. It also makes it difficult to write tools which are strictly syntax aware. Something as simple as reformatting a file now requires a parser which can fully understand the language rather than just its syntax.

Though it does allow a nice operator syntax, the negative consequences don’t sound very attractive. Perhaps there is an option where only a limited amount of ordering is required; perhaps a pre-parsing phase which defines operators.
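
As a rough sketch of that last idea, suppose a hypothetical “infixl”/“infixr” declaration syntax (invented here purely for illustration): a first pass collects the operator table, and only then are expressions parsed against it.

// Hypothetical pre-parsing pass: scan the module once for operator
// declarations and build the table the expression parser will consult.
// The "infixl NAME PRECEDENCE" syntax is invented for illustration.
#include <map>
#include <sstream>
#include <string>

struct OpInfo { int precedence; bool left_assoc; };

std::map<std::string, OpInfo> collect_operators(const std::string& source)
{
    std::map<std::string, OpInfo> table;
    std::istringstream lines(source);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream words(line);
        std::string keyword, name;
        int prec;
        if (words >> keyword >> name >> prec) {
            if (keyword == "infixl") table[name] = {prec, true};
            else if (keyword == "infixr") table[name] = {prec, false};
        }
    }
    return table;   // handed to the expression parser as its precedence table
}

The ordering requirement is then limited to this one pass: operator declarations must be visible before expressions are parsed, but the declarations themselves can appear in any order.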

No Custom Infix Operators

This essentially becomes the compromise option. The infix operators are hard-coded. While you can override their definitions, the precedence is fixed and cannot be adjusted, and no new infix operators can be created. It is not a very flexible model, but it keeps the parsing simple.

There is a potential clarity benefit to this approach as well. You don’t want to have to guess at what order an expression is evaluated in. I already get lost with C++ syntax at times, sticking in more parentheses than are actually necessary. If we were to allow arbitrary infix operators the situation could get worse.

I like flexibility, but I also like clarity, so I’m not sure of my position on this option. It is, however, an easy option to implement at first and will likely be the approach I take in Cloverleaf.

Hybrid Solution

There is a possibility that only custom operators need follow a rule of zero precedence: built-in operators keep their normal precedence, while all custom operators share the lowest precedence with left-associativity. There is merit in this approach since a lot of custom operators may not participate in complex expressions. Sometimes an infix operator is easier to read than a function call. Consider a few set comparisons:

if( data contains x ) ...
if( first precedes last ) ...

I find this reads better than the functional approach as it is clearer which role each variable plays in the comparison.

if( contains( data, x ) ) ...
if( precedes( first, last ) ) ...

Traditionally OOP has said that the approach to this situation is a member function.

if( data.contains( x ) ) ...
if( first.precedes( last ) ) ...

There are serious limitations here. C++11 has acknowledged this by introducing a series of free functions to replace member functions. Member functions require you to modify the definition of a particular object, even if the behaviour is not fundamental to that object. They also prevent you from writing parameterized functions which work on multiple data types. So while the member function approach does read well, it isn’t sufficient in practice.
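
As a small C++ aside on that last point (the contains() template below is only an illustration, not a proposed language feature), a single free function works across container types, including ones we cannot modify, where a member function cannot:

// One templated free function covers vectors, sets, raw arrays, and any other
// type usable with std::begin()/std::end(); a member function would have to be
// added to each class individually.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
#include <vector>

template <typename Container, typename Value>
bool contains(const Container& data, const Value& x)
{
    return std::find(std::begin(data), std::end(data), x) != std::end(data);
}

int main() {
    std::vector<int> v = {1, 2, 3};
    std::set<std::string> s = {"first", "last"};
    int arr[] = {4, 5, 6};                       // raw arrays have no member functions at all

    std::cout << contains(v, 2) << "\n";         // 1
    std::cout << contains(s, "middle") << "\n";  // 0
    std::cout << contains(arr, 5) << "\n";       // 1
}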

Ideal

I’m uncertain as to what the ideal solution is. Semantic aware parsing certainly offers the best final syntax, but at a significant cost. No precedence results in poor readability, as does disallowing custom operators, though to a lesser degree. The hybrid solution does seem workable, but only by severely limiting the behaviour of custom operators.

Perhaps there is no ideal here, only compromise.

13 Comments

  1. guscost

    Wonderful topic – I love how this stuff seems to be at some ridiculously important triple point where engineering, cybernetics and philosophy all come together. More than ever before, it is irresponsible and dangerous to disregard any of these perspectives when designing an important system, so it is certainly reassuring to come across such a good explanation of the problem.

  2. Rob Grainger

    Haskell does quite a good job here of defining operators that can have both associativity and precedence specified. Personally, I’m in favour of fully definable operators. However, this will probably need management with some form of module – so that “operators” get imported from the module that defines them – with all the normal requirements for collision prevention.

    Consider what happens when “+” is overloaded for both addition (numeric) and concatenation (strings). Should these two have the same associativity and precedence? We’re all used to this from C++, but is that ideal?

    Similarly, in C-based languages, “^” is generally xor, but in mathematically oriented languages it may be more natural for exponentiation. It would be nice if both could be used – depending on the modules/libraries imported.

    That said, I’ve been wrestling with many of these ideas for years and haven’t yet found an “Ideal” solution, just thought I’d throw considerations into the mix.

    1. mortoray

      I’m actually quite against operators having a different meaning in different contexts. Something like ‘+’ in my opinion should mean only numeric addition, making it also do string concatenation introduces ambiguity.

      However, like you said, I think perhaps some kind of “module” should be able to say what operators do what. So that in the context of the module you can change what ‘+’ means, but it’d still be consistent for the entire module.

  3. ghffgh

    Rob oversimplified a bit. “+” in Haskell is not overloaded for strings. It is defined for the Num typeclass, which models rings (so you can add integers, floats, rationals, complex numbers, matrices, polynomials etc.). I have programmed a lot in Haskell, and it’s useful to define your own operators. This way the language is extensible: you can create your own programming abstractions.

    To parse such operators, you just initially do a single pass for infix declarations in a module, and then add them to the parser. There are nine (ten?) possible precedences, which is enough. Order of infix declarations does not matter, since an operator can have only one precedence.

  4. dcow

    You must not be familiar with Prolog. In Prolog, like Lisp, operators are simply syntactic sugar for functions. Operators can be used infix, prefix, and/or postfix, or can be used in a standard function notation. In fact, I do not believe any of Prolog’s operators are ‘hardcoded’ in the sense that they are not accessible through the language. Rather, there are defaults for things people commonly do, such as standard arithmetic or logical operations. Prolog has a precedence for all operators, including custom operators, therefore, you could clear out all operators and redefine them yourself to behave exactly as they already do — or you could redefine them to do whatever you want. You could write an operator-free program, or define custom operators to the point where your program reads like plain english. Pretty cool! I’ll concede that Prolog’s solution where precedence is a pretty arbitrary number is rather unsightly. But, it’s by no means a deal breaker and does work quite well.

    1. mortoray

      I am only lightly familiar with Prolog, and just read a bit more on its operators now. This would fall into the domain of “semantic aware parsing” as far as I can tell. Though this solves the problem of operator precedence it carries with it other problems. I’m not saying it doesn’t work, as C++ also requires semantic aware processing and “works”, just that it isn’t an ideal solution. I would much prefer a language where the syntax-tree can be fully formed without any semantic knowledge.

  5. igmaiz

    I don’t agree with the conclusions of the author.
    IMHO operators without precedence should be the way to go (among other things like fewer primitive types, and more operators: Unicode, since ASCII is a very small set of characters and is obsolete).
    Relying on operator precedence lets bad programmers avoid clarifiers (like parentheses), making the expression less clear.

    int n = 3 + 4 * 5; // 35 or 23 ?
    int m = (3 + 4) * 5; // 35 without any doubt

    The difference? Only two characters (the clarifiers).
    Benefits? You don’t have to know hard-coded parser rules and the code is unambiguous. Explicit things are always clearer.
    Drawbacks? Programmers who love to make unreadable code are done :)

    Now, imagine a big math formula with several operators, but without parentheses:

    double d = 0.5 + 4 * n % 4 ^ m ….; // you need to examine the formula with care

    The same happens in languages that don’t require block markers like {} or begin/end, and in those that don’t have end-of-statement markers like the semicolon:

    int n = a
    + b
    ++c

    This means:
    a) int n = a + b syntax error +c;
    b) int n = a; syntax error ++b; ++c;
    c) int n=a+b; ++c;

    This other expression with clarifiers is unambiguous: int n = (a+b); ++c;

    Strictly speaking, operators are not needed. They are “syntactic sugar” (as dcow said) but they make long expressions less verbose and therefore easier to understand (with clarifiers).

    xPrefix (A) = x A
    xPostfix(A) = A x
    xBinary (A,B) = A x B
    xyNary (A,B,C) = A x B y C

    It’s true that operator addition/overloading, when misused, can be a source of problems, but that also happens with the rest of the language.
    I’ve seen a lot of bad human-made translations from C to Java because the programmer simply doesn’t know the spirit of OOP, and as a result you get messy instead of elegant code. The same happens with operators.

    1. mortoray

      You have to consider non-numeric operators as well. If there were no precedence then simple expressions would become syntactically bloated. Consider a simple example using a field/member access operator, subscript operator, and function call operator, along with math operators.

      //original
      obj.get() + arr[b+c] + d
      
      //without precedence
      ((((obj.get)()) * (arr[(b+c)])) + d)
      

      Given common precedence rules the first one is quite easy to read and doesn’t have any ambiguity. The second one is hard to decipher and I guarantee will lead to more defects rather than fewer.

    2. igmaiz

      Drawbacks of operator overloading/addition (assuming OOP context):

      – Loss of standard
      + You can transform the language into a completely different one by adding a bunch of new operators or overriding existing ones.
      + There are (with ASCII) few standardized operators: +, -, *, /, <=, <, >, >=

      – Loss of simplicity
      + Too many operators can make the language hard to understand

      – Loss of code reuse
      + The code becomes project-dependent due to the classes using operators with user-defined rules.

      – When porting the code to another language…
      + If the target language doesn’t allow operator overloading/addition you must replace all the operators with function calls.

      – The parser and compiler would be more complex
      + adding syntax aliases for certain function calls (operator addition)
      * static type-check (compile-time) (addition/overloading)
      * dynamic type-check (run-time) (addition/overloading)

      – What happens with descendant classes?
      + Do they inherit the operator for that method or not?
      + Can an overridden/virtual method define a new operator? -> Confusing code

  6. igmaiz

    mortoray said:

    //original
    obj.get() + arr[b+c] + d

    //without precedence
    ((((obj.get)()) * (arr[(b+c)])) + d)

    I think you added too many unnecessary parentheses. Without precedence, the expression is parsed left to right:

    obj.get() * (arr[b+c] + d)

    – obj.get() is a call through the VTable (if you don’t overload the period there is no need for extra parentheses; it is parsed left to right)
    – * means operator
    – [b+c] doesn’t need clarifiers because the indexer [] has two tokens setting the expression limits, and therefore arr[b+c] is clear
    – + means operator
    – d

    1. mortoray

      This is not “no precedence” then. It is left-to-right precedence with some tokens having higher priority. This is no different than having no custom operators and removing the precedence priority of the basic arithmetic operators.

  7. gkya

    Although I’m not very knowledgeable about (though willing to study) parsing, I can tell that Scala has a nice way of handling user-defined operators. See section 6.12 of the Scala Language Specification:

    > 6.12.3 Infix Operations
    > An infix operator can be an arbitrary identifier. Infix operators have precedence and
    > associativity defined as follows:
    >
    > The precedence of an infix operator is determined by the operator’s first character.
    > Characters are listed below in increasing order of precedence, with characters on
    > the same line having the same precedence.
    >
    > (all letters)
    > |
    > ^
    > &
    > = !
    > < >
    > :
    > + -
    > * / %
    > (all other special characters)
    >
    > That is, operators starting with a letter have lowest precedence, followed by
    > operators starting with ‘|’, etc.

    1. mortoray

      That’s an interesting approach to establishing precedence. The parser can figure it out without knowing the operator and it keeps everything consistent.

      I’ll definitely ensure the operators I have in Leaf follow such a pattern.
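
      As a rough sketch of how that could look (illustrative C++, and the numeric levels are only an example, not Leaf’s actual table):

      // Precedence derived purely from the operator's first character,
      // following the order quoted above; the numbers themselves are arbitrary.
      #include <cassert>
      #include <cctype>
      #include <string>

      int precedence(const std::string& op)
      {
          unsigned char first = static_cast<unsigned char>(op[0]);
          if (std::isalpha(first)) return 0;              // word operators bind loosest
          switch (first) {
              case '|':                     return 1;
              case '^':                     return 2;
              case '&':                     return 3;
              case '=': case '!':           return 4;
              case '<': case '>':           return 5;
              case ':':                     return 6;
              case '+': case '-':           return 7;
              case '*': case '/': case '%': return 8;
              default:                      return 9;     // all other special characters
          }
      }

      int main() {
          assert(precedence("contains") < precedence("+"));   // a word operator binds loosest
          assert(precedence("+") < precedence("*"));           // arithmetic keeps its usual grouping
      }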
