Tags

, , , , , , ,

Parser generators, like ANTLR or Bison, seem like great tools. Yet when I have to write a parser I now tend to steer clear of them, resorting to writing one manually. When I posted my first entry about Cloverleaf I was asked why I don’t use these tools. It’s a fair question as these tools take care of lot of work involved in parsing. In theory they are great tools. In practice I find a lot to be desired and end up fighting with the tool more than using it.

Lexing and Context

One key aspect that bothers me with many of the tools is the requirement to have a distinct first lexing phase. What this means is that your source is first converted to a stream of tokens and then the tree generator can work from this stream. This is a fast approach to parsing. However it has a serious limitation that it requires your tokens to be completely context free.

At first this may not sound too bad, but it quickly complicates the parsing part. This is especially true of domain specific languages where it is very convenient to vary tokens based on context. Consider a very simple example using this text file:

Name Edaqa
Age 37
Group 15-B
Phone +49.(0).123.456

Each line has a tag which identifies what type of data that follows it. If you had to lex this first you wouldn’t be able to come up with a satisfactory way to handle the Age, Group, and Phone numbers. You’d be forced to accept a more generic string and post-parse it after the tree is generated. This to me doesn’t seem like a good approach; I find context-free lexing to be a serious limitation on parsing.

Dynamic lexer tokens are also problematic to support. This may sound unnecessary, but is surprisingly common. Consider something as simple as the Bash or Perl heredoc “<<END”, or C++11’s raw string support. These require the lexer to use part of the opening token as the terminator to the token. Indeed, to just identify the start of the heredoc notation could require semantic information.

Shift/Reduce and Grammar Conflicts

From my time using YACC I will never forget shift-reduce errors. While ANTLR did improve greatly what is accepted, I still found myself faced with conflicts, or grammars which simply did not parse the way I intended. These generators expect your grammar to be written in a particular way, and unless you can think like the generator you’re bound to have problems. The limitations of what the generator can understand are not theoretical limitations, they can affect even the most mundane of grammars.

This is the primary area where fighting with the generator occurs. In some cases the reported error was correct, there was indeed a conflict in the grammar. However, in many cases there was no ambiguity. Or rather, the source language I was trying to parse had an unambiguous parsing solution. It was at times extremely difficult to convince the generator to parse it the way I wanted.

On a few occasions I was never able to convince the generator of what I wanted and had to alter the syntax of the language to accommodate it. This is terrible. The language should not be altered to fit the whims of a particular parsing tool. Sure, you do have to worry about having an ambiguous language and an efficient parser, but there are many ways to do this without having to break the language.

Syntax Tree

What comes out of parser generator code is an abstract syntax tree that follows the grammar you have entered. Usually this is not the exact syntax tree you wish to have. Instead you’d like to reorder nodes, collapse a few, and expand others. Changing the tree structure can greatly reduce the burden of further processing.

When I used ANTLR for one of my projects I was happy to discover that tree reordering was supported. This is definitely a very useful feature. Still, I felt that it was limited and had troubles getting some of the structures which I actually wanted. I really don’t want to need a post-processing phase which massages the resulting tree. In a hand-coded parser it is relatively straight forward to modify the tree structure to whatever you want at parsing time.

Mixed Code

At first the promise of having a pure grammar seems really appealing. You don’t have to worry about target language constructs and ideally multiple projects could use the grammar in a variety of languages. The trouble is that ultimately all my grammars have not been functional without adding target language code. There are just too many things the generators don’t handle. Your grammar ends up being filled up with C, C++, Java, or whatever language you are targeting.

The idea of mixing code may also seem like a good idea. The trouble is that you only have bits and pieces and it is rather disconnected from the wrapping code. I have a hard time determining exactly what is happening — what is the context of all this code, or what scope do the variables have? Not only is the target language code not clear, the original grammar code is also so cluttered that it also isn’t clear anymore. Plus you’ve created such a tight bond between the grammar and the target language that maintenance becomes problematic.

Other Limitations

Quite often you’ll have sections of text which just need to be parsed differently than other sections. This goes beyond simple lexing changes. For example, if you are doing a C compiler you will also want to support inline assembly, which has an entirely different syntax. This seems completely antagonistic to most parser generators I’ve seen.

Getting location information is a hassle, and never seems natively supported. In particular, when a parsing error does occur, it is hard to get the parser to indicate where in the source file that it failed. Often the structure of the parser, whatever mechanism it uses, prevents it from identifying the error in a reasonable location (it has simply tried another rule instead of indicating an error). Even when it does parse it can be difficult to get the line and column number.

So I don’t use them

I don’t doubt that an expert in a particular generator would be able to overcome at least a few the issues I have. I’m not sure I should have to be an expert though to gain access to what I consider as fundamental, or necessary features. I always consider the goal of a domain specific tool, which these are, is to make my job easier, not harder. So far I haven’t found one where that is the case (I’ve been through Yacc, Bison and ANTLR v3).

It is with a bit of regret that I don’t use such generators. They have some nice technology in them and can generate fast parsers. Each time I sit down and start writing another map of regexes and switch blocks I start longing for a better way. Unfortunately, a simple recursive descent parser has always served me well.

About these ads