Tags
antlr, bison, lexing, parser, parser generator, parser generators, Programming, yacc
Parser generators, like ANTLR or Bison, seem like great tools. Yet when I have to write a parser I now tend to steer clear of them, resorting to writing one manually. When I posted my first entry about Cloverleaf I was asked why I don’t use these tools. It’s a fair question as these tools take care of lot of work involved in parsing. In theory they are great tools. In practice I find a lot to be desired and end up fighting with the tool more than using it.
Lexing and Context
One key aspect that bothers me with many of the tools is the requirement to have a distinct first lexing phase. What this means is that your source is first converted to a stream of tokens and then the tree generator can work from this stream. This is a fast approach to parsing. However it has a serious limitation that it requires your tokens to be completely context free.
At first this may not sound too bad, but it quickly complicates the parsing part. This is especially true of domain specific languages where it is very convenient to vary tokens based on context. Consider a very simple example using this text file:
Name Edaqa Age 37 Group 15-B Phone +49.(0).123.456
Each line has a tag which identifies what type of data that follows it. If you had to lex this first you wouldn’t be able to come up with a satisfactory way to handle the Age, Group, and Phone numbers. You’d be forced to accept a more generic string and post-parse it after the tree is generated. This to me doesn’t seem like a good approach; I find context-free lexing to be a serious limitation on parsing.
Dynamic lexer tokens are also problematic to support. This may sound unnecessary, but is surprisingly common. Consider something as simple as the Bash or Perl heredoc “<<END”, or C++11′s raw string support. These require the lexer to use part of the opening token as the terminator to the token. Indeed, to just identify the start of the heredoc notation could require semantic information.
Shift/Reduce and Grammar Conflicts
From my time using YACC I will never forget shift-reduce errors. While ANTLR did improve greatly what is accepted, I still found myself faced with conflicts, or grammars which simply did not parse the way I intended. These generators expect your grammar to be written in a particular way, and unless you can think like the generator you’re bound to have problems. The limitations of what the generator can understand are not theoretical limitations, they can affect even the most mundane of grammars.
This is the primary area where fighting with the generator occurs. In some cases the reported error was correct, there was indeed a conflict in the grammar. However, in many cases there was no ambiguity. Or rather, the source language I was trying to parse had an unambiguous parsing solution. It was at times extremely difficult to convince the generator to parse it the way I wanted.
On a few occasions I was never able to convince the generator of what I wanted and had to alter the syntax of the language to accommodate it. This is terrible. The language should not be altered to fit the whims of a particular parsing tool. Sure, you do have to worry about having an ambiguous language and an efficient parser, but there are many ways to do this without having to break the language.
Syntax Tree
What comes out of parser generator code is an abstract syntax tree that follows the grammar you have entered. Usually this is not the exact syntax tree you wish to have. Instead you’d like to reorder nodes, collapse a few, and expand others. Changing the tree structure can greatly reduce the burden of further processing.
When I used ANTLR for one of my projects I was happy to discover that tree reordering was supported. This is definitely a very useful feature. Still, I felt that it was limited and had troubles getting some of the structures which I actually wanted. I really don’t want to need a post-processing phase which massages the resulting tree. In a hand-coded parser it is relatively straight forward to modify the tree structure to whatever you want at parsing time.
Mixed Code
At first the promise of having a pure grammar seems really appealing. You don’t have to worry about target language constructs and ideally multiple projects could use the grammar in a variety of languages. The trouble is that ultimately all my grammars have not been functional without adding target language code. There are just too many things the generators don’t handle. Your grammar ends up being filled up with C, C++, Java, or whatever language you are targeting.
The idea of mixing code may also seem like a good idea. The trouble is that you only have bits and pieces and it is rather disconnected from the wrapping code. I have a hard time determining exactly what is happening — what is the context of all this code, or what scope do the variables have? Not only is the target language code not clear, the original grammar code is also so cluttered that it also isn’t clear anymore. Plus you’ve created such a tight bond between the grammar and the target language that maintenance becomes problematic.
Other Limitations
Quite often you’ll have sections of text which just need to be parsed differently than other sections. This goes beyond simple lexing changes. For example, if you are doing a C compiler you will also want to support inline assembly, which has an entirely different syntax. This seems completely antagonistic to most parser generators I’ve seen.
Getting location information is a hassle, and never seems natively supported. In particular, when a parsing error does occur, it is hard to get the parser to indicate where in the source file that it failed. Often the structure of the parser, whatever mechanism it uses, prevents it from identifying the error in a reasonable location (it has simply tried another rule instead of indicating an error). Even when it does parse it can be difficult to get the line and column number.
So I don’t use them
I don’t doubt that an expert in a particular generator would be able to overcome at least a few the issues I have. I’m not sure I should have to be an expert though to gain access to what I consider as fundamental, or necessary features. I always consider the goal of a domain specific tool, which these are, is to make my job easier, not harder. So far I haven’t found one where that is the case (I’ve been through Yacc, Bison and ANTLR v3).
It is with a bit of regret that I don’t use such generators. They have some nice technology in them and can generate fast parsers. Each time I sit down and start writing another map of regexes and switch blocks I start longing for a better way. Unfortunately, a simple recursive descent parser has always served me well.
Recursive descent means you have to re-write significant chunks of your parsing code when you make changes to your language, no?
Also, this raises the question: How are existing major languages parsed? I thought C and C++ used one of the gnu parses.. No? Due to a typo I now want to create a project called ngu
Recursive descent makes reworking a language very straight-forward. You usually only need to rework a few functions to support the new sytnax, either a new token, or a new structure at that point. You change about as much as the code as the change in the language you’ve made.
I don’t know what most compilers use, but I’d suspect they don’t use generators. Primarily since generators which can even handle something like C++ are still quite new (though it is still a challenge).
What about libraries like PEGTL that allow the grammar to be embedded in the source?
I’ve not had a lot of experience with these (Boost Spirit being the one I briefly looked at). On the surface they might address the mixed code issue, but when I look at examples that doesn’t appear to be the case. The C++ template stuff is so different from normal C++ it’s almost a different language — but it actually isn’t, so that’s a plus.
They also wouldn’t make the grammar language indenpendent.
Whether they satisfy the needs for lexer freedom and syntax trees is hard to tell without an in-depth look. There is nothing ineherent about them being source code based that would address these issues.
Another benefit to rolling your own parser is that you can have much more precise control over the error messages generated in the event of syntax errors. A parser generator can only know so much about the intent of your grammar, and hence can only offer a limited degree of user-friendly diagnostics.
Agreed. I think this was one problem I had once. The generator couldn’t decide whether the rule just didn’t match (and thus must backtrack) or the user had made a syntax error.
Thanks for this. For one of my projects I hand-code three parsers (nothing fancy, but not trivial) and I’ve always had a nagging fear someone examining my code would think I was being dumb for doing it the hard way.
For one of the parsers it isn’t line and column I need, but byte offset into the source file.
For the other parsers, I rolled my own because frankly it just seemed easier. I like being able to debug the parsing directly instead of debugging the generating of the parser.
Pingback: Making a programming language: Part 3 – adding features | My programming escapades
When writing a parser for language with a more or less stable grammar, one can afford coding the parser by hand (mostly for performance reasons). Howevwer, such parser is coded once and for all.
But there’re projects where you have to use parser generators, for your grammar is changing from day to day (or becomes totally new, or you have too many grammars).
I completely agree that it’s hard and inconvenient to use parser generators, and you may spend more time resolving conflicts than hand-coding parser, and the resulting parser may be slow and ugly.
The only case I would ultimately prefer parser generator is parsing regular expressions with RE2C.
My experience has been that changing a well written hand-made parser to be far easier than maintaining a parser generator file. Each time I’ve used a generator I end up hitting a point where I cannot remove or modify the grammar easily without breaking.
Rewriting parts of recursive descent parser, when language or format it conforms to has changed a bit, may be a lot easier, then modifying parser generator file.
But it 1) you have to totally rewrite your parser every day (as a planned part of developing process) 2) grammar changes in layout, not in complexity, it may be far easier to rewrite grammar file.
IMHO, parser generators are ecpecially useful for parsing regexps: it’s easy to define regexps, and hard to build a complex DFA. And mostly you can’t use regexps for anything greater than lexing, so you won’t face conflicts changing your grammar.
As part of my Leaf project I continually change parts of my grammar. Though perhaps for highly regular languages, those which are nearly context-free and require no look-ahead a generator may be better. But even for something like math expression evaluation I’d still choose hand-written.