Why I don’t use a Parser Generator

Parser generators, like ANTLR or Bison, seem like great tools. Yet when I have to write a parser I now tend to steer clear of them and write one manually instead. When I posted my first entry about Cloverleaf I was asked why I don’t use these tools. It’s a fair question, as these tools take care of a lot of the work involved in parsing. In theory they are wonderful. In practice I find a lot to be desired and end up fighting with the tool more than using it.

Lexing and Context

One key aspect that bothers me with many of these tools is the requirement to have a distinct first lexing phase. That is, your source is first converted to a stream of tokens, and then the tree generator works from this stream. This is a fast approach to parsing. However, it has a serious limitation: it requires your tokens to be completely context-free.

At first this may not sound too bad, but it quickly complicates the parsing part. This is especially true of domain-specific languages, where it is very convenient to vary tokens based on context. Consider a very simple example using this text file:

Name Edaqa
Age 37
Group 15-B
Phone +49.(0).123.456

Each line has a tag which identifies the type of data that follows it. If you had to lex this first you wouldn’t be able to come up with a satisfactory way to handle the Age, Group, and Phone values. You’d be forced to accept a more generic string and post-parse it after the tree is generated. That doesn’t seem like a good approach to me; I find context-free lexing to be a serious limitation on parsing.
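
A hand-written parser avoids the problem entirely, because it can decide how to read the value after it has seen the tag. Here is a minimal sketch of that idea, assuming the tags and value formats from the example above (the parse_value helper is hypothetical, purely for illustration):

#include <iostream>
#include <sstream>
#include <string>

// Parse the value part of a line based on the tag that preceded it.
// How the rest of the line is lexed depends entirely on that context.
static bool parse_value(const std::string& tag, std::istringstream& rest) {
    if (tag == "Age") {
        int age;
        return static_cast<bool>(rest >> age);       // expect an integer
    }
    if (tag == "Group") {
        std::string group;                           // e.g. "15-B": digits, dash, letter
        rest >> group;
        return group.find('-') != std::string::npos;
    }
    if (tag == "Phone") {
        std::string phone;                           // digits plus '+', '.', '(', ')'
        rest >> phone;
        return phone.find_first_not_of("+().0123456789") == std::string::npos;
    }
    std::string text;                                // Name and unknown tags: free text
    std::getline(rest, text);
    return true;
}

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        std::string tag;
        in >> tag;                                   // the tag decides how to lex the rest
        if (!parse_value(tag, in))
            std::cerr << "bad value for tag " << tag << "\n";
    }
}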

Dynamic lexer tokens are also problematic to support. This may sound like an exotic need, but it is surprisingly common. Consider something as simple as the Bash or Perl heredoc “<<END”, or C++11’s raw string literals. These require the lexer to use part of the opening token as the terminator of the token. Indeed, just identifying the start of the heredoc notation can require semantic information.
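
In a hand-written lexer this is only a few lines: read the terminator out of the opening token, then consume lines until that terminator reappears. A rough sketch of the shape of the problem (this is not Bash’s exact heredoc grammar, and lex_heredoc is an invented name):

#include <cstddef>
#include <optional>
#include <string>

// Scan a heredoc starting at 'pos'. The terminator comes from the opening
// token itself: "<<END" means "read until a line containing only END".
std::optional<std::string> lex_heredoc(const std::string& src, size_t& pos) {
    if (src.compare(pos, 2, "<<") != 0)
        return std::nullopt;
    size_t name_start = pos + 2;
    size_t name_end = src.find('\n', name_start);
    if (name_end == std::string::npos)
        return std::nullopt;
    std::string terminator = src.substr(name_start, name_end - name_start);

    std::string body;
    size_t line_start = name_end + 1;
    while (line_start < src.size()) {
        size_t line_end = src.find('\n', line_start);
        if (line_end == std::string::npos) line_end = src.size();
        std::string line = src.substr(line_start, line_end - line_start);
        if (line == terminator) {                    // dynamic terminator ends the token
            pos = line_end;
            return body;
        }
        body += line;
        body += '\n';
        line_start = line_end + 1;
    }
    return std::nullopt;                             // unterminated heredoc
}

int main() {
    std::string src = "<<END\nhello\nworld\nEND\n";
    size_t pos = 0;
    auto body = lex_heredoc(src, pos);               // *body == "hello\nworld\n"
    return body ? 0 : 1;
}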

Shift/Reduce and Grammar Conflicts

From my time using YACC I will never forget shift/reduce errors. While ANTLR greatly improved on what is accepted, I still found myself faced with conflicts, or with grammars that simply did not parse the way I intended. These generators expect your grammar to be written in a particular way, and unless you can think like the generator you’re bound to have problems. The limitations of what the generator can understand are not merely theoretical; they can affect even the most mundane of grammars.

This is the primary area where fighting with the generator occurs. In some cases the reported error was correct: there was indeed a conflict in the grammar. In many cases, however, there was no ambiguity; or rather, the source language I was trying to parse had an unambiguous parsing solution. It was at times extremely difficult to convince the generator to parse it the way I wanted.

On a few occasions I was never able to convince the generator of what I wanted and had to alter the syntax of the language to accommodate it. This is terrible. The language should not be altered to fit the whims of a particular parsing tool. Sure, you do have to worry about ambiguity in the language and the efficiency of the parser, but there are many ways to handle both without breaking the language.
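
To make that concrete: the usual escape hatch in a hand-written recursive descent parser is to remember the position, try one interpretation, and rewind if it fails. That is awkward to express in many generator grammars, but it is a few lines by hand. A bare-bones sketch of the pattern (the token strings and rules are invented for illustration):

#include <cstddef>
#include <string>
#include <vector>

struct Parser {
    std::vector<std::string> tokens;
    size_t pos = 0;

    bool match(const std::string& t) {
        if (pos < tokens.size() && tokens[pos] == t) { ++pos; return true; }
        return false;
    }

    // Two hypothetical rules that share a prefix: "id id ;" vs "id ( ) ;".
    bool parse_declaration() { return match("id") && match("id") && match(";"); }
    bool parse_call()        { return match("id") && match("(") && match(")") && match(";"); }

    // The hand-written answer to a "conflict": try one rule, rewind, try the other.
    bool parse_statement() {
        size_t saved = pos;
        if (parse_declaration()) return true;
        pos = saved;                                 // unbounded lookahead via backtracking
        return parse_call();
    }
};

int main() {
    Parser p;
    p.tokens = {"id", "(", ")", ";"};
    return p.parse_statement() ? 0 : 1;              // parses as a call
}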

Syntax Tree

What comes out of a parser generator is a syntax tree that follows the grammar you have entered. Usually this is not the exact tree you wish to have. Instead you’d like to reorder some nodes, collapse a few, and expand others. Changing the tree structure can greatly reduce the burden of further processing.

When I used ANTLR for one of my projects I was happy to discover that tree reordering was supported. This is definitely a very useful feature. Still, I felt it was limited and had trouble getting some of the structures I actually wanted. I really don’t want to need a post-processing phase that massages the resulting tree. In a hand-coded parser it is relatively straightforward to produce whatever tree structure you want at parsing time.
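
To illustrate what shaping the tree at parse time looks like, here is a small sketch with a hypothetical node type: the parenthesis rule returns the inner expression directly, and a chain of additions comes out as a single flat node rather than a nested one.

#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical AST: one node kind with children.
struct Node {
    std::string kind;
    std::vector<std::unique_ptr<Node>> children;
};

struct Parser {
    std::vector<std::string> tokens;
    size_t pos = 0;

    std::unique_ptr<Node> parse_primary() {
        if (tokens[pos] == "(") {                    // collapse: "( expr )" returns expr itself,
            ++pos;                                   // so no Paren node ever appears in the tree
            auto inner = parse_sum();
            ++pos;                                   // skip ")"
            return inner;
        }
        auto leaf = std::make_unique<Node>();
        leaf->kind = tokens[pos++];
        return leaf;
    }

    std::unique_ptr<Node> parse_sum() {
        auto first = parse_primary();
        if (pos >= tokens.size() || tokens[pos] != "+")
            return first;                            // single operand: no wrapper node at all
        auto sum = std::make_unique<Node>();
        sum->kind = "sum";                           // flatten "a + b + c" into one n-ary node
        sum->children.push_back(std::move(first));
        while (pos < tokens.size() && tokens[pos] == "+") {
            ++pos;
            sum->children.push_back(parse_primary());
        }
        return sum;
    }
};

int main() {
    Parser p;
    p.tokens = {"a", "+", "(", "b", ")", "+", "c"};
    auto tree = p.parse_sum();                       // one node: sum(a, b, c)
    return tree->children.size() == 3 ? 0 : 1;
}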

Mixed Code

At first the promise of having a pure grammar seems really appealing. You don’t have to worry about target-language constructs, and ideally multiple projects could use the grammar from a variety of languages. The trouble is that none of my grammars has ever been functional without adding target-language code. There are just too many things the generators don’t handle. Your grammar ends up filled with C, C++, Java, or whatever language you are targeting.

Mixing code into the grammar may also sound fine at first. The trouble is that you only have bits and pieces, and they are rather disconnected from the wrapping code. I have a hard time determining exactly what is happening — what is the context of all this code, and what scope do the variables have? Not only is the target-language code unclear, the grammar itself becomes so cluttered that it is no longer clear either. Plus you’ve created such a tight bond between the grammar and the target language that maintenance becomes problematic.

Other Limitations

Quite often you’ll have sections of text which simply need to be parsed differently than other sections. This goes beyond simple lexing changes. For example, if you are writing a C compiler you will also want to support inline assembly, which has an entirely different syntax. This seems completely antagonistic to most parser generators I’ve seen.
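
With a hand-written parser this is just a function call: when the statement parser sees an assembly block it hands the text to a separate routine with its own rules. A sketch of the hand-off (the "asm { ... }" syntax and the trivial assembly handling are invented for illustration):

#include <cstddef>
#include <iostream>
#include <string>

// A completely different "grammar" for the inline-assembly region. Here it just
// captures everything up to the closing brace, but it could be a full parser.
std::string parse_asm_block(const std::string& src, size_t& pos) {
    size_t end = src.find('}', pos);
    std::string body = src.substr(pos, end - pos);
    pos = (end == std::string::npos) ? src.size() : end + 1;
    return body;
}

void parse_statements(const std::string& src, size_t& pos) {
    while (pos < src.size()) {
        if (src.compare(pos, 5, "asm {") == 0) {     // switch syntax entirely for this region
            pos += 5;
            std::cout << "asm:" << parse_asm_block(src, pos) << "\n";
        } else {
            ++pos;                                   // real statement parsing elided
        }
    }
}

int main() {
    std::string src = "x = 1; asm { mov eax, 1 } y = 2;";
    size_t pos = 0;
    parse_statements(src, pos);                      // prints "asm: mov eax, 1 "
}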

Getting location information is a hassle and never seems to be natively supported. In particular, when a parsing error does occur, it is hard to get the parser to indicate where in the source file it failed. Often the structure of the parser, whatever mechanism it uses, prevents it from identifying the error at a reasonable location (it has simply tried another rule instead of indicating an error). Even when the input does parse it can be difficult to get line and column numbers.
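
In a hand-written parser, tracking location is unexciting: count lines and columns as characters are consumed, and every error message knows exactly where it is. A minimal sketch:

#include <iostream>
#include <stdexcept>
#include <string>

// A character source that always knows its line and column.
struct Source {
    std::string text;
    std::size_t offset = 0;
    int line = 1, column = 1;

    char peek() const { return offset < text.size() ? text[offset] : '\0'; }

    char next() {
        char c = peek();
        ++offset;
        if (c == '\n') { ++line; column = 1; } else { ++column; }
        return c;
    }

    [[noreturn]] void fail(const std::string& what) const {
        throw std::runtime_error("line " + std::to_string(line) +
                                 ", column " + std::to_string(column) + ": " + what);
    }
};

int main() {
    Source src;
    src.text = "tag =\n  oops";
    try {
        while (src.peek() != '\0') {
            if (src.peek() == 'p') src.fail("unexpected 'p'");   // location is always at hand
            src.next();
        }
    } catch (const std::exception& e) {
        std::cerr << e.what() << "\n";               // "line 2, column 5: unexpected 'p'"
    }
}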

So I don’t use them

I don’t doubt that an expert in a particular generator would be able to overcome at least a few of the issues I have. But I’m not sure I should have to be an expert to get what I consider fundamental, even necessary, features. The goal of a domain-specific tool, which these are, is to make my job easier, not harder. So far I haven’t found one where that is the case (I’ve been through Yacc, Bison, and ANTLR v3).

It is with a bit of regret that I don’t use such generators. They have some nice technology in them and can generate fast parsers. Each time I sit down and start writing another map of regexes and switch blocks I start longing for a better way. Unfortunately, a simple recursive descent parser has always served me well.

35 replies

  1. Recursive descent means you have to rewrite significant chunks of your parsing code when you make changes to your language, no?

    Also, this raises the question: how are existing major languages parsed? I thought C and C++ used one of the GNU parsers… no? Due to a typo I now want to create a project called ngu :)

    • Recursive descent makes reworking a language very straightforward. You usually only need to rework a few functions to support the new syntax, whether it’s a new token or a new structure at that point. You change about as much of the code as you’ve changed of the language.

      I don’t know what most compilers use, but I’d suspect they don’t use generators, primarily because generators which can even handle something like C++ are still quite new (and even then it’s a challenge).

    • I’ve not had a lot of experience with these (Boost Spirit being the one I briefly looked at). On the surface they might address the mixed-code issue, but when I look at examples that doesn’t appear to be the case. The C++ template stuff is so different from normal C++ it’s almost a different language — but it actually isn’t, so that’s a plus.

      They also wouldn’t make the grammar language-independent.

      Whether they satisfy the needs for lexer freedom and syntax trees is hard to tell without an in-depth look. There is nothing inherent about them being source-code based that would address these issues.

  2. Another benefit to rolling your own parser is that you can have much more precise control over the error messages generated in the event of syntax errors. A parser generator can only know so much about the intent of your grammar, and hence can only offer a limited degree of user-friendly diagnostics.

    • Agreed. I think this was one problem I had once. The generator couldn’t decide whether the rule just didn’t match (and thus must backtrack) or the user had made a syntax error.

  3. Thanks for this. For one of my projects I hand-coded three parsers (nothing fancy, but not trivial) and I’ve always had a nagging fear that someone examining my code would think I was being dumb for doing it the hard way.

    For one of the parsers it isn’t line and column I need, but byte offset into the source file.

    For the other parsers, I rolled my own because frankly it just seemed easier. I like being able to debug the parsing directly instead of debugging the generation of the parser.

  4. When writing a parser for a language with a more or less stable grammar, one can afford to code the parser by hand (mostly for performance reasons). However, such a parser is coded once and for all.

    But there are projects where you have to use parser generators, because your grammar changes from day to day (or becomes totally new, or you have too many grammars).

    I completely agree that it’s hard and inconvenient to use parser generators, that you may spend more time resolving conflicts than hand-coding a parser, and that the resulting parser may be slow and ugly.

    The only case where I would ultimately prefer a parser generator is parsing regular expressions with RE2C.

    • My experience has been that changing a well-written hand-made parser is far easier than maintaining a parser generator file. Each time I’ve used a generator I end up hitting a point where I cannot remove or modify part of the grammar without something breaking.

    • Rewriting parts of a recursive descent parser, when the language or format it conforms to has changed a bit, may be a lot easier than modifying a parser generator file.

      But if 1) you have to totally rewrite your parser every day (as a planned part of the development process), or 2) the grammar changes in layout rather than in complexity, it may be far easier to rewrite a grammar file.

      IMHO, parser generators are especially useful for parsing regexps: it’s easy to define regexps, and hard to build a complex DFA by hand. And mostly you can’t use regexps for anything bigger than lexing, so you won’t face conflicts when changing your grammar.

    • As part of my Leaf project I continually change parts of my grammar. Perhaps for highly regular languages, those which are nearly context-free and require no look-ahead, a generator may be better. But even for something like math expression evaluation I’d still choose hand-written.

  5. I’m not a compiler expert, though I have written a couple of parsers, sometimes for some quite complex grammars, ranging from hand-written recursive descent parsers, lex/yacc, Irony parsers (a C# parser where the grammar is declared in the language), GoldParser, ANTLR v3 and ANTLR v4. While there are indeed some valid points in this post, I would like to revisit them with ANTLR v4, as the new version is really able to tackle some of your concerns nicely and efficiently.

    1) Context lexer/parser: ANTLR v4 comes with predicates to modify the grammar or the behavior of the lexer and/or parser based on dynamic context. You can even switch the grammar into contextual modes, to accept specific tokens for part of the grammar at specific points, etc. There is also the option of developing a “filter” lexer on top of the existing generated lexer in order to generate or modify the flow of tokens (for example, to correctly handle indentation-based grammars like Python).

    2) Shift/Reduce and Grammar Conflicts: definitely yes, with either lex/yacc or ANTLR v3 I sometimes had a serious fight to handle a grammar correctly. But ANTLR v4 is just fantastic for this, as it allows left-recursive grammars (with a very small restriction). Take a look at the ANTLR v4 grammar for Java, for example (https://github.com/antlr/grammars-v4/blob/master/java/Java.g4); the way an expression is described is ridiculously easy. So far, working with ANTLR v4, I have not been hit by any kind of grammar conflict.

    3) Concerning the syntax tree, I usually don’t use it directly, as I always write an AST that is independent of the parser. Afaik, ANTLR v4 doesn’t have tree-rewriting capabilities, but the new listener/visitor patterns make it quite easy to build a custom AST.

    4) Mixed code: I agree. But again, ANTLR v4 allows you to write a grammar almost entirely decoupled from the generated code (look at the Java grammar). You can still use custom embedded actions in ANTLR, but they are most of the time unnecessary and can be done entirely outside the grammar definition. Predicates in the grammar are usually quite light and don’t hurt the readability of the grammar.

    5) Concerning parser context modes, as I said, they work nicely in ANTLR v4. For source location, I have not experienced many of your problems with most of the parsers I have been using. I think ANTLR v4 is quite good at hitting the right spot, column-wise.

    Plus, on things you don’t mention, and with ANTLR v4 in mind, I would say:
    – Creating a full, effective grammar from scratch takes just a few hours for a quite complex language (languages like C#, Java, Python, etc. are really easy to describe). You are much more productive this way.
    – Developing an efficient token lexer by hand is laborious, tedious work; having a generator that creates optimized DFA tables for this task greatly simplifies it.
    – Error recovery is quite impressive in ANTLR v4. Achieving the same level of recovery (automatic token deletion, missing token insertion, continuing to match a rule list, etc.) in a hand-written parser is quite difficult (particularly with backtracking).
    – Developing an efficient hand-written parser is not always as easy as it seems. Particularly when you need to backtrack, the performance of a recursive descent parser can suffer badly when this is not handled correctly (and that is absolutely not a simple task). Worse, the code of a recursive descent parser can become much more difficult to maintain and understand, because of all the grammar ambiguities and performance hacks needed to handle them.

    So for all these reasons, if you care about productivity, I strongly encourage developers to use a parser generator like ANTLR v4.

    I would still recommend developing some hand-written parsers in some specific cases like:
    – For learning how to write a parser :)
    – For performance reasons, though quite often the bottleneck in a parser is the lexer; naive hand-written lexers will most certainly perform much worse than an efficient generated DFA-table lexer.

    • Productivity is a big reason why I don’t like the generators. When you first start a project you can put together a parser really fast (with ANTLR at least; other generators not so fast). But then you hit a complex point and spend more hours trying to fix the generator than you would have spent writing your own code. Though, granted, if you’re not already experienced with writing parsers the generator approach will likely be faster.

      All you say about ANTLR4 sounds promising, but it reinforces my feeling that the generators are always playing catch-up. They always need new additions and extensions to handle new languages and language changes. It’s a legitimate fear that if you use a generator you’ll inevitably hit some syntax you wish to implement but can’t.

      With Leaf I’m cheating a bit. Since it’s a whole new language I’m ensuring the grammar is friendly to parse. It’s recursive descent but doesn’t require any back-tracking (well, debatable since I have a converter stage that actually rewrites one parse tree). Once I have a reasonably stable syntax I might try writing it in a generator. I suspect it’ll work in ANTLR4 based on your description.

  6. Take a look at CoCo/R. The produced code is readable and traceable in the debugger. You resolve LL(1) conflicts with ‘if’ statements. As far as not wanting to mix code and grammar — I understand. I like to keep the code embedded in the grammar as simple as possible; the real work is done elsewhere.

    • I’m not sure I want to resolve the LL(1) conflicts; I just want them not to be there. That is, in many cases I seem to get conflicts even though my source language doesn’t appear to have them, and a recursive descent parser wouldn’t need the ‘if’ statement (the context allows only one possible answer).

    • Each tool I used was selected after considering many popular options. Each was considered a good tool at the time. Should a new tool arrive that solves all of my problems, that would be great! I still need parsers and would be happy not to write my own. My experience so far, however, tells me the likelihood of finding such a solution is still quite low.

  7. Have you considered using a parsing expression grammar (PEG) parser? They inherently eliminate the separate lexing phase and shift/reduce conflicts, and in general work much like a recursive descent parser that you would write by hand. I first discovered them a few months ago and have found them quite pleasant to work with. I’m using ‘pt’ (parse tools) from the Tcl standard library to generate parsers on the fly, or fast C-coded ones.

  8. For all the reasons you mention (and some more), I’ve built my own parser library (if you’re interested, it’s here: https://github.com/axilmar/parserlib).

    Highlights of my library: a) separation of grammar from AST, b) handling of left recursion.

    The parser is a PEG parser.

    • Since I started writing my Leaf compiler, the list of things a generator would have to do for me has grown quite a bit. For example, I want to do expression parsing based on a generic syntax combined with a table of priorities for the operators. Would a PEG parser be capable of that, or would I have to hard-code the precedence rules in the grammar by making each operator an explicit rule?

  9. Just a small comment. Those shift/reduce (and more importantly, reduce/reduce) conflicts? Those are *bugs* in your parser. While I agree that fighting them is a pain, and can be particularly frustrating in an LR-style parser generator, shifting to another tool in order to avoid being *told* about your bugs isn’t a good response to those errors.

    No tool is perfect, but ANTLR is pretty good. Also, because it is in the LL family of parser generators, its generated code and its diagnostics tend to be closer to what the user naïvely expects, and therefore easier to correct.

    • I won’t disagree they are bugs in my parser. But they only exist because of the technology that was chosen. It’s very frustrating to have a grammar you know is parsable but can’t figure out how to massage it into the format of the parser generator.

      With recursive descent parsers written by hand I rarely have the issue of being unable to parse something the way I want.

  10. I agree with OP: you shouldn’t spend your time wrestling with your parser generator. I disagree that writing recursive descent parsers is the right answer for maintainable results.

    Semantic Designs’ (my company) DMS Software Reengineering Toolkit uses GLR parsers. GLR can parse any context-free language; it handles arbitrary lookahead and local ambiguity (the source of many shift/reduce and reduce/reduce issues with LR parsing) without any extra help. DMS also has a very strong lexer that allows multiple lexing modes, can inquire about the parser state, and can provide the GLR parser with multiple interpretations of a bit of text (this is incredibly handy for “keywords in context”).

    We’ve built some 40-odd parsers for a huge variety of language syntaxes with this, including nasty things such as PL/I (“every keyword can be used as a variable name”), C++14 (famously claimed to be hard to parse), COBOL, Fortran, Java, … Yes, it does require some custom help in some languages, but far less than LL or LR parsing schemes. Being able to change the grammar at will and generate a parser that works 99% of the time is a huge benefit.

    DMS is also a bit odd: it builds syntax trees that are isomorphic to the grammar. It does remove useless unit productions and terminals that don’t carry values, and it forms lists from productions that produce lists. The resulting ASTs are close enough in practice to what you want as an AST. This is a huge time saver when working with large grammars (COBOL and C++ fit this description).

    • I appreciate that there may be parser generators now that can solve some of the problems I have encountered. Though from reading comments here, and elsewhere on this topic, I’m not convinced that any of them can universally solve the issues encountered.

      I disagree that a recursive descent parser cannot be maintainable. It’s just like any other code: if written well it can be clean, easy to extend, and easy to understand.

      For me it’s an issue of time investment. If it takes me just as long to make a working grammar in the specialty language then I’m not sure it has really saved me anything. I also shouldn’t have to be an expert in the parser generator to create a functioning parser. Things like line numbers and error detection and recovery should also be standard.

  11. Hi, Mortoray.

    Your parser article is interesting indeed, although I prefer to have separate lexical modules and syntax modules, generated by the tools you describe.

    Nevertheless, I agree there are several ways to accomplish the same task; this applies specifically to programming language design and related compiler tools.

    My university thesis was a Lex/Yacc replacement with a procedural, Pascal-style syntax, at a time when Bison & Flex didn’t exist.

    It didn’t matter, because the Flex & Yacc generators I had available were unnecessarily complex, so I ended up crafting my own…

    I have been investigating “plain C” and “C++” lexers & parsers; most of them are crafted by hand, or only partially tool-generated.

    Therefore, your article seems to point in the right direction.

    By the way, my tool generates both “pure C” and “procedural Pascal”, because most tools generate only “pure C”.

    Cheers.

    • The parser I have now for Leaf still kind of separates the lexical and syntactic structure. Instead of a linear lexer I’d call it more of a tree lexer. It understands a bit about the language structure and produces a very generic tree output. It’s still recursive descent, of course.

      After that I pass it through what I call a converter. This is basically the tree transformation pass, which produces the typed AST as output.

      It comes out at around 2500 lines of C++ code, which is okay for a rich language. Of course the language is also being designed with parsing in mind. Hopefully the standard parser generators will be able to handle it without too much difficulty.

  12. Hi mortoray

    Sorry about being late to the party – If you are still looking for a means to parse expressions easily, where the precedence of the operators is given by a simple enumeration, and the whole lot can be slotted into an otherwise recursive-descent framework, I found this article a while back, and it was inspirational for me:

    Google “Bob Nystrom pratt parser”

    The author presents a Java implementation of Top Down Operator Precedence parsing, which is essentially a design pattern that can augment recursive descent.

    The item of interest is the “Expression parseExpression(int precedence) {…}” function towards the end of the article.

    IMO it’s a reasonably clear presentation, although in my implementation I did not create all those objects for the operators. Instead, I just consumed the next token from the input stream with my various hand-coded lexer routines and did two switch() statements: one to handle the prefix operators, and a second one inside the while() loop to handle the infix / postfix operators.
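
    For anyone who wants the shape of that approach without reading the full article, here is a compact sketch of the same switch-based style (single-character tokens and a handful of operators, purely for illustration; it is not the code from the article):

    #include <cstddef>
    #include <iostream>
    #include <string>

    struct ExprParser {
        std::string src;
        size_t pos = 0;

        char peek() { return pos < src.size() ? src[pos] : '\0'; }
        char next() { return pos < src.size() ? src[pos++] : '\0'; }

        static int precedence(char op) {
            switch (op) {
                case '+': case '-': return 1;
                case '*': case '/': return 2;
                default:            return 0;            // not an infix operator
            }
        }

        // Keep consuming infix operators as long as they bind at least as
        // tightly as the caller's level.
        int parseExpression(int minPrec = 1) {
            int left;
            char t = next();
            switch (t) {                                  // prefix switch
                case '-': left = -parseExpression(precedence('*') + 1); break;   // unary minus
                case '(': left = parseExpression(); next(); break;               // next() eats ')'
                default:  left = t - '0';                 // single-digit number
            }
            while (precedence(peek()) >= minPrec) {       // infix switch inside the loop
                char op = next();
                int right = parseExpression(precedence(op) + 1);
                switch (op) {
                    case '+': left += right; break;
                    case '-': left -= right; break;
                    case '*': left *= right; break;
                    case '/': left /= right; break;
                }
            }
            return left;
        }
    };

    int main() {
        ExprParser p;
        p.src = "2+3*(4-1)";
        std::cout << p.parseExpression() << "\n";         // prints 11
    }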

  13. Mortoray said: “For me it’s an issue of time investment. If it takes me just as long to make a working grammar in the specialty language then I’m not sure it’s really saved me anything. ”

    True. But then you’re assuming the specialty language is harder to learn, and doesn’t solve the problem better.

    As I said, we are using GLR. Our grammar productions are written in an incredibly simple BNF style, with *all* grammar rules of the form:

    LHS = RHS1 RHS2 … RHSn ;

    where LHS is a nonterminal and RHSk may be nonterminals or terminals.
    Our only “process rule” is that we construct lists by writing a pair of grammar rules:

    LHS = ;            (the base case may be empty, or a single element)
    LHS = LHS element ;

    That’s not much to learn.

    Regarding effectiveness: the underlying GLR engine (“full context-free parsing”) means that the following situations are handled without any further effort:

    * Left recursion (you can see it in our standard list-style rule pair)
    * Right recursion (you can write that if you want)
    * Arbitrary lookahead
    * Ambiguous syntax (note: for C++, the phrase “A*B;” is ambiguous, and you *do* have to parse it)
    * Construction of the tree (done by the engine)
    * Conversion to an “AST” (done by the engine, by automatically removing nodes from the tree that can be reconstructed from the grammar)

    This makes you incredibly productive compared to most automated generators (which are usually named for their limitations, e.g., LL(2), LALR(1), LR(k), …). And it means you don’t have to face these issues in a hand-written parser.

    I think your objections are not supported by our experience. I’ll repeat: we have done C++ this way, and C++ is considered incredibly tough to parse. Mostly we used the grammar from the back of the C++ standard; we vary from it because the real compilers don’t match the standard.

  14. The main concerns raised in this article are all valid… but I think my own parser generator LLLPG can solve all of them, because it’s designed to generate a particular style of output code that is meant to resemble hand-written code, and because it isn’t a control freak – you can override it when you need to. You should still be able to “think like the generator”, but if you write a parser by hand you end up sort of thinking like a parser generator anyway, only less accurately.

    I put the rest of this message in a blog post: http://loyc.net/2016/why-dont-you-use-a-parser-generator.html

  15. It is actually pretty easy to get around the context-dependent lexer problem. In the textbooks, parsing is presented as a pipeline, where the input goes into the lexer and then the tokens go into the parser. But in fact the lexer and the parser can be seen as co-routines. In an LR parser, when the parser calls the lexer it knows exactly which tokens are legal, and it can pass that information to the lexer, allowing it to do whatever is appropriate to get the next token.

    Actually, if your language has a lot of reserved words, with anything that is not a reserved word being an identifier, lexer generators using the standard regular-expression techniques produce tables that get very large in a hurry. Lexers are also way easier to write by hand than parsers, and are easy to pick up and modify for a new language. But the shift/reduce conflicts yacc/bison report indicate places where the language requires additional context to be understood. Using recursive descent can eliminate the messages, but not the underlying complexity.
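
    A sketch of that arrangement: the parser tells the lexer what it can accept at this point, and the same input is tokenized differently depending on the answer (the Expect hint and the tiny format are invented for illustration):

    #include <cctype>
    #include <cstddef>
    #include <iostream>
    #include <string>

    enum class Expect { Keyword, FreeText };              // hint the parser gives the lexer

    struct Lexer {
        std::string src;
        size_t pos = 0;

        // The same input lexes differently depending on what the parser can accept.
        std::string next(Expect expect) {
            while (pos < src.size() && src[pos] == ' ') ++pos;
            size_t start = pos;
            if (expect == Expect::FreeText) {
                pos = src.find(';', pos);                 // swallow everything up to ';'
                if (pos == std::string::npos) pos = src.size();
            } else {
                while (pos < src.size() &&
                       std::isalpha(static_cast<unsigned char>(src[pos]))) ++pos;
            }
            return src.substr(start, pos - start);
        }
    };

    int main() {
        Lexer lex;
        lex.src = "title some text; with punctuation;";
        std::cout << lex.next(Expect::Keyword) << "\n";   // "title"
        std::cout << lex.next(Expect::FreeText) << "\n";  // "some text"
    }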
