Gagallium : Left-recursive versus right-recursive lists in LR parsers

Contrary to top-down (LL) parsers, which do not support left recursion, bottom-up (LR) parsers support both left recursion and right recursion. When defining a list-like construct, a user of an LR parser generator, such as Menhir, faces a choice between these two flavors. Which one should she prefer?

Two considerations guide this choice: expressiveness (which flavor leads to fewer conflicts?) and performance (which flavor leads to a more efficient parser?).

In this post, I am mainly interested in discussing expressiveness. I also comment on performance in the setting of Menhir.

As we will see, the bottom line is that neither formulation seems deeply preferable to the other.

An example where left recursion is preferable

This example is inspired by the syntax of C variable-argument functions. Suppose we wish to parse a list of arguments separated with commas. Furthermore, after this list, we wish to allow an optional ellipsis.

arguments:
  separated_list(COMMA, arg) vararg

vararg:
  /* epsilon */
| COMMA ELLIPSIS

There is a potential problem because the token COMMA appears both as a delimiter within separated_list and as the first symbol of vararg.

If one uses a left-recursive definition of separated_list, everything is fine (for reference, the code for left- and right-recursive definitions is listed at the end of this post). When an LR(1) parser encounters COMMA, it shifts, which means that it explores in parallel two possibilities: either this is the beginning of a new list element, or this is the beginning of a vararg. Looking at the next token, which should be either the beginning of an arg or ELLIPSIS, resolves this choice, and all is well.

If one uses a right-recursive definition of separated_list, then a conflict appears. When an LR(1) parser encounters the very first COMMA, it must immediately decide whether this COMMA announces the continuation of the list (in which case it should shift) or is the beginning of a vararg (in which case it should reduce what it has read up to this point to a list).

This example has been encountered in real life by a user of Menhir, and is also described online (Practical Considerations for LALR(1) Grammars). The author says: “left recursion is good, right recursion is bad”, and indeed this example seems to support this motto. However, there also situations where the converse is true, as shown by the next example.

An example where right recursion is preferable

Suppose we (again) wish to parse a list whose elements are separated with commas. Furthermore, we wish to allow an optional trailing comma to appear at the end of the list.

There are several ways of expressing this. An interesting view is this: either the comma is a separator, or it is a terminator. Thus, one writes:

arguments:
  LPAREN separated_or_terminated_list(COMMA, arg) RPAREN

separated_or_terminated_list(sep, elem):
  nonempty_list(terminated(elem, sep))
| separated_nonempty_list(sep, elem)

terminated(elem, sep):
  elem sep

Perhaps surprisingly, if one uses right-recursive definitions of nonempty_list and separated_nonempty_list, everything is fine. An LR(1) parser reads the whole list without ever reducing. It shifts all the way through, which means that it explores in parallel the two possibilities: either this is a terminated list, or this is a separated list. At the very end, when it finds either COMMA RPAREN or just RPAREN, it decides between the two alternatives, and performs all of the required reductions.

If one uses left-recursive definitions of nonempty_list and separated_nonempty_list, a conflict appears. When an LR(1) parser encounters the very first COMMA, it must immediately decide whether this COMMA is part of the first element in a terminated list (in which case it should shift) or is the first separator in a separated list (in which case it should reduce what it has so read so far to a separated list).

Moral of the story

The idea that “left recursion is good, right recursion is bad” is a myth, at least in terms of conflicts. There are ways of working with left recursion and ways of working with right recursion; they are just not the same.

A few words on flexible lists

So, what is the right way of describing a flexible list, where the final delimiter is optional?

One way is to describe it as a separated list, followed with an optional delimiter. In this case, a left-recursive version of separated lists must be used, for the reasons explained in the first part of this post. Here it is in the syntax of Menhir:

%inline flexible_list(delim, X):
  xs = separated_llist(delim, X) delim?
    { xs }

The definition of separated_llist is given at the end of this post.

Another way is to give a direct recursive definition, with two base cases and a single recursive case. In this case, it seems that we do not have a choice: this definition is naturally right-recursive.

right_flexible_list(delim, X):
| (* nothing *)
    { [] }
| x = X
    { [x] }
| x = X delim xs = right_flexible_list(delim, X)
    { x :: xs }

In the end, I like this definition, because I think it is the simplest and most direct. It leads to a smaller LR automaton and (potentially) to simpler syntax error messages.

Symmetrically, one can define a left_flexible_list where delimiters precede elements and where the first delimiter is optional. The various ways of doing this are left as an exercise for the reader!

A few words on performance.

The GNU Bison manual says “you should always use left recursion, because it can parse a sequence of any number of elements with bounded stack space”.

In Menhir, the LR stack is heap-allocated, so there is no hard limit on its size. If one wishes to parse a list on file and construct a list in memory, where the first list element on file becomes the first list element in memory, then both approaches are equivalent. In the right-recursive approach, the parser pushes O(n) elements onto the stack, then pops them off and constructs the desired list. In the left-recursive approach, the parser uses only O(1) stack space but constructs a reversed list, which takes up O(n) space and must be reversed afterwards.

The only two scenarios where left recursion may be more efficient is if one wishes to construct a reversed list (in which case left recursion allows saving a constant factor) or if one wishes to process the list elements on the fly, without allocating a list at all (in which case left recursion allows working in space O(1) instead of O(n)).

Code for separated lists (for reference)

Here is a right-recursive version of separated lists, as currently found in Menhir’s standard library:

%inline separated_list(separator, X):
  xs = loption(separated_nonempty_list(separator, X))
    { xs }

loption(X):
  /* nothing */
    { [] }
| x = X
    { x }

separated_nonempty_list(separator, X):
  x = X
    { [ x ] }
| x = X; separator; xs = separated_nonempty_list(separator, X)
    { x :: xs }

Here is a left-recursive version of separated lists:

reverse_separated_nonempty_llist(separator, X):
  x = X
    { [ x ] }
| xs = reverse_separated_nonempty_llist(separator, X); separator; x = X
    { x :: xs }

%inline reverse_separated_llist(separator, X):
    { [] }
| xs = reverse_separated_nonempty_llist(separator, X)
    { xs }

%inline separated_llist(separator, X):
  xs = reverse_separated_llist(separator, X)
    { List.rev xs }

Technical remark.

The symmetry between the left- and right-recursive versions of separated lists (above) is not quite perfect, because one uses loption whereas the other does not. One could define reverse_separated_llist(separator, X) as loption(reverse_separated_nonempty_llist(separator, X)) but this causes a conflict to re-appear in the vararg example. One can eliminate this conflict by declaring %inline loption, but doing that in the standard library might break existing code.