Left-recursive versus right-recursive lists in LR parsers
- January 21, 2015
- Last updated on 2015/01/22
Contrary to top-down (LL) parsers, which do not support left recursion, bottom-up (LR) parsers support both left recursion and right recursion. When defining a list-like construct, a user of an LR parser generator, such as Menhir, faces a choice between these two flavors. Which one should she prefer?
Two considerations guide this choice: expressiveness (which flavor leads to fewer conflicts?) and performance (which flavor leads to a more efficient parser?).
In this post, I am mainly interested in discussing expressiveness. I also comment on performance in the setting of Menhir.
As we will see, the bottom line is that neither formulation seems deeply preferable to the other.
An example where left recursion is preferable
This example is inspired by the syntax of C variable-argument functions. Suppose we wish to parse a list of arguments separated with commas. Furthermore, after this list, we wish to allow an optional ellipsis.
arguments:
separated_list(COMMA, arg) vararg
vararg:
/* epsilon */
| COMMA ELLIPSIS
There is a potential problem because the token COMMA
appears both as a delimiter within separated_list
and as
the first symbol of vararg
.
If one uses a left-recursive definition of
separated_list
, everything is fine (for reference, the code
for left- and right-recursive definitions is listed at the end of this post).
When an LR(1) parser encounters COMMA
, it shifts, which
means that it explores in parallel two possibilities: either this is the
beginning of a new list element, or this is the beginning of a
vararg
. Looking at the next token, which should be either
the beginning of an arg
or ELLIPSIS
, resolves
this choice, and all is well.
If one uses a right-recursive definition of
separated_list
, then a conflict appears. When an LR(1)
parser encounters the very first COMMA
, it must immediately
decide whether this COMMA
announces the continuation of the
list (in which case it should shift) or is the beginning of a
vararg
(in which case it should reduce what it has read up
to this point to a list).
This example has been encountered in real life by a user of Menhir, and is also described online (Practical Considerations for LALR(1) Grammars). The author says: “left recursion is good, right recursion is bad”, and indeed this example seems to support this motto. However, there also situations where the converse is true, as shown by the next example.
An example where right recursion is preferable
Suppose we (again) wish to parse a list whose elements are separated with commas. Furthermore, we wish to allow an optional trailing comma to appear at the end of the list.
There are several ways of expressing this. An interesting view is this: either the comma is a separator, or it is a terminator. Thus, one writes:
arguments:
LPAREN separated_or_terminated_list(COMMA, arg) RPAREN
separated_or_terminated_list(sep, elem):
nonempty_list(terminated(elem, sep))
| separated_nonempty_list(sep, elem)
terminated(elem, sep):
elem sep
Perhaps surprisingly, if one uses right-recursive definitions of
nonempty_list
and separated_nonempty_list
,
everything is fine. An LR(1) parser reads the whole list without ever
reducing. It shifts all the way through, which means that it explores in
parallel the two possibilities: either this is a terminated list, or
this is a separated list. At the very end, when it finds either
COMMA RPAREN
or just RPAREN
, it decides
between the two alternatives, and performs all of the required
reductions.
If one uses left-recursive definitions of nonempty_list
and separated_nonempty_list
, a conflict appears. When an
LR(1) parser encounters the very first COMMA
, it must
immediately decide whether this COMMA
is part of the first
element in a terminated list (in which case it should shift) or is the
first separator in a separated list (in which case it should reduce what
it has so read so far to a separated list).
Moral of the story
The idea that “left recursion is good, right recursion is bad” is a myth, at least in terms of conflicts. There are ways of working with left recursion and ways of working with right recursion; they are just not the same.
A few words on flexible lists
So, what is the right way of describing a flexible list, where the final delimiter is optional?
One way is to describe it as a separated list, followed with an optional delimiter. In this case, a left-recursive version of separated lists must be used, for the reasons explained in the first part of this post. Here it is in the syntax of Menhir:
%inline flexible_list(delim, X):
xs = separated_llist(delim, X) delim?
{ xs }
The definition of separated_llist
is given at the end of
this post.
Another way is to give a direct recursive definition, with two base cases and a single recursive case. In this case, it seems that we do not have a choice: this definition is naturally right-recursive.
right_flexible_list(delim, X):
| (* nothing *)
{ [] }
| x = X
{ [x] }
| x = X delim xs = right_flexible_list(delim, X)
{ x :: xs }
In the end, I like this definition, because I think it is the simplest and most direct. It leads to a smaller LR automaton and (potentially) to simpler syntax error messages.
Symmetrically, one can define a left_flexible_list
where
delimiters precede elements and where the first delimiter is optional.
The various ways of doing this are left as an exercise for the
reader!
A few words on performance.
The GNU Bison manual says “you should always use left recursion, because it can parse a sequence of any number of elements with bounded stack space”.
In Menhir, the LR stack is heap-allocated, so there is no hard limit on its size. If one wishes to parse a list on file and construct a list in memory, where the first list element on file becomes the first list element in memory, then both approaches are equivalent. In the right-recursive approach, the parser pushes O(n) elements onto the stack, then pops them off and constructs the desired list. In the left-recursive approach, the parser uses only O(1) stack space but constructs a reversed list, which takes up O(n) space and must be reversed afterwards.
The only two scenarios where left recursion may be more efficient is if one wishes to construct a reversed list (in which case left recursion allows saving a constant factor) or if one wishes to process the list elements on the fly, without allocating a list at all (in which case left recursion allows working in space O(1) instead of O(n)).
Code for separated lists (for reference)
Here is a right-recursive version of separated lists, as currently found in Menhir’s standard library:
%inline separated_list(separator, X):
xs = loption(separated_nonempty_list(separator, X))
{ xs }
loption(X):
/* nothing */
{ [] }
| x = X
{ x }
separated_nonempty_list(separator, X):
x = X
{ [ x ] }
| x = X; separator; xs = separated_nonempty_list(separator, X)
{ x :: xs }
Here is a left-recursive version of separated lists:
reverse_separated_nonempty_llist(separator, X):
x = X
{ [ x ] }
| xs = reverse_separated_nonempty_llist(separator, X); separator; x = X
{ x :: xs }
%inline reverse_separated_llist(separator, X):
{ [] }
| xs = reverse_separated_nonempty_llist(separator, X)
{ xs }
%inline separated_llist(separator, X):
xs = reverse_separated_llist(separator, X)
{ List.rev xs }
Technical remark.
The symmetry between the left- and right-recursive versions of
separated lists (above) is not quite perfect, because one uses
loption
whereas the other does not. One could define
reverse_separated_llist(separator, X)
as
loption(reverse_separated_nonempty_llist(separator, X))
but
this causes a conflict to re-appear in the vararg
example.
One can eliminate this conflict by declaring
%inline loption
, but doing that in the standard library
might break existing code.