Gagallium : Verifying a parser for a C compiler (continued)

In my last post, I explained my work on verifying a parser for Compcert. I was mainly explaining the solutions I found to my performances problems. Here, I will discuss an other interesting problem I have, concerning the context sensitivity of the C99 grammar.

An alternative to the lexer hack

The lexer hack is the typical solution to the problem of context sensitivity of the C grammar outlined in the previous blog article and explained more in depth in this external blog article. It cannot be implemented in our parser, because it uses some side effect in the semantic actions of the grammar in order to give some information to the lexer, and, as you might know, side effects are forbidden in Coq.

Any Haskell programmer would say that the solution is to use a state monad, but both the semantic actions described in the grammar and the lexer would need to compute in this monad, in order to communicate with each other. Then, it becomes very hard to specify what is a valid semantic value for a token sequence, because one has to specify how the lexer and the parser semantic actions interleaves. This becomes a real nightmare when you allow the parser to ask the lexer for a lookahead token before deciding to call some semantic action...

So it is necessary to find another approach. After some thoughts, I remembered the parsing approach of mixfix operators in the Agda programming language, advertized in Gallium by our own Agda expert Nicolas Pouillard. Mixfix operators are syntactic constructions that can be created dynamically: this is comparable to the Notation command of Coq. The way they are handled in Agda is described in this paper. Roughly, the parser only analyzes the main structure of the file. The expressions are not analyzed at this point: they are left at the state of list of tokens. At a later stage of processing, an other module uses the extra context information it knows (notably the mixfix operators available) to parse expressions. This new information is obtained from the first parsing phase, which brings enough structure to go on.

Here, the situation is very similar: we need some more context information (whether a given identifier is a type or a variable name) in order to parse expressions and declarations. We can use a similar technique: if the parser is not able to parse those, then it should store them as a list of tokens in the abstract syntax tree. A second pass in then necessary to parse those "chunks" correctly. Recursively, this second pass can delay parsing for smaller chunks, and so on.

Obviously, I have omitted some details here. First, it requires some serious engineering on the C99 grammar in order to separate it into a "safe" part, where the extra information given by the context is not needed, and a "dangerous" part, where the parser needs that the two different flavors of identifiers (type names and variable names) are represented by two different tokens. Second, recognizing the end of these chunks without actually parsing them requires recognizing some patterns, typically like matching parentheses.

Avoiding polluting the AST with chunks

Storing chunks in the abstract syntax tree has a major drawback: it requires adding some constructors in its types for those chunks, which do not have any real sense in the language. The solution I use instead is to make all my semantic actions compute in a state monad, which passes the contextual environment information: while parsing, the semantic actions build the syntax tree in the form of a tree of closures. Once something has been parsed, the caller gives the initial environment as a parameter to the head closure, and the environment information propagates along the tree.

When there is a chunk in some node of the AST, the corresponding semantic action first uses the environment information to convert identifier tokens to their right flavor, then recursively parses the chunk, and finally gives the current environment to the returned semantic value, so that it can itself evaluate.

This monad is a bit more complicated than a simple state monad, because it also handles error (nested parsing can fail), and open recursion (the parser has to call itself, when it is not already defined).

Conclusion

Here is a working version of the parser, using those ideas. A bit surprisingly, using all this monad machinery and delaying parsing does not decrease the performances too much. I have still some troubles proving it, and I'm looking forward to ideas to state all the invariants in a not-too-verbose way.

This work is still in progress, so I would be happy if you could give me your comments or ideas. But, in any case, I am now convinced that it is really possible to have a formalization of a parser for the C language, and particularly for the lexer hack. And this is a good thing, because, as its name suggest, the lexer hack is a hack, whose correctness is far from clear.

I have to say that, even if this work of proving a parser for a C compiler is sometimes quite tiresome, it allowed me to learn things in many different domains, from designing powerful automaton tactics in Coq to OCaml code optimization and LR(1) hackery: it's been a very challenging and rewarding experience.

If I wanted to work on that again in the future, there is still many things one would want to prove: does the parser terminates on all inputs? Are we sure it does not ask for a lookahead token after the end of the stream (that could be a problem in interactive languages, because it could block the input)?