Gagallium : A toy type language (1) parsing and pretty-printing

In this post I will present a simple language of type definitions, and use it as an example use-case for the Pprint library.

Last summer, Jonathan wrote a (draft) blog post demonstrating the use of François’ Fix library to compute the variance of recursive type declarations. We agreed on a couple of changes to make but made the grave error of assigning some of the work to me. It never got done, and the post stayed in the draft limbo for months. Until now!

In this first post, I will present a small language of type definitions, such as you could find in the implementation of a typed functional language, and demonstrate how to parse it, and pretty-print it in a convenient way. Variance definition and computation will come later.

A simple language of structural type definitions

Consider the following OCaml type declarations:

type 'a finger =
  | Two of 'a * 'a
  | Three of 'a * 'a * 'a

The reason why this is often called an “algebraic datatype” is that it can be expressed in terms of more atomic type constructors, one for “sum” (disjoint union) and one for “product” (pairs):

type ('a, 'b) sum =
  | L of 'a
  | R of 'b

type 'a finger = ('a * 'a, 'a * ('a * 'a)) sum

let two x y : _ finger = L (x, y)
let three x y z : _ finger = R (x, (y, z))

To experiment with a simple language, it is reasonable to provide builtin type constructors for sum and products, and let users define their own types from there, instead of a full-fledged mechanism of algebraic/inductive datatypes as conveniently provided by OCaml, Haskell, Coq, Agda etc. Here is a datatype representing concrete syntax for such type expressions in OCaml:

type typ =
  | Sum of typ * typ                  (* ty1 + ty2 *)
  | Prod of typ * typ                 (* ty1 * ty2 *)
  | Fun of typ * typ                  (* ty1 -> ty2 *)
  | Unit                              (* unit *)
  | Var of string                     (* a *)
  | App of string * typ list          (* foo ty1 ty2 ty3 *)

type defs =
  (string * string list * typ) list   (* type foo a b c = ...
                                         and bar a b = ...     *)

Variables are globally-unique strings. A series of mutually-recursive definitions is just a list of triples, made up of the name of type, the list of its parameters, and the corresponding definition.

This is a restricted type language, convenient for a ML or System-F language with equi-recursive structural types. In particular, there is no partial application of parameterized types, and type parameters have kind star. We plan to discuss variance in a simple setting, so we didn’t cover higher-kinded types as they would suggest polarized higher-kinded subtyping – going a bit too far for an introductory post.

It is also rather limited as a realistic description of type declarations: in a production language you want to track source locations for use in error messages, and you may chose a safer representation of variables bindings than just string. Note that, yes, parameterized types have a standard curried syntax instead of the ('a, 'b) bar that we all know and love – sometimes you’ve got to change your mind about stuff.

Here’s the definition for list:

let def_list =
  [ "list", ["a"], Sum (Unit, Prod (Var "a", App ("list", [Var "a"]))) ]

And here are some mutually recursive type declarations. It’s rather hard to find good examples in fully structural type systems, as you can often fold all recursive declarations into one, so this one is a bit arbitrary.

(*
   type t a b = (a -> b) + u (a * b) b
   and  u c d = t c c
*)
let def_other =
  [ "t", ["a"; "b"], Sum (Fun (Var "a", Var "b"), App ("u", [Prod (Var "a", Var "b"); Var "b"]));
    "u", ["c"; "d"], App ("t", [Var "c"; Var "c"]);
  ]

Parsing the input language

The sources for the whole prototype are available online, so I need not bore you with the lexer, which is as obvious as you would expect (not supporting lexically-tricky things such as strings with escapes or nested comments help). The tokens are as follows:

%token TYPE AND
%token ARROW STAR PLUS EQUAL
%token <string>IDENT
%token LPAREN RPAREN
%token EOF

The syntax for mutual definitions is slightly tricky as the first declaration must start with the terminal type, while the other should start with and. Menhir’s parameterized rules let us express this without code duplication:

mutual_def:
| head=def(TYPE) rest=list(def(AND)) { head::rest }
def(kwd):
| kwd constr=IDENT params=list(IDENT) EQUAL body=ty 
  { (constr, params, body)}

Then comes the parser rules for type expressions. This is the slightly tricky part, as we want to handle precedence and associativity: a + b -> c -> d must be parsed as (a + b) -> (c -> d) for example. Most parser generators have special features to assign precedence and associativity to terminals, and menhir is no exception. But we will here do without, by manually refactoring the grammar to express the desired rules; this grammar structure will be useful to guide the pretty-printing design later on.

The idea is to manually structure the ty terminal as a sequence of “levels” (each with its own terminal), each layer having lower precedence than the next. The lowest precedence level is the one for ->, and the next is the one for + (* has precedence over +, and type constructor application has precedence above all infix symbols). So we will have a non-terminal func for “type expressions at the level of the arrow or above”, and a sum for type expression at the level of the plus, or above. For instance, func is defined as:

| a=sum ARROW b=func { Fun(a,b) }
| ty=sum { ty }

An expression above the precedence level of functions either has an infix arrow, or is a sum (or above). If it has an infix arrow in the middle, it has a sum on the left, and a function on the right. This forced level increment on the left-hand-side prevents it from being itself an arrow (if it was a -> b, you could parse a -> b -> c as (a -> b) -> c). In other words, level increase on the operands determines operator associativity.

Let me give the grammar productions in full.

ty:
| ty=func { ty }
func:
| a=sum ARROW b=func { Fun(a,b) }
| ty=sum { ty }
sum:
| a=sum PLUS b=prod { Sum(a,b) }
| ty=prod { ty }
prod:
| a=prod STAR b=app { Prod(a,b) }
| ty=app { ty }
app:
| constr=IDENT args=nonempty_list(atom) { App(constr, args) }
| ty=atom { ty }
atom:
| id=IDENT { Var id }
| LPAREN RPAREN { Unit }
| LPAREN t=ty RPAREN { t }

Notice in particular how levels wrap-around inside parenthesized expressions at the atom level. The seemingly-useless indirection from ty (used in the parentheses case) to func allows adding a new lowest level without having to modify the rest of the code.

Pretty-printing type declarations

There are exactly two choices to make to pretty-print our types, one for binary operators and one for constructor applications. How should they pretty-printed when they don’t fit a single line and we need to add line breaks? For binary operators, common choices are:

a +
b

a +
  b

a
+ b

a
  + b

PPrint supports all of these, so it is your choice to make (and argue about with your users). I’ll personally pick the second form for now. This is done with the infix combinator declared as such:

(** [infix n b middle left right] has the following flat layout: {[
left middle right
]}
and the following non-flat layout: {[
left middle
  right
]}
The parameter [n] controls the nesting of [right] (when not flat).
The parameter [b] controls the number of spaces between [left] and [middle]
(always) and between [middle] and [right] (when flat).
*)
val infix: int -> int -> document -> document -> document -> document

This gives (asking for 2-space indentation):

open PPrint

(*
  a +
    b
*)
let doc_op symb ty1 ty2 = infix 2 1 symb ty1 ty2

Similarly, application is done with the prefix operator:

(*
  f
    x y z
*)
let doc_app f xs = prefix 2 1 f (separate (break 1) xs)

We can now pretty-print type expressions:

let doc_typ ty =
  let rec typ ty = func ty
  and func = function
    | Fun(a,b) -> doc_op (string "->") (sum a) (func b)
    | ty -> sum ty
  and sum = function
    | Sum (a, b) -> doc_op plus (sum a) (prod b)
    | ty -> prod ty
  and prod = function
    | Prod (a, b) -> doc_op star (prod a) (app b)
    | ty -> (app ty)
  and app = function
    | App(constr, args) -> doc_app (string constr) (List.map atom args)
    | ty -> atom ty
  and atom = function
    | Var id -> string id
    | Unit -> string "()"
    | (Fun _ | Sum _ | Prod _ | App _) as ty ->
      group (parens (typ ty))
  in typ ty

Notice how the structure of explicit precedence levels in the grammar is reproduced in this pretty-printer. This ensures that parenthesis are inserted as needed, so that for example Prod (Sum (Var "a", Var "b), Var "c) is pretty-printed (a + b) * c and not the wrong a + b * c. The sum happens as the left-hand-side of a product; it is pretty-printed by calling prod (rather than typ that would print it directly without parens), which doesn’t handle it and falls back to app, etc., all the way to the last clause of the atom printer, which inserts parentheses before falling down to the base precedence level.

The rest of the printer is fairly standard:

let doc_def kwd (name, params, body) =
  let params = List.map string params in
  prefix 2 1
    (separate space [string kwd; doc_app (string name) params; equals])
    (doc_typ body)

let doc_defs = function
  | [] -> empty
  | def::defs ->
    separate (break 1)
      (doc_def "type" def :: List.map (doc_def "and") defs)

let print ?(len=60) doc =
  ToChannel.pretty 1. len stdout doc

And we can directly test this:

# print (doc_defs def_other); print_newline ();;
type t a b = (a -> b) + u (a * b) b
and u c d = t c c
- : unit = ()

You can find the complete, compilable source code at this address:

https://gitorious.org/gasche-snippets/tytoy/

PS: I’m interested in the tight correspondence between the way precedence and associativity are handler in the parser and in the printer. It is not just a coincidence of this blog post. For example, the dypgen parser generator has a very different way to handle precedence levels (it allows to declare abstract levels, impose an ordering on them, and tag recursive references with a level constraint), which also corresponds to a different way to handle parentheses while pretty-printing, keeping a numerical level of the pretty-printed expressions, and inserting parenthesis when a high-level operator has a low-level operand.

I initially implemented parenthesis pretty-printing this way, and Jonathan suggested I used the “pure refactoring” route instead, which is indeed simpler. I’ve also discussed the matter with Kaustuv Chaudhuri (which has a bit of code of its own), and there is an interesting formal treatment of the question by Norman Ramsey.