A toy type language (1) parsing and pretty-printing
- January 14, 2014
- Last updated on 2014/05/01
In this post I will present a simple language of type definitions, and use it as an example use-case for the Pprint library.
Last summer, Jonathan wrote a (draft) blog post demonstrating the use of François’ Fix library to compute the variance of recursive type declarations. We agreed on a couple of changes to make but made the grave error of assigning some of the work to me. It never got done, and the post stayed in the draft limbo for months. Until now!
In this first post, I will present a small language of type definitions, such as you could find in the implementation of a typed functional language, and demonstrate how to parse it, and pretty-print it in a convenient way. Variance definition and computation will come later.
A simple language of structural type definitions
Consider the following OCaml type declarations:
type 'a finger =
| Two of 'a * 'a
| Three of 'a * 'a * 'a
The reason why this is often called an “algebraic datatype” is that it can be expressed in terms of more atomic type constructors, one for “sum” (disjoint union) and one for “product” (pairs):
type ('a, 'b) sum =
| L of 'a
| R of 'b
type 'a finger = ('a * 'a, 'a * ('a * 'a)) sum
let two x y : _ finger = L (x, y)
let three x y z : _ finger = R (x, (y, z))
To experiment with a simple language, it is reasonable to provide builtin type constructors for sum and products, and let users define their own types from there, instead of a full-fledged mechanism of algebraic/inductive datatypes as conveniently provided by OCaml, Haskell, Coq, Agda etc. Here is a datatype representing concrete syntax for such type expressions in OCaml:
type typ =
| Sum of typ * typ (* ty1 + ty2 *)
| Prod of typ * typ (* ty1 * ty2 *)
| Fun of typ * typ (* ty1 -> ty2 *)
| Unit (* unit *)
| Var of string (* a *)
| App of string * typ list (* foo ty1 ty2 ty3 *)
type defs =
(string * string list * typ) list (* type foo a b c = ...
and bar a b = ... *)
Variables are globally-unique strings. A series of mutually-recursive definitions is just a list of triples, made up of the name of type, the list of its parameters, and the corresponding definition.
This is a restricted type language, convenient for a ML or System-F language with equi-recursive structural types. In particular, there is no partial application of parameterized types, and type parameters have kind star. We plan to discuss variance in a simple setting, so we didn’t cover higher-kinded types as they would suggest polarized higher-kinded subtyping – going a bit too far for an introductory post.
It is also rather limited as a realistic description of type
declarations: in a production language you want to track source
locations for use in error messages, and you may chose a safer
representation of variables bindings than just string
. Note
that, yes, parameterized types have a standard curried syntax instead of
the ('a, 'b) bar
that we all know and love – sometimes
you’ve got to change your mind about stuff.
Here’s the definition for list
:
let def_list =
[ "list", ["a"], Sum (Unit, Prod (Var "a", App ("list", [Var "a"]))) ]
And here are some mutually recursive type declarations. It’s rather hard to find good examples in fully structural type systems, as you can often fold all recursive declarations into one, so this one is a bit arbitrary.
(*
type t a b = (a -> b) + u (a * b) b
and u c d = t c c
*)
let def_other =
[ "t", ["a"; "b"], Sum (Fun (Var "a", Var "b"), App ("u", [Prod (Var "a", Var "b"); Var "b"]));
"u", ["c"; "d"], App ("t", [Var "c"; Var "c"]);
]
Parsing the input language
The sources for the whole prototype are available online, so I need not bore you with the lexer, which is as obvious as you would expect (not supporting lexically-tricky things such as strings with escapes or nested comments help). The tokens are as follows:
%token TYPE AND
%token ARROW STAR PLUS EQUAL
%token <string>IDENT
%token LPAREN RPAREN
%token EOF
The syntax for mutual definitions is slightly tricky as the first
declaration must start with the terminal type
, while the
other should start with and
. Menhir’s parameterized rules
let us express this without code duplication:
mutual_def:
| head=def(TYPE) rest=list(def(AND)) { head::rest }
def(kwd):
| kwd constr=IDENT params=list(IDENT) EQUAL body=ty
{ (constr, params, body)}
Then comes the parser rules for type expressions. This is the
slightly tricky part, as we want to handle precedence and associativity:
a + b -> c -> d
must be parsed as
(a + b) -> (c -> d)
for example. Most parser
generators have special features to assign precedence and associativity
to terminals, and menhir is no exception. But we will here do without,
by manually refactoring the grammar to express the desired rules; this
grammar structure will be useful to guide the pretty-printing design
later on.
The idea is to manually structure the ty
terminal as a
sequence of “levels” (each with its own terminal), each layer having
lower precedence than the next. The lowest precedence level is the one
for ->
, and the next is the one for +
(*
has precedence over +
, and type constructor
application has precedence above all infix symbols). So we will have a
non-terminal func
for “type expressions at the level of the
arrow or above”, and a sum
for type expression at the level
of the plus, or above. For instance, func
is defined
as:
| a=sum ARROW b=func { Fun(a,b) }
| ty=sum { ty }
An expression above the precedence level of functions either has an
infix arrow, or is a sum (or above). If it has an infix arrow in the
middle, it has a sum on the left, and a function on the right. This
forced level increment on the left-hand-side prevents it from being
itself an arrow (if it was a -> b
, you could parse
a -> b -> c
as (a -> b) -> c
). In
other words, level increase on the operands determines operator
associativity.
Let me give the grammar productions in full.
ty:
| ty=func { ty }
func:
| a=sum ARROW b=func { Fun(a,b) }
| ty=sum { ty }
sum:
| a=sum PLUS b=prod { Sum(a,b) }
| ty=prod { ty }
prod:
| a=prod STAR b=app { Prod(a,b) }
| ty=app { ty }
app:
| constr=IDENT args=nonempty_list(atom) { App(constr, args) }
| ty=atom { ty }
atom:
| id=IDENT { Var id }
| LPAREN RPAREN { Unit }
| LPAREN t=ty RPAREN { t }
Notice in particular how levels wrap-around inside parenthesized
expressions at the atom
level. The seemingly-useless
indirection from ty
(used in the parentheses case) to
func
allows adding a new lowest level without having to
modify the rest of the code.
Pretty-printing type declarations
There are exactly two choices to make to pretty-print our types, one for binary operators and one for constructor applications. How should they pretty-printed when they don’t fit a single line and we need to add line breaks? For binary operators, common choices are:
a +
b
a +
b
a
+ b
a
+ b
PPrint
supports all of these, so it is your choice to make (and argue about
with your users). I’ll personally pick the second form for now. This is
done with the infix
combinator declared as such:
(** [infix n b middle left right] has the following flat layout: {[ left middle right ]} and the following non-flat layout: {[ left middle right ]} The parameter [n] controls the nesting of [right] (when not flat). The parameter [b] controls the number of spaces between [left] and [middle] (always) and between [middle] and [right] (when flat). *) val infix: int -> int -> document -> document -> document -> document
This gives (asking for 2-space indentation):
open PPrint
(*
a +
b
*)
let doc_op symb ty1 ty2 = infix 2 1 symb ty1 ty2
Similarly, application is done with the prefix
operator:
(*
f
x y z
*)
let doc_app f xs = prefix 2 1 f (separate (break 1) xs)
We can now pretty-print type expressions:
let doc_typ ty =
let rec typ ty = func ty
and func = function
| Fun(a,b) -> doc_op (string "->") (sum a) (func b)
| ty -> sum ty
and sum = function
| Sum (a, b) -> doc_op plus (sum a) (prod b)
| ty -> prod ty
and prod = function
| Prod (a, b) -> doc_op star (prod a) (app b)
| ty -> (app ty)
and app = function
| App(constr, args) -> doc_app (string constr) (List.map atom args)
| ty -> atom ty
and atom = function
| Var id -> string id
| Unit -> string "()"
| (Fun _ | Sum _ | Prod _ | App _) as ty ->
group (parens (typ ty))
in typ ty
Notice how the structure of explicit precedence levels in the grammar
is reproduced in this pretty-printer. This ensures that parenthesis are
inserted as needed, so that for example
Prod (Sum (Var "a", Var "b), Var "c)
is pretty-printed
(a + b) * c
and not the wrong a + b * c
. The
sum happens as the left-hand-side of a product; it is pretty-printed by
calling prod
(rather than typ
that would print
it directly without parens), which doesn’t handle it and falls back to
app
, etc., all the way to the last clause of the
atom
printer, which inserts parentheses before falling down
to the base precedence level.
The rest of the printer is fairly standard:
let doc_def kwd (name, params, body) =
let params = List.map string params in
prefix 2 1
(separate space [string kwd; doc_app (string name) params; equals])
(doc_typ body)
let doc_defs = function
| [] -> empty
| def::defs ->
separate (break 1)
(doc_def "type" def :: List.map (doc_def "and") defs)
let print ?(len=60) doc =
ToChannel.pretty 1. len stdout doc
And we can directly test this:
# print (doc_defs def_other); print_newline ();;
type t a b = (a -> b) + u (a * b) b
and u c d = t c c
- : unit = ()
You can find the complete, compilable source code at this address:
PS: I’m interested in the tight correspondence between the way precedence and associativity are handler in the parser and in the printer. It is not just a coincidence of this blog post. For example, the dypgen parser generator has a very different way to handle precedence levels (it allows to declare abstract levels, impose an ordering on them, and tag recursive references with a level constraint), which also corresponds to a different way to handle parentheses while pretty-printing, keeping a numerical level of the pretty-printed expressions, and inserting parenthesis when a high-level operator has a low-level operand.
I initially implemented parenthesis pretty-printing this way, and Jonathan suggested I used the “pure refactoring” route instead, which is indeed simpler. I’ve also discussed the matter with Kaustuv Chaudhuri (which has a bit of code of its own), and there is an interesting formal treatment of the question by Norman Ramsey.