Gagallium : The 6 parameters of (’a, ’b, ’c, ’d, ’e, ’f) format6

The infamous format6 type is the basis of the hackish but damn convenient, type-safe way in which OCaml handles format strings:

Printf.printf "%d) %s -> %f\n"
  3 "foo" 5.12

The first argument of printf in this example is not a string, but a format string (they share the same literal syntax, but have different type, and there is a small hack in the type-inference engine to make this possible). It’s type, which you can get by asking ("%d) %s -> %f\n" : (_,_,_,_,_,_) format6) in a toplevel, is a mouthful (and I renamed the variables for better readability):

(int -> string -> float -> 'f, 'b, 'c, 'd, 'd, 'f) format6

What do those six arcane parameters mean? In the process of reviewing Benoît Vaugon’s work on using GADTs to re-implement format functions, I’ve finally felt the need (after years of delighted oblivion) to understand each of those parameters. And that came after several days of hacking on easier-to-understand ten-parameters GADT functions; hopefully most of our readers will never need this information, but the probability that I need to refresh my memory in the future is strong enough for me to write this blog post.

The main take-away is that this type slowly evolved, from a more reasonable ('a, 'b, 'c) format type, to the monster that lurks in the documentation today. Advanced features of format strings needed more and more parameters to carry their type information. It is possible to implement format strings with less features and simpler types, and it may be possible to implement the same format strings with types that are easier to understand, but I need to understand the interface as it exists.

I built my own understanding, not by reading the documentation (though I tried that first), but by imagining layers of features, from the simple to the advanced. This is this post’s approach. It does necessarily correspond to the actual chronological evolution of format types – Pierre Weis would know about that.

At the beginning were `printf` and `sprintf`

# Printf.printf "%d) %s -> %f\n";;
- : int -> string -> float -> unit = <fun>
# Printf.sprintf "%d) %s -> %f\n";;
- : int -> string -> float -> string = <fun>

Those two functions differ by their return type: one prints and returns nothing (unit), the other returns a string. This parameter is stored in the sixth-parameter, 'f, and is called “result type for the printf-like functions” in the documentation.

Of course, the format string does not need to impose any particular value to this parameter: a given format string can be used by a function with any possible result type, hence the undetermined variable 'f in the format type above. It is the printf functions that force a precise instantiation of the parameter: Printf.printf has a type of the form ('a, ..., unit) format6 -> 'a.

The type 'a mentioned in the type of printf is a large function type, with one parameter for each conversion specification (%d, %s, etc.), and whose return type is enforced by the format to be equal to the last parameter of the format. In the case of "%d) %s -> %f\n", this type is int -> string -> float -> 'a, where 'a is the same variable present in the last parameter of the format – the result type. This type depends only on the format string, not at all on the printing (or scanning) function used.

To summarize, in ('a, 'b, 'c, 'd, 'e, 'f) format6, we’ve seen that:

'f is what you get after you pass all the parameters to a printf-like function
'a is the big function type that expects all those parameters and returns 'f

`'b` and `'c` for user-defined printers:

You may know about %a and the lesser-known (but actually simpler) %t, used as follow:

Printf.sprintf "%d) %a -> %f"
  3
  print_foo foo
  6.23

The idea is that instead of converting the value foo to a string first, you can, at printf time, pass a “user-defined printer” print_foo as an extra argument that tells how to convert foo into a string. This is easy enough to handle typing-wise: the format %a adds two parameters to the big function type, instead of just one.

In a non-existing ideal world, print_foo used in sprintf would have the convenient type foo -> string. However, a single %a must work with many printing function: for sprintf, the user-defined should return a string, for printf it should just print and return unit, with fprintf it needs to be passed the channel in which we’re printing, or with bprintf the target buffer. So this determines two type parameters in ('a, 'b, 'c, 'd, 'e, 'f) format6:

'b is the type of the extra information that user-defined printers may need (Buffer.t for bprintf, output_channel for fprintf, etc.)
'c is the return type of the user-defined printers accepted by the function (string for sprintf, unit for most others).

There are two remarks. First, it happens that printf use out_channel as second parameter, while you would probably expect unit. This is strange API design, but the rationale is to let you reuse your user-defined printers for both printf and fprintf.

Second, you may remark that the 'c parameter looks suspiciously like “the result type of printf-lie functions”, 'f. Indeed, all current printf-like functions use, to my knowledge, the same type for 'f and 'c. But it would be very reasonable, for performance reasons, to design a different version of sprintf that returns a string, but whose user-defined sub-printers put result in a Buffer.t parameter and thus just return unit.

Scanf is just a continuation away

Consider the difference between printing and scanning:

Printf.printf "%d) %s %f"
  3 "foo" 7.23;;

Scanf.scanf "%d) %s %f"
  (fun num name value -> ....)

When printing with “%d) %s %f”, the user has to provide an int parameter, a string parameter, and a float parameter, hence the type of the form int -> string -> float -> .... With scanning, you have to give them to the user, which looks like a fairly different activity. How can you reuse the same typing of the format string?

Fortunately, expressing scanning with a continuation (the function (fun num name value -> ...)) allows precisely to reuse a type of the form int -> string -> float -> ... to describe scanning: just say that for any type 'f, if the user gives a int -> string -> float -> 'f, scanf will return a 'f.

In a simple world, the type of scanf would thus be of the following form:

('a, ..., 'f) format6 -> 'a -> 'f

(where it is understood that for any format string, 'a will actually be of the form t1 -> t2 -> ... -> 'f)

(Alternate design idea: have a new type parameter that stores, instead of a big arrow type, the product of all parameter types (int * string * float in our example), and make scanf return this directly instead of using a continuation-passing style. The present style has the property of being easier to implement with GADTs, but that probably wasn’t a design consideration.)

In fact, the typing of scanf is a bit more complicated, as explained in the next section.

`'d` and `'e` for user-defined readers

Just as user-defined printers, to write formats in a more high-level style it is possible to provide user-defined scanners with %r:

let read_foo scanbuf = ...

Scanf.scanf "%d) %r -> %f"
  read_foo
  (fun n foo x -> ...)

Now the typing implications of this for format strings are a bit tricky. As we read three values of type int, foo and float, the naive typing of scanf proposed earlier

('a, ..., 'f) -> ('a -> 'f)

would specialize to, in this instance:

(int -> foo -> float -> 'f, ..., 'f) format6
 -> ((int -> foo -> float -> 'f) -> 'f)

which is not what we want, as there is no room for the extra read_foo parameter.

The argument expected by scanf after the format string is not just the continuation anymore, we need an extra argument for each user-defined reader. Yet those extra arguments should not appear in the type of the continuation.

What we need to do is to express the relation between the simple type 'a -> 'f (from continuation to result), which is the return type of scanf fmt in absence of user-defined readers, and the real return type of scanf. This is what our last two parameters do:

the second-last parameter, 'e, is forced by scanf functions to be equal to 'a -> 'f
the parameter 'd is the actual type of scanf fmt
the typing of format strings will generate constraints on the relation between 'd and 'e, depending on the number of %r present in the format; without any %r they are equal.

Consider our running example without any %r:

# ( "%d) %s -> %f" : (_, _, _, _, _, _) format6 );;
- : (int -> string -> float -> 'f,
     'b, 'c, 'd, 'd, 'f) format6

The two parameters I’m talking about are constrained to be equal to the same variables 'd. On the contrary, if we use %r instead of %s:

# ( "%d) %r -> %f" : (_,_,_,_,_,_) format6 );;
- : (int -> 'x -> float -> 'f,
     'b, 'c, ('b -> 'x) -> 'e, 'e, 'f) format6

Suddenly they are distinct, with one equal to the unknown 'e (that will be forced to equal (... -> 'f) -> 'f when applying scanf), and the other is ('b -> 'x) -> 'e: it adds on top of 'e an extra parameter (the user-defined format) which must return a 'x (if I pass the reader read_foo, this will be equated to foo) and takes as input a 'b, which is the type of the extra parameter of user-defined printers or readers (scanf will set it to Scanning.in_channel).

Benoît’s GADT declarations are surprisingly efficient at showing you precisely what happens to the type parameters when you add a new conversion description to an existing format string. In particular, the constructor for user-defined readers reads as follow:

| Reader :                                                 (* %r *)
    ('a, 'b, 'c, 'd, 'e, 'f) fmt ->
      ('x -> 'a, 'b, 'c, ('b -> 'x) -> 'd, 'e, 'f) fmt

You can very well see that an argument of type 'x (the result type of the user-defined result) is added to the parameter 'a (the type of the continuation finally passed to scanf), while an argument of type ('b -> 'x) is added to 'd (the type of the arguments following the format string).

PS: Extra credit for the bonus question: why do I use (fmt : (_, _, _, _, _, _) format6) instead of the more convenient to write format_of_string fmt?

At the beginning were printf and sprintf

'b and 'c for user-defined printers:

Scanf is just a continuation away

'd and 'e for user-defined readers

At the beginning were `printf` and `sprintf`

`'b` and `'c` for user-defined printers:

`'d` and `'e` for user-defined readers