The 6 parameters of (’a, ’b, ’c, ’d, ’e, ’f) format6
- April 18, 2014
The infamous format6 type is the basis of the hackish
but damn convenient, type-safe way in which OCaml handles
format strings:
Printf.printf "%d) %s -> %f\n"
3 "foo" 5.12
The first argument of printf in this example is not a
string, but a format string (they share the same literal syntax, but
have different type, and there is a small hack in the type-inference
engine to make this possible). It’s type, which you can get by asking
("%d) %s -> %f\n" : (_,_,_,_,_,_) format6) in a
toplevel, is a mouthful (and I renamed the variables for better
readability):
(int -> string -> float -> 'f, 'b, 'c, 'd, 'd, 'f) format6
What do those six arcane parameters mean? In the process of reviewing Benoît Vaugon’s work on using GADTs to re-implement format functions, I’ve finally felt the need (after years of delighted oblivion) to understand each of those parameters. And that came after several days of hacking on easier-to-understand ten-parameters GADT functions; hopefully most of our readers will never need this information, but the probability that I need to refresh my memory in the future is strong enough for me to write this blog post.
The main take-away is that this type slowly evolved, from a more
reasonable ('a, 'b, 'c) format type, to the monster that
lurks in the documentation today. Advanced features of format strings
needed more and more parameters to carry their type information. It is
possible to implement format strings with less features and simpler
types, and it may be possible to implement the same format strings with
types that are easier to understand, but I need to understand the
interface as it exists.
I built my own understanding, not by reading the documentation (though I tried that first), but by imagining layers of features, from the simple to the advanced. This is this post’s approach. It does necessarily correspond to the actual chronological evolution of format types – Pierre Weis would know about that.
At the beginning were
printf and sprintf
# Printf.printf "%d) %s -> %f\n";;
- : int -> string -> float -> unit = <fun>
# Printf.sprintf "%d) %s -> %f\n";;
- : int -> string -> float -> string = <fun>
Those two functions differ by their return type: one prints and
returns nothing (unit), the other returns a
string. This parameter is stored in the sixth-parameter,
'f, and is called “result type for the printf-like
functions” in the documentation.
Of course, the format string does not need to impose any particular
value to this parameter: a given format string can be used by a function
with any possible result type, hence the undetermined variable
'f in the format type above. It is the printf
functions that force a precise instantiation of the parameter:
Printf.printf has a type of the form
('a, ..., unit) format6 -> 'a.
The type 'a mentioned in the type of printf
is a large function type, with one parameter for each conversion
specification (%d, %s, etc.), and whose return
type is enforced by the format to be equal to the last parameter of the
format. In the case of "%d) %s -> %f\n", this type is
int -> string -> float -> 'a, where
'a is the same variable present in the last parameter of
the format – the result type. This type depends only on the format
string, not at all on the printing (or scanning) function used.
To summarize, in ('a, 'b, 'c, 'd, 'e, 'f) format6, we’ve
seen that:
'fis what you get after you pass all the parameters to a printf-like function'ais the big function type that expects all those parameters and returns'f
'b and
'c for user-defined printers:
You may know about %a and the lesser-known (but actually
simpler) %t, used as follow:
Printf.sprintf "%d) %a -> %f"
3
print_foo foo
6.23
The idea is that instead of converting the value foo to
a string first, you can, at printf time, pass a “user-defined printer”
print_foo as an extra argument that tells how to convert
foo into a string. This is easy enough to handle
typing-wise: the format %a adds two parameters to the big function type,
instead of just one.
In a non-existing ideal world, print_foo used in
sprintf would have the convenient type
foo -> string. However, a single %a must
work with many printing function: for sprintf, the
user-defined should return a string, for printf it should
just print and return unit, with fprintf it
needs to be passed the channel in which we’re printing, or with
bprintf the target buffer. So this determines two type
parameters in ('a, 'b, 'c, 'd, 'e, 'f) format6:
'bis the type of the extra information that user-defined printers may need (Buffer.tforbprintf,output_channelforfprintf, etc.)'cis the return type of the user-defined printers accepted by the function (stringforsprintf,unitfor most others).
There are two remarks. First, it happens that printf use
out_channel as second parameter, while you would probably
expect unit. This is strange API design, but the rationale
is to let you reuse your user-defined printers for both
printf and fprintf.
Second, you may remark that the 'c parameter looks
suspiciously like “the result type of printf-lie functions”,
'f. Indeed, all current printf-like functions
use, to my knowledge, the same type for 'f and
'c. But it would be very reasonable, for performance
reasons, to design a different version of sprintf that
returns a string, but whose user-defined sub-printers put result in a
Buffer.t parameter and thus just return
unit.
Scanf is just a continuation away
Consider the difference between printing and scanning:
Printf.printf "%d) %s %f"
3 "foo" 7.23;;
Scanf.scanf "%d) %s %f"
(fun num name value -> ....)
When printing with “%d) %s %f”, the user has to provide an int
parameter, a string parameter, and a float parameter, hence the type of
the form int -> string -> float -> .... With
scanning, you have to give them to the user, which looks like a
fairly different activity. How can you reuse the same typing of the
format string?
Fortunately, expressing scanning with a continuation (the function
(fun num name value -> ...)) allows precisely to reuse a
type of the form int -> string -> float -> ... to
describe scanning: just say that for any type 'f, if the
user gives a int -> string -> float -> 'f,
scanf will return a 'f.
In a simple world, the type of scanf would thus be of
the following form:
('a, ..., 'f) format6 -> 'a -> 'f
(where it is understood that for any format string, 'a
will actually be of the form
t1 -> t2 -> ... -> 'f)
(Alternate design idea: have a new type parameter that stores,
instead of a big arrow type, the product of all parameter types
(int * string * float in our example), and make
scanf return this directly instead of using a
continuation-passing style. The present style has the property of being
easier to implement with GADTs, but that probably wasn’t a design
consideration.)
In fact, the typing of scanf is a bit more complicated,
as explained in the next section.
'd and
'e for user-defined readers
Just as user-defined printers, to write formats in a more high-level
style it is possible to provide user-defined scanners with
%r:
let read_foo scanbuf = ...
Scanf.scanf "%d) %r -> %f"
read_foo
(fun n foo x -> ...)
Now the typing implications of this for format strings are a bit
tricky. As we read three values of type int,
foo and float, the naive typing of
scanf proposed earlier
('a, ..., 'f) -> ('a -> 'f)
would specialize to, in this instance:
(int -> foo -> float -> 'f, ..., 'f) format6
-> ((int -> foo -> float -> 'f) -> 'f)
which is not what we want, as there is no room for the extra
read_foo parameter.
The argument expected by scanf after the format string
is not just the continuation anymore, we need an extra argument for each
user-defined reader. Yet those extra arguments should not
appear in the type of the continuation.
What we need to do is to express the relation between the simple type
'a -> 'f (from continuation to result), which is the
return type of scanf fmt in absence of user-defined
readers, and the real return type of scanf. This is what
our last two parameters do:
the second-last parameter,
'e, is forced byscanffunctions to be equal to'a -> 'fthe parameter
'dis the actual type ofscanf fmtthe typing of format strings will generate constraints on the relation between
'dand'e, depending on the number of%rpresent in the format; without any%rthey are equal.
Consider our running example without any %r:
# ( "%d) %s -> %f" : (_, _, _, _, _, _) format6 );;
- : (int -> string -> float -> 'f,
'b, 'c, 'd, 'd, 'f) format6
The two parameters I’m talking about are constrained to be equal to
the same variables 'd. On the contrary, if we use
%r instead of %s:
# ( "%d) %r -> %f" : (_,_,_,_,_,_) format6 );;
- : (int -> 'x -> float -> 'f,
'b, 'c, ('b -> 'x) -> 'e, 'e, 'f) format6
Suddenly they are distinct, with one equal to the unknown
'e (that will be forced to equal
(... -> 'f) -> 'f when applying scanf),
and the other is ('b -> 'x) -> 'e: it adds on top of
'e an extra parameter (the user-defined format) which must
return a 'x (if I pass the reader read_foo,
this will be equated to foo) and takes as input a
'b, which is the type of the extra parameter of
user-defined printers or readers (scanf will set it to
Scanning.in_channel).
Benoît’s GADT declarations are surprisingly efficient at showing you precisely what happens to the type parameters when you add a new conversion description to an existing format string. In particular, the constructor for user-defined readers reads as follow:
| Reader : (* %r *)
('a, 'b, 'c, 'd, 'e, 'f) fmt ->
('x -> 'a, 'b, 'c, ('b -> 'x) -> 'd, 'e, 'f) fmtYou can very well see that an argument of type 'x (the
result type of the user-defined result) is added to the parameter
'a (the type of the continuation finally passed to
scanf), while an argument of type
('b -> 'x) is added to 'd (the type of the
arguments following the format string).
.
PS: Extra credit for the bonus question: why do I use
(fmt : (_, _, _, _, _, _) format6) instead of the more
convenient to write format_of_string fmt?