The 6 parameters of (’a, ’b, ’c, ’d, ’e, ’f) format6
- April 18, 2014
The infamous format6
type is the basis of the hackish
but damn convenient, type-safe way in which OCaml handles
format strings:
Printf.printf "%d) %s -> %f\n"
3 "foo" 5.12
The first argument of printf
in this example is not a
string, but a format string (they share the same literal syntax, but
have different type, and there is a small hack in the type-inference
engine to make this possible). It’s type, which you can get by asking
("%d) %s -> %f\n" : (_,_,_,_,_,_) format6)
in a
toplevel, is a mouthful (and I renamed the variables for better
readability):
(int -> string -> float -> 'f, 'b, 'c, 'd, 'd, 'f) format6
What do those six arcane parameters mean? In the process of reviewing Benoît Vaugon’s work on using GADTs to re-implement format functions, I’ve finally felt the need (after years of delighted oblivion) to understand each of those parameters. And that came after several days of hacking on easier-to-understand ten-parameters GADT functions; hopefully most of our readers will never need this information, but the probability that I need to refresh my memory in the future is strong enough for me to write this blog post.
The main take-away is that this type slowly evolved, from a more
reasonable ('a, 'b, 'c) format
type, to the monster that
lurks in the documentation today. Advanced features of format strings
needed more and more parameters to carry their type information. It is
possible to implement format strings with less features and simpler
types, and it may be possible to implement the same format strings with
types that are easier to understand, but I need to understand the
interface as it exists.
I built my own understanding, not by reading the documentation (though I tried that first), but by imagining layers of features, from the simple to the advanced. This is this post’s approach. It does necessarily correspond to the actual chronological evolution of format types – Pierre Weis would know about that.
At the beginning were
printf
and sprintf
# Printf.printf "%d) %s -> %f\n";;
- : int -> string -> float -> unit = <fun>
# Printf.sprintf "%d) %s -> %f\n";;
- : int -> string -> float -> string = <fun>
Those two functions differ by their return type: one prints and
returns nothing (unit
), the other returns a
string
. This parameter is stored in the sixth-parameter,
'f
, and is called “result type for the printf-like
functions” in the documentation.
Of course, the format string does not need to impose any particular
value to this parameter: a given format string can be used by a function
with any possible result type, hence the undetermined variable
'f
in the format type above. It is the printf
functions that force a precise instantiation of the parameter:
Printf.printf
has a type of the form
('a, ..., unit) format6 -> 'a
.
The type 'a
mentioned in the type of printf
is a large function type, with one parameter for each conversion
specification (%d
, %s
, etc.), and whose return
type is enforced by the format to be equal to the last parameter of the
format. In the case of "%d) %s -> %f\n"
, this type is
int -> string -> float -> 'a
, where
'a
is the same variable present in the last parameter of
the format – the result type. This type depends only on the format
string, not at all on the printing (or scanning) function used.
To summarize, in ('a, 'b, 'c, 'd, 'e, 'f) format6
, we’ve
seen that:
'f
is what you get after you pass all the parameters to a printf-like function'a
is the big function type that expects all those parameters and returns'f
'b
and
'c
for user-defined printers:
You may know about %a
and the lesser-known (but actually
simpler) %t
, used as follow:
Printf.sprintf "%d) %a -> %f"
3
print_foo foo
6.23
The idea is that instead of converting the value foo
to
a string first, you can, at printf time, pass a “user-defined printer”
print_foo
as an extra argument that tells how to convert
foo
into a string. This is easy enough to handle
typing-wise: the format %a adds two parameters to the big function type,
instead of just one.
In a non-existing ideal world, print_foo
used in
sprintf
would have the convenient type
foo -> string
. However, a single %a
must
work with many printing function: for sprintf
, the
user-defined should return a string, for printf
it should
just print and return unit
, with fprintf
it
needs to be passed the channel in which we’re printing, or with
bprintf
the target buffer. So this determines two type
parameters in ('a, 'b, 'c, 'd, 'e, 'f) format6
:
'b
is the type of the extra information that user-defined printers may need (Buffer.t
forbprintf
,output_channel
forfprintf
, etc.)'c
is the return type of the user-defined printers accepted by the function (string
forsprintf
,unit
for most others).
There are two remarks. First, it happens that printf
use
out_channel
as second parameter, while you would probably
expect unit
. This is strange API design, but the rationale
is to let you reuse your user-defined printers for both
printf
and fprintf
.
Second, you may remark that the 'c
parameter looks
suspiciously like “the result type of printf-lie functions”,
'f
. Indeed, all current printf
-like functions
use, to my knowledge, the same type for 'f
and
'c
. But it would be very reasonable, for performance
reasons, to design a different version of sprintf
that
returns a string, but whose user-defined sub-printers put result in a
Buffer.t
parameter and thus just return
unit
.
Scanf is just a continuation away
Consider the difference between printing and scanning:
Printf.printf "%d) %s %f"
3 "foo" 7.23;;
Scanf.scanf "%d) %s %f"
(fun num name value -> ....)
When printing with “%d) %s %f”, the user has to provide an int
parameter, a string parameter, and a float parameter, hence the type of
the form int -> string -> float -> ...
. With
scanning, you have to give them to the user, which looks like a
fairly different activity. How can you reuse the same typing of the
format string?
Fortunately, expressing scanning with a continuation (the function
(fun num name value -> ...)
) allows precisely to reuse a
type of the form int -> string -> float -> ...
to
describe scanning: just say that for any type 'f
, if the
user gives a int -> string -> float -> 'f
,
scanf
will return a 'f
.
In a simple world, the type of scanf
would thus be of
the following form:
('a, ..., 'f) format6 -> 'a -> 'f
(where it is understood that for any format string, 'a
will actually be of the form
t1 -> t2 -> ... -> 'f
)
(Alternate design idea: have a new type parameter that stores,
instead of a big arrow type, the product of all parameter types
(int * string * float
in our example), and make
scanf
return this directly instead of using a
continuation-passing style. The present style has the property of being
easier to implement with GADTs, but that probably wasn’t a design
consideration.)
In fact, the typing of scanf
is a bit more complicated,
as explained in the next section.
'd
and
'e
for user-defined readers
Just as user-defined printers, to write formats in a more high-level
style it is possible to provide user-defined scanners with
%r
:
let read_foo scanbuf = ...
Scanf.scanf "%d) %r -> %f"
read_foo
(fun n foo x -> ...)
Now the typing implications of this for format strings are a bit
tricky. As we read three values of type int
,
foo
and float
, the naive typing of
scanf
proposed earlier
('a, ..., 'f) -> ('a -> 'f)
would specialize to, in this instance:
(int -> foo -> float -> 'f, ..., 'f) format6
-> ((int -> foo -> float -> 'f) -> 'f)
which is not what we want, as there is no room for the extra
read_foo
parameter.
The argument expected by scanf
after the format string
is not just the continuation anymore, we need an extra argument for each
user-defined reader. Yet those extra arguments should not
appear in the type of the continuation.
What we need to do is to express the relation between the simple type
'a -> 'f
(from continuation to result), which is the
return type of scanf fmt
in absence of user-defined
readers, and the real return type of scanf
. This is what
our last two parameters do:
the second-last parameter,
'e
, is forced byscanf
functions to be equal to'a -> 'f
the parameter
'd
is the actual type ofscanf fmt
the typing of format strings will generate constraints on the relation between
'd
and'e
, depending on the number of%r
present in the format; without any%r
they are equal.
Consider our running example without any %r
:
# ( "%d) %s -> %f" : (_, _, _, _, _, _) format6 );;
- : (int -> string -> float -> 'f,
'b, 'c, 'd, 'd, 'f) format6
The two parameters I’m talking about are constrained to be equal to
the same variables 'd
. On the contrary, if we use
%r
instead of %s
:
# ( "%d) %r -> %f" : (_,_,_,_,_,_) format6 );;
- : (int -> 'x -> float -> 'f,
'b, 'c, ('b -> 'x) -> 'e, 'e, 'f) format6
Suddenly they are distinct, with one equal to the unknown
'e
(that will be forced to equal
(... -> 'f) -> 'f
when applying scanf
),
and the other is ('b -> 'x) -> 'e
: it adds on top of
'e
an extra parameter (the user-defined format) which must
return a 'x
(if I pass the reader read_foo
,
this will be equated to foo
) and takes as input a
'b
, which is the type of the extra parameter of
user-defined printers or readers (scanf
will set it to
Scanning.in_channel
).
Benoît’s GADT declarations are surprisingly efficient at showing you precisely what happens to the type parameters when you add a new conversion description to an existing format string. In particular, the constructor for user-defined readers reads as follow:
| Reader : (* %r *)
('a, 'b, 'c, 'd, 'e, 'f) fmt ->
('x -> 'a, 'b, 'c, ('b -> 'x) -> 'd, 'e, 'f) fmt
You can very well see that an argument of type 'x
(the
result type of the user-defined result) is added to the parameter
'a
(the type of the continuation finally passed to
scanf
), while an argument of type
('b -> 'x)
is added to 'd
(the type of the
arguments following the format string).
.
PS: Extra credit for the bonus question: why do I use
(fmt : (_, _, _, _, _, _) format6)
instead of the more
convenient to write format_of_string fmt
?