The core concept of namespace management is to determine to which compilation unit -- as a (set of) filesystem path(s) -- a given OCaml identifier refers.
Any mechanism for associating compilation units to OCaml module references would work; the details of how to specify this association vary across the different namespace proposals. Just imagine that the OCaml compiler is passed a compilation environment E, built by the user (with possible help from the library providers), that maps OCaml free module names to compilation units.
The type-checking pass accesses the .cmi of the external compilation units to type-check the current unit -- this also collects the external dependencies. The code-production passes use the internal name of the external modules, obtained from the compilation units, to forge in-code references to those units, to be resolved at link time. Currently, only the module name is used as a reference: a reference to ModFoo is translated into something like (Pgetglobal "CamlModFoo"), but I suggested adding a random seed, stored in the compilation unit, to avoid link-time conflicts; we would then have something like "CamlModFoo$<seed>".
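This seeded mangling could be sketched as follows; the function name and the exact symbol format are hypothetical illustrations, not the actual compiler scheme:

```ocaml
(* A minimal sketch, assuming the per-unit random seed is an integer
   stored in the compilation unit; the "%08x" layout is hypothetical. *)
let mangled_name ~seed modname =
  Printf.sprintf "Caml%s$%08x" modname seed

(* e.g. mangled_name ~seed:0x2a "ModFoo" = "CamlModFoo$0000002a" *)
```

Two units that happen to share the name ModFoo would then get distinct link-time symbols with overwhelming probability.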
Flat compilation environments solve the main problem we have with the current convention of -I directory includes plus "the filename is the module name": the difficulty of disambiguating between two different .cmi files with the same basename. If we allow the user to specify the environment of her choice, she can give the name FooM to "foo/m.cmi" and BarM to "bar/m.cmi", instead of having to refer to both by the ambiguous name M. Mission accomplished.
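The flat environment above can be modeled as a plain map from free module names to unit paths; this is only a sketch, with hypothetical names, of the association the compiler would consult:

```ocaml
(* sketch: a flat compilation environment as a string map *)
module Env = Map.Make (String)

(* a compilation unit is designated by its path prefix,
   e.g. "foo/m" for foo/m.cmi plus foo/m.cmo *)
type compunit = string

let e : compunit Env.t =
  Env.empty
  |> Env.add "FooM" "foo/m"
  |> Env.add "BarM" "bar/m"

(* the compiler would resolve a free module name through [e] *)
let resolve name = Env.find_opt name e
```

Here resolve "FooM" and resolve "BarM" pick distinct units even though both interfaces share the basename m.cmi.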
open
The compilation environments of step 1 were just flat mappings. One may wish namespaces to have a hierarchical structure: instead of a simple {name ↦ compunit}* mapping, we have a recursive structure
compenv ::= {name ↦ (compunit | compenv)}*
Compilation units can now be referred to by complete paths Foo.Bar.Baz, rather than simply by free module variables Foo, in OCaml sources. This also lets us define an open construct: open E enriches the compilation environment with the subenvironment rooted at E (or fails if E is not bound in the current environment).
For example, in the following environment E:
{
A ↦ "foo/a"
B ↦ "foo/b"
S ↦ {
C ↦ "bar/c"
D ↦ "tmp/d"
}
}
(By "foo/a" we designate the compilation unit whose interface is "foo/a.cmi" in the filesystem; we can assume that by convention the implementation units are located at "foo/a.cmo" and "foo/a.cmx", or specify all independent paths in the environment mapping; details, details... There are more details to be considered, e.g. whether the paths are absolute or relative to some -I-included directories, but those are accessory.)
open S results in the following environment E':
{
A ↦ "foo/a"
B ↦ "foo/b"
S ↦ {
C ↦ "bar/c"
D ↦ "tmp/d"
}
C ↦ "bar/c"
D ↦ "tmp/d"
}
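As a sketch of the mechanism, the compenv grammar and the open operation can be modeled with association lists; all names here are hypothetical, and shadowing is modeled by letting the freshly opened bindings come first on lookup:

```ocaml
(* sketch: the recursive compenv structure from the grammar above *)
type compunit = string

type binding = Unit of compunit | Env of compenv
and compenv = (string * binding) list

(* [open_ "S" env] enriches [env] with the bindings of the
   subenvironment S; [None] models the failure case where S is
   not bound to a subenvironment in the current environment *)
let open_ name (env : compenv) : compenv option =
  match List.assoc_opt name env with
  | Some (Env sub) -> Some (sub @ env)
  | Some (Unit _) | None -> None
```

On the example environment E above, open_ "S" E yields an environment where both S.C and the shorthand C resolve to "bar/c", matching E'.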
The hierarchical structure allows for more flexible construction of environments. Imagine that the distributors of the libraries Foo and Bar both provide environments E1 and E2 that have some conflict (e.g. both use the names List, Array and Queue to refer to some compilation unit of theirs); the user can then build the environment:
{
Foo ↦ E1
Bar ↦ E2
}
and disambiguate accesses with Foo.List and Bar.Queue, while still being able to open Foo for local convenience.
pack
How do we explain environments and subenvironments to the user? What is the difference between the hierarchical access S.C to some compilation unit (found in the environment), and the access A.M to the submodule M of compilation unit A (found in the environment)?
We can explain the difference -- that namespaces and modules are separate concepts -- with modules being for programming "at the package level", and namespaces for programming "in the large", where code comes from different distributors and the user must be able to disambiguate.
But we can also pretend that compilation environments are just like modules. We can make it appear that the subenvironment S of the previous example is just like a module containing submodules C and D.
{
A ↦ "foo/a"
B ↦ "foo/b"
S ↦ {
C ↦ "bar/c"
D ↦ "tmp/d"
}
}
The idea is to ask the type-checker to translate all references to S in the source file into the following module expression:
(struct
module C = "bar/c.cmo"
module D = "tmp/d.cmo"
end)
What we really mean by "bar/c.cmo" in the OCaml code is to use the global mangled name that will correspond to the compilation unit after linking, something like (Pgetglobal "camlC$<seed>"); one could also write the binding with its interface, module C : "bar/c.cmi" = "bar/c.cmo". The unit may actually be laid out in a different way in the file system; the point is to refer to external compilation units.
The user can write, for example, in her source code:
S.C.print "Hello"
module M = S
module N = struct
include S
let foo = ()
end
And this will get translated into (modulo possible name freshening):
module X = struct
module C = "bar/c.cmo"
module D = "tmp/d.cmo"
end
X.C.print "Hello"
module M = X
module N = struct
include X
let foo = ()
end
Note that the previous semantics for subenvironment access, at step 2 (S.C refers to the unit "bar/c"), is compatible with the semantics of this eta-expansion: we can continue translating S.C into "bar/c.cmo"; this is just optimizing the projection (struct module C = "bar/c.cmo" ... end).C by partial evaluation.
In particular, writing an additional (module D = "tmp/d") here will not provoke side effects: the effects of "tmp/d.cmo" are evaluated not at the use site, but at program initialization, if "tmp/d.cmo" is linked. The difference is that if "tmp/d" appears in the translation, it will be considered a dependency of the current compilation unit, and the linker will require it to be present (so we are forced to have it initialized); we can still specialize subenvironment accesses, and document which uses of environments-as-modules add dependencies on their subenvironments and which don't.
This functionality subsumes pack: instead of helping the library distributor build a big .cmo packing several compilation units, we give users the power to designate a set of compilation units by a single name; the "bundling" is done by the typer's translation, at the use site, rather than at the compilation site.
This has the potential advantage that users don't need to link against a big packed compiled file: if the subenvironment S is huge, but the user only accesses a few submodules S.A, S.B, without using S as a module directly, we don't need to link the whole thing. So we get the ability to have hierarchies of modules without executable-size blowup -- which was also a feature desired by some users.
Optional:
One other interesting aspect that I haven't explored much -- the idea, like most other ideas in the integration of -pack and namespaces, comes from Nicolas Pouillard -- is the possibility of using environments, on the developer side, as specifications of how to produce compiled compilation units. The developer would ask: "from the source files at hand here, produce the subenvironment S", instead of manually providing appropriate -pack and -for-pack options as done today. Users have complained that, with -pack and its minions, the semantics of an OCaml program is increasingly determined by obscure command-line options. One could see the environments -- from the developer side -- as a different way to specify packing, directed by a "source" of sorts, if one considers an environment specification as part of the OCaml program source.
One possibly nice feature that could be added in this context is the concept of "flat access namespace", a boolean flag on each subenvironment: if the submodule A of the environment P is marked as "flat access", its components P.A.C can also be accessed as P.C; this works when C is a subenvironment name, but also when it is an OCaml identifier (type, value, submodule...) in the case where P.A points to a compilation unit.
For example, if you have:
{
...
S ↦ {
Init ↦ "s/init" (flat access)
C ↦ "bar/c"
D ↦ "tmp/d"
Foo ↦ "foo" (flat access)
}
...
}
S in a module context would translate to:
(struct
module Init = "s/init.cmo"
include Init
module C = "bar/c.cmo"
module D = "tmp/d.cmo"
module Foo = "foo.cmo"
include Foo
end)
and, correspondingly, open S would have the effect of opening the module "s/init". This gadget effect has been mentioned or requested by some people already. It makes packed environments more "complete", in that their images as modules are all the possible OCaml modules, not only modules having only submodule components.
A natural consequence of this idea is that "flat access" modules present at the root of the environment would be automatically opened at the beginning of the program. This gives a proper semantics to the Pervasives auto-open feature, one that is controllable by the user.
Flat-access modules are included in an unspecified order. If several modules are marked as flat-access, their sequential inclusion may raise a typing error, for example if two of them define a type of the same name; this behaves as specified by OCaml's include, and the presence of an error does not depend on the order, even if the displayed error message may change slightly. An error will also arise -- when type-checking the elaborated code -- if a flat-access module defines submodules with the same name as some of the subenvironments. This means the user should be careful when merging unrelated environments that have flat-access modules; happily, the conflicting subenvironments can always be rewritten/transformed on the user side to avoid conflicts -- such is the strength of namespaces.
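The flat-access expansion described above can be modeled as follows; the entry and item types are hypothetical stand-ins for the typer's real data structures:

```ocaml
(* sketch: each entry of a subenvironment becomes a module binding,
   and flat-access entries additionally get an [include] *)
type entry = { name : string; path : string; flat_access : bool }

type item =
  | Bind of string * string  (* module Foo = "foo.cmo" *)
  | Include of string        (* include Foo *)

let expand (entries : entry list) : item list =
  List.concat_map
    (fun e ->
      if e.flat_access then [ Bind (e.name, e.path); Include e.name ]
      else [ Bind (e.name, e.path) ])
    entries
```

Running expand on the example environment above yields exactly the shape of the struct shown earlier: every entry bound, and Init and Foo also included.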
functor-pack
Can we subsume functor-pack? The answer is "yes, but the technical details are as subtle and difficult and invasive as functor-pack was in the first place". We still hope to provide a high-level view to the user.
The idea of functor-pack is that if you have modules B and C that refer to some module A, you may suddenly decide that you want them to be parametric over different implementations of A. Or you may want to develop a functor over the interface a.cmi that you would like to split into several files. You write some magic along the lines of:
ocamlc -functor a.cmi -c b.ml -o b.cmo
ocamlc -functor a.cmi -c c.ml -o c.cmo
ocamlc -pack-functor F b.cmo c.cmo -o f.cmo
And voila, f.cmo contains a functor F that takes as parameter a module of signature "a.cmi", and returns two submodules B and C as you would expect. This is implemented in a patch by Fabrice Le Fessant.
Note that this works even if the module C depends on B: accesses to B from C are not compiled as normal access to an external module, but as access to a module component that will be (somehow) passed as parameter to the functor. We will call such dependencies "functorized dependencies".
To be able to functorize modules with cross-dependencies, Fabrice's patch enriches the .cmi and .cmo files of -functor-ized compilation units with a new piece of information: "functor parts" (or, in mixin parlance, "imports"). These store the (link-time) name and signature of the dependencies that are themselves functorized. In the example above, B would have A as its functor_parts, and C would have both A and B. But a dependency on the standard library, which is not functorized over anything, would be compiled as a hard dependency and not appear as a functorized import.
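The extra metadata could look like the following record types; this is an illustrative sketch, not the actual fields of Fabrice's patch:

```ocaml
(* sketch: metadata a functorized compilation unit might carry *)
type functor_part = {
  part_name : string;  (* link-time name of the import, e.g. "A" *)
  part_intf : string;  (* its interface, e.g. "a.cmi" *)
}

type functorized_unit = {
  export : string;  (* signature of the module body, from the .mli *)
  functor_parts : functor_part list;
  (* for b.cmo this list would hold A; for c.cmo, A and B; hard
     dependencies such as the standard library do not appear *)
}
```

The important invariant is that functor_parts is ordered: it fixes the parameter order of the generated Make functor described below.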
In the namespace setting, we can reuse the expansion idea. Suppose we have the following environment describing this example situation:
{
F("a") ↦ {
B ↦ "b"
C ↦ "c"
}
}
We reuse the "functor_parts" metadata to automatically provide the functorized dependencies during expansion. More precisely, we ask that the modules compiled with the -functor option:
- register in a new "functor_parts" metadata component their parameter, but also their functorized dependencies -- the dependencies on external modules whose .cmi have a non-empty list of functor parts;
- store as compiled code a single toplevel phrase: a Make module taking all those functor_parts modules as parameters, in the same order as its "functor_parts";
- register the signature of the module body (corresponding to their .mli; in mixin parlance, the "export") in the signature field of the .cmi, instead of writing the signature of the functorized module corresponding to the compiled code.
We can then expand uses of F in a well-typed way, accounting for the functorized dependencies, by inspecting the .cmi and its "functor_parts" metadata at expansion time:
functor(M : "a.cmi") -> struct
module B : "b.cmi" = "b.cmo".Make(M)
module C : "c.cmi" = "c.cmo".Make(M)(B)
end
Namespace-as-module expansion is now doing the boring work of plumbing the inter-dependencies of the functor components. Users have complained about the need to do this manually for some time; see this 2008 thread by Yaron Minsky.
Optional:
Note that, as with simple -pack, it would be interesting to see if this environment provides enough information to the compiler to produce functorized compilation units, instead of simply referencing them from the library user side. This would allow developers to use functor-packing with less fiddling with command-line options.
We could, as we did with the simple module-packing of Step 3, do a partial evaluation: when using F(M).B directly, it would make sense to specialize the translation so as not to build and depend on C. On the contrary, when using F(M).C, B is a dependency and must be built. As with the specialization of module-packs, this could change the recorded dependencies of a module -- and thus impose linking some modules; but it could also have observable side effects at each use site, by not evaluating the effects of (in the F(M).B example) "c.cmo".Make(M)(B). Again, it is important that the contexts triggering specialization be well-defined and natural to the programmer.
There remains a small but painful detail to handle here: the order in which the module dependencies are evaluated. Suppose we have a second module C2 which depends on B, and a module D depending on C and C2. Do we initialize B, C, C2, then D, or B, C2, C, then D?
The order of expansion didn't matter in the Step 3 setting, as expansion had no observable effects: all effects had been performed when initializing the whole program, depending on the linking order. As we reproduce the linking logic here, we must also decide on the "linking order" of the instantiation.
In the setting of Fabrice's patch, the order is given by the linking order at -pack time: when the user compiles her final functor with
ocamlc -pack b.cmo c2.cmo c.cmo d.cmo -o bigfunctor.cmo
the user imposes that the effects of c2 be evaluated before the effects of c. But as we expand directly in user code, without requesting any precompilation of "the pack", we don't have access to such ordering information.
The choice of "least surprise" to the user would be to reuse the order in which the functorized modules will be linked to produce the final executable -- even if the linking of c2.cmo and c.cmo have no effect by themselves, as they evaluate to big functors whose effects are delayed to instantiation time.
We propose (yet another subtlety) to record, at program initialization time, the order in which the functorized modules were linked. Suppose we have a global list "rev_linked_modules" and compile e.g. c.ml into:
rev_linked_modules := "c" :: !rev_linked_modules;
module Make(M:"a")(B:"b") = struct ... end
At eta-expansion time we can produce code that will, at runtime, check rev_linked_modules to initialize the different components in the expected order. The typed translation for this is a bit convoluted, but you can't have your cake and eat it too ("le beurre et l'argent du beurre"):
functor(M) -> struct
let b = ref None in
let c = ref None in
let c2 = ref None in
let d = ref None in
let get x = match !x with
| Some v -> v
| None ->
(* consistent linking order has been checked by the linker *)
assert false in
let init = function
| "b" ->
b := Some (module "b.cmo".Make(M) : "b.cmi")
| "c" ->
let module B = (val (get b) : "b.cmi") in
c := Some (module "c.cmo".Make(M)(B) : "c.cmi")
| ... (* likewise for "c2", "d" *)
in
List.iter init (List.rev !rev_linked_modules);
module B = (val (get b) : ...)
module C = (val (get c) : ...)
...
end
You most probably did not want to know about this, but you can forget it all and just remember that we considered evaluation order issues and consistency with the previous behavior.
Note that this translation would accommodate an extension of -functor to mutually recursive modules. Imagine we have the environment:
{ F("a") ↦ { B ↦ "b", C ↦ "c" } }
with B and C recursively depending on each other. Both would have A, B and C as their "functor_parts" metadata, and F in user code would expand to (we drop the linking-order madness for the sake of this example):
functor(M : "a") -> struct
module rec B : "b" = "b".Make(M)(B)(C)
and C : "c" = "c".Make(M)(B)(C)
end
This will of course follow the rules set for recursive modules (existence of a "safe" module, etc.), and be robust to any change of those rules.
While we've been elusive about how compilation environments are built, there is one operation that we are very interested in: the deep, hierarchical merge of two environments: merging the mappings to compilation units, and recursively merging the subenvironments having the same name.
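A sketch of this deep merge, modeling environments as association lists; the conflict policy for names bound on both sides is a placeholder, and all names are hypothetical:

```ocaml
(* sketch: units merge at each level, same-named subenvironments
   merge recursively *)
type binding = Unit of string | Env of compenv
and compenv = (string * binding) list

let rec merge (e1 : compenv) (e2 : compenv) : compenv =
  List.fold_left
    (fun acc (name, b2) ->
      match List.assoc_opt name acc, b2 with
      | Some (Env s1), Env s2 ->
          (name, Env (merge s1 s2)) :: List.remove_assoc name acc
      | Some _, _ -> acc  (* conflict: keep the left-hand binding;
                             one could also fail here *)
      | None, _ -> acc @ [ (name, b2) ])
    e1 e2
```

Merging { Foo ↦ E1 } with { Foo ↦ E1' } thus yields { Foo ↦ merge E1 E1' }, which is what makes combining environments from several distributors painless.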
This operation still works well at step 3, when we consider packing environments as modules. However, it doesn't work as well for functorized environments: what would it mean to merge two subenvironments depending on different signatures, F("a1") ↦ {B1 ↦ "b1"} and F("a2") ↦ {B2 ↦ "b2"}? Curried functor parameters? In which order?
To help painless merging of arbitrary environments, we may explicitly pass parameters by name, rather than by position. We would write in an environment:
F(A1:"a1", A2:"a2") ↦ {B1↦"b1", B2↦"b2"}
and that would generate the following code (again forgetting about link-time ordering issues), assuming "b1" only depends on "a1" and "b2" on "a2":
functor(M : sig
module A1 : "a1.cmi"
module A2 : "a2.cmi"
end) -> struct
module B1 : "b1.cmi" = "b1.cmo".Make(M.A1)
module B2 : "b2.cmi" = "b2.cmo".Make(M.A2)
end
The user would then have to write explicitly:
F(struct module A1 = ... module A2 = ... end)
Merging functors with named parameters is easy if the parameter names are distinct, or if the common names use the same signature on each side. If two environments request two different signatures for an argument with the same name, merging them and using the packed functor in user code will fail type-checking. That is a lesser evil, as environments are under the total control of the user: she can change them to avoid the name conflict.
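This merge-by-name discipline can be sketched as follows, representing interfaces as plain strings; identical bindings merge silently and a genuine signature conflict is rejected (all names are hypothetical):

```ocaml
(* sketch: merging two named-parameter lists of packed functors *)
exception Conflict of string

let merge_params (p1 : (string * string) list) p2 =
  List.fold_left
    (fun acc (name, intf) ->
      match List.assoc_opt name acc with
      | None -> acc @ [ (name, intf) ]
      | Some intf' when String.equal intf intf' -> acc
      | Some _ -> raise (Conflict name))
    p1 p2
```

For instance, merging the parameters of F(A1:"a1") and F(A1:"a1", A2:"a2") succeeds, while merging F(A:"a1") with F(A:"a2") raises Conflict "A" -- the analogue of the type-checking failure described above.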