The core concept of namespace management is to determine to which compilation unit -- as a (set of) filesystem path(s) -- a given OCaml identifier refers.
Any mechanism for associating compilation units to OCaml module references would work; the details of how to specify this association vary across the different namespace proposals. Just imagine that the OCaml compiler is passed a compilation environment E, built by the user (with possible help from the library providers), that maps OCaml free module names to compilation units.
The type-checking pass accesses the .cmi of the external compilation units to type-check the current unit -- this also collects the external dependencies. The code-production passes use the internal name of the external modules, obtained from the compilation units, to forge in-code references to those units, to be resolved at link time. Currently, only the module name is used as a reference: a reference to ModFoo is translated into something like (Pgetglobal "CamlModFoo"), but I suggested adding a random seed, stored in the compilation unit, to avoid link-time conflicts; we would then have something like "CamlModFoo$<seed>".
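This seeded mangling could be sketched as follows; the function name and the exact symbol format are hypothetical illustrations, not the actual compiler scheme:

```ocaml
(* A minimal sketch, assuming the per-unit random seed is an integer
   stored in the compilation unit; the "%08x" layout is hypothetical. *)
let mangled_name ~seed modname =
  Printf.sprintf "Caml%s$%08x" modname seed

(* e.g. mangled_name ~seed:0x2a "ModFoo" = "CamlModFoo$0000002a" *)
```

Two units that happen to share the name ModFoo would then get distinct link-time symbols with overwhelming probability.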
Flat compilation environments solve the main problem we have with the current convention of -I directory includes plus "the filename is the module name": the difficulty of disambiguating between two different .cmi files with the same basename. If we allow the user to specify the environment of her choice, she can give the name FooM to "foo/m.cmi" and BarM to "bar/m.cmi", instead of having to refer to both by the ambiguous name M. Mission accomplished.
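The flat environment above can be modeled as a plain map from free module names to unit paths; this is only a sketch, with hypothetical names, of the association the compiler would consult:

```ocaml
(* sketch: a flat compilation environment as a string map *)
module Env = Map.Make (String)

(* a compilation unit is designated by its path prefix,
   e.g. "foo/m" for foo/m.cmi plus foo/m.cmo *)
type compunit = string

let e : compunit Env.t =
  Env.empty
  |> Env.add "FooM" "foo/m"
  |> Env.add "BarM" "bar/m"

(* the compiler would resolve a free module name through [e] *)
let resolve name = Env.find_opt name e
```

Here resolve "FooM" and resolve "BarM" pick distinct units even though both interfaces share the basename m.cmi.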
open
The compilation environments of step 1 were just flat mappings. One may wish namespaces to have a hierarchical structure: instead of a simple {name ↦ compunit}* mapping, we have a recursive structure
compenv ::= {name ↦ (compunit | compenv)}*
Compilation units can now be referred to by complete paths Foo.Bar.Baz, rather than simply by free module variables Foo, in OCaml sources. This also lets us define an open construct: open E enriches the compilation environment with the subenvironment rooted at E (or fails if E is not bound in the current environment).
For example, in the following environment E:
{
A ↦ "foo/a"
B ↦ "foo/b"
S ↦ {
C ↦ "bar/c"
D ↦ "tmp/d"
}
}
(By "foo/a" we designate the compilation unit whose interface is "foo/a.cmi" in the filesystem; we can assume that by convention the implementation units are located at "foo/a.cmo" and "foo/a.cmx", or specify all independent paths in the environment mapping; details, details... There are more details to be considered, e.g. whether the paths are absolute or relative to some -I-included directories, but those are accessory.)
open S results in the following environment E':
{
A ↦ "foo/a"
B ↦ "foo/b"
S ↦ {
C ↦ "bar/c"
D ↦ "tmp/d"
}
C ↦ "bar/c"
D ↦ "tmp/d"
}
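As a sketch of the mechanism, the compenv grammar and the open operation can be modeled with association lists; all names here are hypothetical, and shadowing is modeled by letting the freshly opened bindings come first on lookup:

```ocaml
(* sketch: the recursive compenv structure from the grammar above *)
type compunit = string

type binding = Unit of compunit | Env of compenv
and compenv = (string * binding) list

(* [open_ "S" env] enriches [env] with the bindings of the
   subenvironment S; [None] models the failure case where S is
   not bound to a subenvironment in the current environment *)
let open_ name (env : compenv) : compenv option =
  match List.assoc_opt name env with
  | Some (Env sub) -> Some (sub @ env)
  | Some (Unit _) | None -> None
```

On the example environment E above, open_ "S" E yields an environment where both S.C and the shorthand C resolve to "bar/c", matching E'.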
The hierarchical structure allows for more flexible construction of environments. Imagine that the distributors of the libraries Foo and Bar both provide environments E1 and E2 that have some conflict (e.g. both use the names List, Array and Queue to refer to some compilation unit of theirs); the user can then build the environment:
{
Foo ↦ E1
Bar ↦ E2
}
and disambiguate accesses with Foo.List and Bar.Queue, while still being able to open Foo for local convenience.
pack
How do we explain environments and subenvironments to the user? What is the difference between the hierarchical access S.C to some compilation unit (found in the environment), and the access A.M to the submodule M of compilation unit A (found in the environment)?
We can explain the difference -- that namespaces and modules are separate concepts -- with modules being for programming "at the package level", and namespaces for programming "in the large", where code comes from different distributors and the user must be able to disambiguate.
But we can also pretend that compilation environments are just like modules. We can make it appear that the subenvironment S of the previous example is just like a module containing submodules C and D.
{
A ↦ "foo/a"
B ↦ "foo/b"
S ↦ {
C ↦ "bar/c"
D ↦ "tmp/d"
}
}
The idea is to ask the type-checker to translate all references to S in the source file into the following module expression:
(struct
module C = "bar/c.cmo"
module D = "tmp/d.cmo"
end)
What we really mean by "bar/c.cmo" in the OCaml code is to use the global mangled name that will correspond to the compilation unit after linking, something like (Pgetglobal "camlC$<seed>"); one could also write the binding with its interface, module C : "bar/c.cmi" = "bar/c.cmo". The unit may actually be laid out in a different way in the file system; the point is to refer to external compilation units.
The user can write, for example, in her source code:
S.C.print "Hello"
module M = S
module N = struct
include S
let foo = ()
end
And this will get translated into (modulo possible name freshening):
module X = struct
module C = "bar/c.cmo"
module D = "tmp/d.cmo"
end
X.C.print "Hello"
module M = X
module N = struct
include X
let foo = ()
end
Note that the previous semantics for subenvironment access, at step 2 (S.C refers to the unit "bar/c"), is compatible with the semantics of this eta-expansion: we can continue translating S.C into "bar/c.cmo"; this is just optimizing the projection (struct module C = "bar/c.cmo" ... end).C by partial evaluation.
In particular, writing an additional (module D = "tmp/d") here will not provoke side effects: the effects of "tmp/d.cmo" are evaluated not at the use site, but at program initialization, if "tmp/d.cmo" is linked. The difference is that if "tmp/d" appears in the translation, it will be considered a dependency of the current compilation unit, and the linker will require it to be present (so we are forced to have it initialized); we can still specialize subenvironment accesses, and document which uses of environments-as-modules add dependencies on their subenvironments and which don't.
This functionality subsumes pack: instead of helping the library distributor build a big .cmo packing several compilation units, we give users the power to designate a set of compilation units by a single name; the "bundling" is done by the typer's translation, at the use site, rather than at the compilation site.
This has the potential advantage that users don't need to link against a big packed compiled file: if the subenvironment S is huge, but the user only accesses a few submodules S.A, S.B, without using S as a module directly, we don't need to link the whole thing. So we get the ability to have hierarchies of modules without executable-size blowup -- which was also a feature desired by some users.
Optional:
One other interesting aspect that I haven't explored much -- the idea, like most other ideas in the integration of -pack and namespaces, comes from Nicolas Pouillard -- is the possibility of using environments, on the developer side, as specifications of how to produce compiled compilation units. The developer would ask: "from the source files at hand here, produce the subenvironment S", instead of manually providing appropriate -pack and -for-pack options as done today. Users have complained that, with -pack and its minions, the semantics of an OCaml program is increasingly determined by obscure command-line options. One could see the environments -- from the developer side -- as a different way to specify packing, directed by a "source" of sorts, if one considers an environment specification as part of the OCaml program source.
One possibly nice feature that could be added in this context is the concept of "flat access namespace", a boolean flag on each subenvironment: if the submodule A of the environment P is marked as "flat access", its components P.A.C can also be accessed as P.C; this works when C is a subenvironment name, but also when it is an OCaml identifier (type, value, submodule...) in the case where P.A points to a compilation unit.
For example, if you have:
{
...
S ↦ {
Init ↦ "s/init" (flat access)
C ↦ "bar/c"
D ↦ "tmp/d"
Foo ↦ "foo" (flat access)
}
...
}
S in a module context would translate to:
(struct
module Init = "s/init.cmo"
include Init
module C = "bar/c.cmo"
module D = "tmp/d.cmo"
module Foo = "foo.cmo"
include Foo
end)
and, correspondingly, open S would have the effect of opening the module "s/init". This gadget effect has been mentioned or requested by some people already. It makes packed environments more "complete", in that their images as modules are all the possible OCaml modules, not only modules having only submodule components.
A natural consequence of this idea is that "flat access" modules present at the root of the environment would be automatically opened at the beginning of the program. This gives a proper semantics to the Pervasives auto-open feature, one that is controllable by the user.
Flat-access modules are included in an unspecified order. If several modules are marked as flat-access, their sequential inclusion may raise a typing error, for example if two of them define a type of the same name; this behaves as specified by OCaml's include, and the presence of an error does not depend on the order, even if the displayed error message may change slightly. An error will also arise -- when type-checking the elaborated code -- if a flat-access module defines submodules with the same name as some of the subenvironments. This means the user should be careful when merging unrelated environments that have flat-access modules; happily, the conflicting subenvironments can always be rewritten/transformed on the user side to avoid conflicts -- such is the strength of namespaces.
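The flat-access expansion described above can be modeled as follows; the entry and item types are hypothetical stand-ins for the typer's real data structures:

```ocaml
(* sketch: each entry of a subenvironment becomes a module binding,
   and flat-access entries additionally get an [include] *)
type entry = { name : string; path : string; flat_access : bool }

type item =
  | Bind of string * string  (* module Foo = "foo.cmo" *)
  | Include of string        (* include Foo *)

let expand (entries : entry list) : item list =
  List.concat_map
    (fun e ->
      if e.flat_access then [ Bind (e.name, e.path); Include e.name ]
      else [ Bind (e.name, e.path) ])
    entries
```

Running expand on the example environment above yields exactly the shape of the struct shown earlier: every entry bound, and Init and Foo also included.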
functor-pack
Can we subsume functor-pack? The answer is "yes, but the technical details are as subtle and difficult and invasive as functor-pack was in the first place". We still hope to provide a high-level view to the user.
The idea of functor-pack is that if you have modules B and C that refer to some module A, you may suddenly decide that you want them to be parametric over different implementations of A. Or you may want to develop a functor over the interface a.cmi that you would like to split into several files. You write some magic along the lines of:
ocamlc -functor a.cmi -c b.ml -o b.cmo
ocamlc -functor a.cmi -c c.ml -o c.cmo
ocamlc -pack-functor F b.cmo c.cmo -o f.cmo
And voila, f.cmo contains a functor F that takes as parameter a module of signature "a.cmi", and returns two submodules B and C as you would expect. This is implemented in a patch by Fabrice Le Fessant.
Note that this works even if the module C depends on B: accesses to B from C are not compiled as normal access to an external module, but as access to a module component that will be (somehow) passed as parameter to the functor. We will call such dependencies "functorized dependencies".
To be able to functorize modules with cross-dependencies, Fabrice's patch enriches the .cmi and .cmo files of -functor-ized compilation units with a new piece of information: "functor parts" (or, in mixin parlance, "imports"). These store the (link-time) name and signature of the dependencies that are themselves functorized. In the example above, B would have A as its functor_parts, and C would have both A and B. But a dependency on the standard library, which is not functorized over anything, would be compiled as a hard dependency and not appear as a functorized import.
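The extra metadata could look like the following record types; this is an illustrative sketch, not the actual fields of Fabrice's patch:

```ocaml
(* sketch: metadata a functorized compilation unit might carry *)
type functor_part = {
  part_name : string;  (* link-time name of the import, e.g. "A" *)
  part_intf : string;  (* its interface, e.g. "a.cmi" *)
}

type functorized_unit = {
  export : string;  (* signature of the module body, from the .mli *)
  functor_parts : functor_part list;
  (* for b.cmo this list would hold A; for c.cmo, A and B; hard
     dependencies such as the standard library do not appear *)
}
```

The important invariant is that functor_parts is ordered: it fixes the parameter order of the generated Make functor described below.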
In the namespace setting, we can reuse the expansion idea. Suppose we have the following environment describing this example situation:
{
F("a") ↦ {
B ↦ "b"
C ↦ "c"
}
}
We reuse the "functor_parts" metadata to automatically provide the functorized dependencies during expansion. More precisely, we ask that the modules compiled with the -functor option:
- register in a new "functor_parts" metadata component their parameter, but also their functorized dependencies -- the dependencies on external modules whose .cmi have a non-empty list of functor parts;
- store as compiled code a single toplevel phrase: a Make module taking all those functor_parts modules as parameters, in the same order as its "functor_parts";
- register the signature of the module body (corresponding to their .mli; in mixin parlance, the "export") in the signature field of the .cmi, instead of writing the signature of the functorized module corresponding to the compiled code.
We can then expand uses of F in a well-typed way, accounting for the functorized dependencies, by inspecting the .cmi and its "functor_parts" metadata at expansion time:
functor(M : "a.cmi") -> struct
module B : "b.cmi" = "b.cmo".Make(M)
module C : "c.cmi" = "c.cmo".Make(M)(B)
end
Namespace-as-module expansion is now doing the boring work of plumbing the inter-dependencies of the functor components. Users have complained about the need to do this manually for some time; see this 2008 thread by Yaron Minsky.
Optional:
Note that, as with simple -pack, it would be interesting to see if this environment provides enough information to the compiler to produce functorized compilation units, instead of simply referencing them from the library user side. This would allow developers to use functor-packing with less fiddling with command-line options.
We could, as we did with the simple module-packing of Step 3, do a partial evaluation: when using F(M).B directly, it would make sense to specialize the translation so as not to build and depend on C. On the contrary, when using F(M).C, B is a dependency and must be built. As with the specialization of module-packs, this could change the recorded dependencies of a module -- and thus impose linking some modules; but it could also have observable side effects at each use site, by not evaluating the effects of (in the F(M).B example) "c.cmo".Make(M)(B). Again, it is important that the contexts triggering specialization be well-defined and natural to the programmer.
There remains a small but painful detail to handle here: the order in which the module dependencies are evaluated. Suppose we have a second module C2 which depends on B, and a module D depending on C and C2. Do we initialize B, C, C2, then D, or B, C2, C, then D?
The order of expansion didn't matter in the Step 3 setting, as expansion had no observable effects: all effects had been performed when initializing the whole program, depending on the linking order. As we reproduce the linking logic here, we must also decide on the "linking order" of the instantiation.
In the setting of Fabrice's patch, the order is given by the linking order at -pack time: when the user compiles her final functor with
ocamlc -pack b.cmo c2.cmo c.cmo d.cmo -o bigfunctor.cmo
the user imposes that the effects of c2 be evaluated before the effects of c. But as we expand directly in user code, without requesting any precompilation of "the pack", we don't have access to such ordering information.
The choice of "least surprise" to the user would be to reuse the order in which the functorized modules will be linked to produce the final executable -- even if the linking of c2.cmo and c.cmo have no effect by themselves, as they evaluate to big functors whose effects are delayed to instantiation time.
We propose (yet another subtlety) to record, at program initialization time, the order in which the functorized modules were linked. Suppose we have a global list "rev_linked_modules" and compile e.g. c.ml into:
rev_linked_modules := "c" :: !rev_linked_modules;
module Make(M:"a")(B:"b") = struct ... end
At eta-expansion time we can produce code that will, at runtime, check rev_linked_modules to initialize the different components in the expected order. The typed translation for this is a bit convoluted, but you can't have your cake and eat it too ("le beurre et l'argent du beurre"):
functor(M) -> struct
let b = ref None in
let c = ref None in
let c2 = ref None in
let d = ref None in
let get x = match !x with
| Some v -> v
| None ->
(* consistent linking order has been checked by the linker *)
assert false in
let init = function
| "b" ->
b := Some (module "b.cmo".Make(M) : "b.cmi")
| "c" ->
let module B = (val (get b) : "b.cmi") in
c := Some (module "c.cmo".Make(M)(B) : "c.cmi")
| ... (* likewise for "c2", "d" *)
in
List.iter init (List.rev !rev_linked_modules);
module B = (val (get b) : ...)
module C = (val (get c) : ...)
...
end
You most probably did not want to know about this, but you can forget it all and just remember that we considered evaluation order issues and consistency with the previous behavior.
Note that this translation would accommodate an extension of -functor to mutually recursive modules. Imagine we have the environment:
{ F("a") ↦ { B ↦ "b", C ↦ "c" } }
with B and C recursively depending on each other. Both would have A, B and C as their "functor_parts" metadata, and F in user code would expand to (we drop the linking-order madness for the sake of this example):
functor(M : "a") -> struct
module rec B : "b" = "b".Make(M)(B)(C)
and C : "c" = "c".Make(M)(B)(C)
end
This will of course follow the rules set for recursive modules (existence of a "safe" module, etc.), and be robust to any change of those rules.
While we've been elusive about how compilation environments are built, there is one operation that we are very interested in: the deep, hierarchical merge of two environments: merging the mappings to compilation units, and recursively merging the subenvironments having the same name.
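A sketch of this deep merge, modeling environments as association lists; the conflict policy for names bound on both sides is a placeholder, and all names are hypothetical:

```ocaml
(* sketch: units merge at each level, same-named subenvironments
   merge recursively *)
type binding = Unit of string | Env of compenv
and compenv = (string * binding) list

let rec merge (e1 : compenv) (e2 : compenv) : compenv =
  List.fold_left
    (fun acc (name, b2) ->
      match List.assoc_opt name acc, b2 with
      | Some (Env s1), Env s2 ->
          (name, Env (merge s1 s2)) :: List.remove_assoc name acc
      | Some _, _ -> acc  (* conflict: keep the left-hand binding;
                             one could also fail here *)
      | None, _ -> acc @ [ (name, b2) ])
    e1 e2
```

Merging { Foo ↦ E1 } with { Foo ↦ E1' } thus yields { Foo ↦ merge E1 E1' }, which is what makes combining environments from several distributors painless.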
This operation still works well at step 3, when we consider packing environments as modules. However, it doesn't work as well for functorized environments: what would it mean to merge two subenvironments depending on different signatures, F("a1") ↦ {B1 ↦ "b1"} and F("a2") ↦ {B2 ↦ "b2"}? Curried functor parameters? In which order?
To help painless merging of arbitrary environments, we may explicitly pass parameters by name, rather than by position. We would write in an environment:
F(A1:"a1", A2:"a2") ↦ {B1↦"b1", B2↦"b2"}
and that would generate the following code (again forgetting about link-time ordering issues), assuming "b1" only depends on "a1" and "b2" on "a2":
functor(M : sig
module A1 : "a1.cmi"
module A2 : "a2.cmi"
end) -> struct
module B1 : "b1.cmi" = "b1.cmo".Make(M.A1)
module B2 : "b2.cmi" = "b2.cmo".Make(M.A2)
end
The user would then have to write explicitly:
F(struct module A1 = ... module A2 = ... end)
Merging functors with named parameters is easy if the parameter names are distinct, or if the common names use the same signature on each side. If two environments request two different signatures for an argument with the same name, merging them and using the packed functor in user code will fail type-checking. That is a lesser evil, as environments are under the total control of the user: she can change them to avoid the name conflict.
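This merge-by-name discipline can be sketched as follows, representing interfaces as plain strings; identical bindings merge silently and a genuine signature conflict is rejected (all names are hypothetical):

```ocaml
(* sketch: merging two named-parameter lists of packed functors *)
exception Conflict of string

let merge_params (p1 : (string * string) list) p2 =
  List.fold_left
    (fun acc (name, intf) ->
      match List.assoc_opt name acc with
      | None -> acc @ [ (name, intf) ]
      | Some intf' when String.equal intf intf' -> acc
      | Some _ -> raise (Conflict name))
    p1 p2
```

For instance, merging the parameters of F(A1:"a1") and F(A1:"a1", A2:"a2") succeeds, while merging F(A:"a1") with F(A:"a2") raises Conflict "A" -- the analogue of the type-checking failure described above.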