I recently started an internship in Gallium, working on the HeVeA LaTeX to HTML compiler from and with Luc Maranget. I'll stay here for two months, which is sufficient for having to "blog about it". In my civil life, I'm interested in about approximately everything concerning language(-related) theory.

My first job for HeVeA was changing the structure used by the Out module. It was yielding the ouput by concatenating together bare strings, thus blowing up the maximal size of strings on 32 bits machines. A data structure fit for concatenations was therefore needed, and a good choice for it is Ropes.


Basically, ropes are just balanced binary-trees, the leaves being strings. Each node is also weighted for example by the length of the rope which it is the root of (it could also be by the length of its left child, which makes no difference but administrative ones). The leaves are designed to be small enough to see the difference, and big enough to avoid too deep trees (which could have a negative effect on performances). Typically, the maximum length of leaves is 256, and the height is only limited by integer size in OCaml.

A rope for the string "Gagallium" could be :

   /     \
  5       4
 / \     / \
Ga gal  li um

On this data structure, we win on complexity against bare strings for most costly operations, while losing with O(log n) vs. O(1) for a few ones (like reading the character at a given position). Let's just link Wikipedia for the administrative details. Bottom line is, on big strings we positively gain something (more than simply being able to run HeVeA on old machines).

In HeVeA

The main place in HeVeA where this structure finds its use is in a file called out.ml. It provides functions to manipulate "buffers", which are called by the html-generating function to produce the final output. Before my intervention, the buffer type looked like :

type buff = {
  mutable buff : string;
  mutable bp : int;
  mutable len : int
;; (* In 1998 they wrote ;; *)

type t = Buff of buff | Chan of out_channel | Null

buff is the string containing the output, bp is the position of the cursor in this string, and len is String.length buff (probably for optimization reasons). When you wrote something to the buff, it is blitted at bp position, and when you retrieve something (for example to put it in the final output), you get all the part between the beginning and bp. Chan and Null are used for unknown (at my global understanding level) purposes and are not changed with ropes.

The new type looks like :

type t =
  | Rope of S.t ref
  | Chan of out_channel
  | Null

(* S being the rope module, providing type t for ropes *)

As HeVeA globally works a lot by mutability, I use a reference to a rope, which is an immutable structure. All I had to do then was changing the code of the module function by function, which was quite straightforward once I understood the exact meaning of the old code. Well, I still had strange bugs, mainly for reading too quickly the former version, but also for unexpected behaviours from external modules which I did not really know about (like Format from the standard library, which does strange things with formatters; you should read carefully about it before thinking you know how to use them).

The result

In conclusion, HeVeA is now 20% slower, which is quite a good result considering the optimizations which were done before (and the way we totally did not care about it when using ropes). It will probably stay as it is concerning this point, because it is quick enough for main purpose (namely less than one second on reasonable documents). But if it has to be sped up one day, it would be possible due to the stronger architecture of the Out module, and by using ropes on the other parts of the software, instead of converting from rope to (small) string and reciprocally on many places.