## Implementing HAMT for OCaml

- February 2, 2013

During the second part of my internship this summer, I worked on a test implementation for a new data structure in OCaml : Hash Array Mapped Tries (HAMT). I did not have the time to finish it, but I now have a usable and quite complete version, still to be more documented but already available at http://gitorious.org/ocaml-hamt.

On this article, I will quickly explain the principles and the use cases of this structure, and I will expose a few details of my implementation in OCaml.

### HAMT: Why, where, how ?

When you want to play with associative tables over non-integer keys, if you can use imperative constructions, you do not usually hesitate a long time : you use Hash Tables. This structure guarantees `O(1)`

complexity for all elementary operations, provided you can hash your values. However, it is by essence a mutable structure: if you want to use a functional style, you need persistent ones (and copying a Hash Table at every single insertion is obviously right out).

The usual choice is then to use balanced binary trees (like Maps in OCaml): if you can provide an order relation on your keys, elementary operations become `O(log(n))`

. In 2001, a new data structure is introduced by Phil Bagwell: Hash Array Mapped Tries^{1}. Where balanced binary trees have a branching factor of 2, thus having approximately a `log2(n)`

depth (`n`

is the number of elements), HAMT are prefix trees which have a much greater factor, namely 32 in the original article (in fact, the best value is the size of a word of your machine, we will see why).

But how to build *prefix* trees with your own type of key ? That's where hashing enters the scene. If you can hash your key, then you can consider that the keys of your table are not your key type, but the hashed values of your own keys. Thus you can build a prefix tree, with an obvious notion of prefix that I will not explain. There is also a second advantage: if your hash function is correctly distributed, your tree is automagically balanced: you do not it to set up by rebalancing sometimes nor by keeping extra information. Being a "prefix tree over the hashes of the keys" is therefore the basis of the structure.

HAMT have been getting publicity since they where picked as a central persistent datastructure in the Clojure programming language. We have also had encouraging feedback from use in Scala, and more recently implementations in Haskell. It seems that persistent HAMT are an interesting datastructure; yet we should not jump to conclusions, as the details of one's runtime system may have important impacts on the viability of subtle data layout choices. The purpose of the current prototype was to test the waters to see whether the "HAMT success story" could be reproduced in the OCaml world.

Its use case is rather obvious: when you need a persistent associative table over a key that you can hash, you can use HAMT. However, there is no notion of comparison in standard HAMT: if you need to be able to get quickly the value associated with the minimal key, they are not a good choice. So where is the advantage over Maps ?

Here is a small, early comparison between the two structures: we test iterations of adding elements to an already big structure, finding present elements in the structure, and a mix of adding and finding (this test is executed by timing the execution of `tests/param.ml`

, self-documented, by Gabriel Scherer, praise him) :

```
| | Map | HAMT |
| add | 7.4s | 8.6s |
| find | 7.9s | 3.9s |
| find + add | 14.9s | 11.8s |
```

We see that the HAMT are a little (although the difference seems to increase with the dimension of the test) slower on insertions, but really quicker on searches. They use Arrays and do a quite big number of copies when modifying a key, so the GC is heavily sollicitated: if we byte-compile the file and we use `OCAMLRUNPARAM="s=5M"`

to boost-up it, HAMT.add becomes a little faster than Map.add, and find is a little sped-up. However, the precise conditions leading to this results are not really understood: the size of the test has apparently a non-negligeable influence, and I cannot really make reliable performance comparisons. For an example, without modifying the GC, I just executed the following commands:

```
$ time ./param.byte hamt 300000 add
./param.byte hamt 300000 add 15,98s user 0,07s system 99% cpu 16,083 total
$ time ./param.byte map 300000 add
./param.byte map 300000 add 19,59s user 0,03s system 99% cpu 19,665 total
$ time ./param.native hamt 300000 add
./param.native hamt 300000 add 8,95s user 0,09s system 99% cpu 9,061 total
$ time ./param.native map 300000 add
./param.native map 300000 add 7,41s user 0,05s system 99% cpu 7,475 total
```

The bottom line is: the structure seems to be a good alternative to Maps. It is always (and often much) faster on finds, and has equivalent performances on add (sometimes faster, sometimes slower, under conditions that I do not really know). Furthermore, missed findings are faster than successful ones in HAMT, and this accentuates the difference in favour of this structure (even if it is not *that* accentuated).

Now that we got promising results about the structure, let's dissect it to understand how it works.

### HAMT: what is it ?

So we already know the basis of the structure: it's a prefix tree over the hashes of the key. More precisely, every node is (for now) a leaf (`key, value`

pair) or a 32 elements array (containing his sons), and to traverse the structure, we hash our key, and we take the bits from this hash 5 by 5. Thus at each level we have a 5 bits chunk, being a number between 0 and 31 which will indicate which son to choose. If it is `Empty`

, then the considered key is not present in the structure. Insertion is then trivial: just replace it by your `Leaf (k, v)`

. If we encounter a leaf, either its key is the one we are looking for, either... it isn't. If it is, insertion is made by changing the value. If it is not, insert an internal node (an array) at this place, take the chunks of hash corresponding to this depth of both keys, and put each one at its place (you may have to do it several times if the chunks appear to be the same). If we encounter an internal node, just take the following 5 bits and continue the descent. At the end of the hash, if we still are on an internal node, either we use open hashes (then there is no "end of the hash", but we cannot usually guarantee to terminate), or we place buckets in the last arrays instead of leaves.

If we use a correctly distributed hash, our structure will therefore have a depth of (at most) `log32(n)`

. Let's estimate what it means: if we want to build a structure associating to each single word of *War and Peace* its number of occurrences, we only need a depth of 4. If our keys are representing every people on Earth, we need a depth of 7. If they are every single atom on the universe, it's less than 60. Of course, our hash function will not be perfect, but we can see with these numbers that with very few chunks extraction and comparison, we can index a very important key set. We also need few array copies when we modify something on our table (because, to be persistent, we need to copy the arrays leading to our tip node from the root: that's the cost of fast indexing. *Boire ou conduire, il faut choisir.*)

However, we have a problem with this representation: each array uses a constant space, even if it is almost empty. If our structure is not very dense, we can use up to 32 times more space than needed. On big datasets, it can be problematic. There is an (elegant) astuce to fix this: rather than arrays of size 32 at each level, we only use arrays that have exactly as many cases as sons. To get the index on our array given our "virtual" index (which is the value of our chunk), we use a *bitmap*: a 32-bits integer, where each bit indicates the presence at this position of a son of our node. The order of the elements on the array follows the order of the chunks, then to get the real indice of a chunk, we just have to count the 1-bits at a indice lower than it in the bitmap. Using this method, every case of an array is useful, and if we adapted the size of our structure as I mentioned before (remember, the branching factor), the bitmap fills exactly one machine word: we have a quite optimal space use.

In his article, Phil Bagwell also uses a root table which is treated differently. I did not use this method in my implementation, so I will not talk much about it: just know basically that the first table is resized when the number of elements in the structure increases, to guarantee a constant-time access to every key (at the cost of resizing sometimes, which seems to be amortized over insertions^{2}).

Finally, let's talk about "theoretical performances": our complexity is the same as balanced binary trees, namely `O(log(n))`

for elementary operations (insertion, deletion, searching), except that our `log`

is not the same and grows very slowly (but, well, maths are strict and `O(.)`

does not want to know about that). Misses are faster than successful researches: if a key is not present in the tree, and if the hash function has good properties, we should quickly find a difference between the hash of our keys and the ones of those who are in our table. But in this kind of problem, the real determining factor is the multiplicative one behind our asymptotic complexity: we saw in the first part that, in OCaml, this factor seems to be nice enough to give us good hopes about the structure. Let's now examine the practical implementation.

### In OCaml: a few lines of code.

In this part, I will not comment about every detail of my implementation, but simply show a few lines and give short explanations about how the stuff works. My implementation is strongly inspired by exclipy's one in Haskell.

Let's start by the counting of the 1-bits in a bitmap: the modern processors offer a `POPCOUNT`

instruction which does the job. However, OCaml does not propose this instruction: we have to recode it by ourselves.

```
let sk5 = 0x55555555
let sk3 = 0x33333333
let skf0 = 0xf0f0f0f
let skff = 0xff00ff
let ctpop map =
let map = map - (map lsr 1) land sk5 in
let map = map land sk3 + (map lsr 2) land sk3 in
let map = map land skf0 + (map lsr 4) land skf0
in let map = map + map lsr 8
in (map + map lsr 16) land 0x3f
```

This code is adapted from the one Phil Bagwell proposes in C in his paper. I have no clue why it works (and I do not want to), but well, it does. It counts the number of 1-bits in a bitmap, and when you want them before a given position `n`

, just use `land`

to set the other bits to `0`

. The interesting point is that this is a portion of code which could be fastened: in an ideal world where everybody would love HAMT, we could imagine an OCaml implementing default support for `CTPOP`

. Anyway, thinking about the performances of this function would not be a loss of time when the era of micro-optimisation has come.

Here is the definition of the HAMT type:

```
type 'a t =
| Empty
| Leaf of int * key * 'a
| HashCollision of int * (key * 'a) list
| BitmapIndexedNode of int * 'a t array
| ArrayNode of int * 'a t array
```

Buckets are managed by association lists: it is probably weak to hash collision attacks. Using Maps would probably fix this problem, at the cost of having to use comparable keys. We could also use open hashes: not to decide, I simply used lists and kept the `Leaf`

constructor (instead of a `HashCollision`

with a one element list), so that it could be easily modified in the functions if another way was preferable. As for `BitmapIndexedNode`

and `ArrayNode`

, they correspond to the root table of Phil Bagwell: rather than making it grow when inserting many data, I transform bitmap arrays into full length arrays when they are dense enough: therefore, there is no need to compute a `CTPOP`

, and there is no big space loss because the constructor is only used when arrays already contain almost 32 elements. The performance fact is to be analysed: we could think that using only `ArrayNode`

would lead to better performances (at the cost of a lot of space used), but in fact, it sensibly lowers them. Then, if even dense arrays are fast, why bother with `ArrayNode`

? Because using only `BitmapIndexedNode`

also lowers the performances. The exact frontier (in terms of proportional filling of the array) between the two constructors is not yet determined, but these values seems to make it behave well:

```
module StdConfig : CONFIG = struct
let shift_step = 5
let bmnode_max = 16
let arrnode_min = 8
end
```

```
module StdConfig32 : CONFIG = struct
let shift_step = 4
let bmnode_max = 8
let arrnode_min = 4
end
```

There is also a point to be noticed: as OCaml uses one bit of the integers for garbage collecting purposes, on 64 bits architectures, we can only use 32 bits bitmaps. This also leads to a performance issue: shiftings, intensely used in HAMT (as in every bit manipulating code), are not straight processor shiftings, because this supplementary bit has to be conserved. It cannot simply be solved in OCaml, so we must take it into account when reasoning about the performances we can expect (and the tests showed the structure was still interesting).

The general purpose modification function is this one:

```
val alter : key -> ('a option -> 'a option) -> 'a t -> 'a t
```

It uses `option`

type to be generic over the modification you want to implement: this allows to write only a big function (73 lines) to manage all possible cases, but could degrade the performances. When the module is stable, it could be specialised. The same sort of interface is used for the `alter_all`

function, specialised in `map`

, `filter`

and all these things.

As we just saw, HAMT use many arrays. Therefore, at the insertion of an element, you need to recopy all the arrays from him to the root of the structure to keep it persistent. When you need to insert at one time many values, copying the ancestors for every element would uselessly long: we can just copy the destination HAMT first, and then use mutability to modify it in place before returning it. Internally, the mutability allows better performances, and using correct module interfaces we can prevent the user from unwillingly break his code by muting his structures. The *real* alter function (not exported) thus has this type:

```
val alter_node :
?mute:bool -> int -> int -> Key.t ->
('a option -> 'a option) -> 'a t -> 'a t
```

The first `int`

parameter is the hash shift value at the given depth (it is given `0`

at the beginning and is recursively incremented), the second one is the hashed value of the key (which is computed once and recursively given as a parameter rather than calculate it for every chunk). Therefore we can define another function, useful for imports:

```
let add_mute k v hamt =
alter_node ~mute:true 0 (hash key) k (fun _ -> Some v) hamt
module Import =
struct
module type FOLDABLE = sig
type key
type 'v t
val fold : (key -> 'v -> 'a -> 'a) -> 'v t -> 'a -> 'a
end
module Make (M : FOLDABLE with type key = key) =
struct
let add_from x hamt = M.fold add_mute x (copy hamt)
let from x = add_from x Empty
end
end
```

The module `Import`

provides a function usable to import many values in a HAMT, empty or not, from a structure that you can fold. It uses mutability to speed-up the process. Obviously, if your destination HAMT is already big and your structure quite small, copying the big full HAMT before muting him a few times can be slower than copying a few times the line from the root to an inserted value. When the limit is reached is to be established by the user (because a theoretical formula with logarithms would probably not be that pertinent in real code).

The rest of the code is quite straightforward once you understood the principles of the structure.

### Conclusion

The HAMT data structure appears to be quite promising for the language OCaml. Both its theoretical complexity and practical performances make it a good competitor to the great old Map, as long as your program does not already rely on comparisons.

However, this implementation should be considered more as a proof of concept rather than the final and optimal way to implement them. There are still many behaviours to be tested and understood, and quickly coded functions which could take advantage of a more thorough reflection.

Ideal Hash Trees, 2001, Phil Bagwell, Institute of Core Computing Science, Swiss Institute of Technology Lausanne↩

I must confess that I think I did not totally understand this part, because I do not exactly come to the same conclusions.↩