## Implementing HAMT for OCaml

- February 2, 2013

During the second part of my internship this summer, I worked on a test implementation of a new data structure in OCaml: Hash Array Mapped Tries (HAMT). I did not have time to finish it, but I now have a usable and fairly complete version, which still needs more documentation but is already available at http://gitorious.org/ocaml-hamt.

In this article, I will quickly explain the principles and use cases of this structure, and I will present a few details of my implementation in OCaml.

### HAMT: Why, where, how?

When you want to play with associative tables over non-integer keys,
and you can use imperative constructions, you usually do not hesitate
for long: you use hash tables. This structure offers expected `O(1)`
complexity for all elementary operations, provided you can hash your
keys. However, it is in essence a mutable structure: if you want to use
a functional style, you need a persistent one (and copying a hash table
at every single insertion is obviously out of the question).

The usual choice is then to use balanced binary trees (like Maps in
OCaml): if you can provide an order relation on your keys, elementary
operations become `O(log(n))`. In 2001, Phil Bagwell introduced a new
data structure: Hash Array Mapped Tries^{1}. Where balanced binary trees
have a branching factor of 2, and thus a depth of approximately
`log2(n)` (where `n` is the number of elements), HAMT are prefix trees
with a much greater branching factor, namely 32 in the original article
(in fact, the best value is the size of a machine word, and we will see
why).

But how do you build *prefix* trees with your own key type?
That’s where hashing enters the scene. If you can hash your keys, then
you can consider that the keys of your table are not values of your key
type, but their hashes. You can thus build a prefix tree, with an
obvious notion of prefix that I will not explain. There is also a second
advantage: if your hash function is well distributed, your tree is
automagically balanced: you do not need to maintain it by occasional
rebalancing or by keeping extra information. Being a “prefix tree over
the hashes of the keys” is therefore the basis of the structure.

HAMT have been getting publicity since they were picked as a central persistent data structure in the Clojure programming language. We have also had encouraging feedback from their use in Scala, and more recently from implementations in Haskell. It seems that persistent HAMT are an interesting data structure; yet we should not jump to conclusions, as the details of one’s runtime system may have an important impact on the viability of subtle data layout choices. The purpose of the current prototype was to test the waters and see whether the “HAMT success story” could be reproduced in the OCaml world.

Its use case is rather obvious: when you need a persistent associative table over a key that you can hash, you can use HAMT. However, there is no notion of comparison in standard HAMT: if you need quick access to the value associated with the minimal key, they are not a good choice. So where is the advantage over Maps?

Here is a small, early comparison between the two structures: we test
iterations of adding elements to an already big structure, finding
present elements in the structure, and a mix of adding and finding (this
test is run by timing the execution of `tests/param.ml`, which is
self-documented and was written by Gabriel Scherer, praise him):

| Operation  | Map   | HAMT  |
|------------|-------|-------|
| add        | 7.4s  | 8.6s  |
| find       | 7.9s  | 3.9s  |
| find + add | 14.9s | 11.8s |

We see that HAMT are a little slower on insertions (although the
difference seems to increase with the size of the test), but much
quicker on searches. They use arrays and make quite a large number of
copies when modifying a key, so the GC is heavily solicited: if we
byte-compile the file and use `OCAMLRUNPARAM="s=5M"` (a larger minor
heap) to give it a boost, `HAMT.add` becomes a little faster than
`Map.add`, and `find` is sped up a little. However, the precise
conditions leading to these results are not really understood: the size
of the test apparently has a non-negligible influence, and I cannot
really make reliable performance comparisons. For example, without
modifying the GC settings, I just executed the following commands:

```
$ time ./param.byte hamt 300000 add
./param.byte hamt 300000 add 15,98s user 0,07s system 99% cpu 16,083 total
$ time ./param.byte map 300000 add
./param.byte map 300000 add 19,59s user 0,03s system 99% cpu 19,665 total
$ time ./param.native hamt 300000 add
./param.native hamt 300000 add 8,95s user 0,09s system 99% cpu 9,061 total
$ time ./param.native map 300000 add
./param.native map 300000 add 7,41s user 0,05s system 99% cpu 7,475 total
```

The bottom line is: the structure seems to be a good alternative to
Maps. It is always (and often much) faster on finds, and has equivalent
performance on add (sometimes faster, sometimes slower, under
conditions that I do not really understand). Furthermore, failed lookups
are faster than successful ones in HAMT, which widens the gap in favour
of this structure (even if not by *that* much).

Now that we have promising results about the structure, let’s dissect it to understand how it works.

### HAMT: what is it?

So we already know the basis of the structure: it’s a prefix tree
over the hashes of the keys. More precisely, every node is (for now)
either a leaf (a `key, value` pair) or a 32-element array (containing
its children). To traverse the structure, we hash our key and consume
the bits of this hash 5 at a time. At each level we thus have a 5-bit
chunk, a number between 0 and 31 that indicates which child to choose.

If that child is `Empty`, then the key is not present in the structure.
Insertion is then trivial: just replace it with your `Leaf (k, v)`. If
we encounter a leaf, either its key is the one we are looking for, or…
it isn’t. If it is, insertion is done by changing the value. If it is
not, insert an internal node (an array) at this place, take the chunks
of both keys’ hashes corresponding to this depth, and put each leaf in
its place (you may have to do this several times if the chunks turn out
to be the same). If we encounter an internal node, just take the next 5
bits and continue the descent.

At the end of the hash, if we are still on an internal node, either we
use open hashes (then there is no “end of the hash”, but we usually
cannot guarantee termination), or we place buckets in the last arrays
instead of leaves.
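The chunk extraction described above can be sketched in a few lines. This is only an illustration, not the library’s code: the names `chunk` and `path` are hypothetical, and it assumes that the low-order bits are consumed first and that the hash fits in 30 bits (as the result of `Hashtbl.hash` does).

```ocaml
let shift_step = 5
let mask = (1 lsl shift_step) - 1 (* 0b11111 *)

(* Chunk of [hash] used once [shift] bits have already been consumed. *)
let chunk hash shift = (hash lsr shift) land mask

(* Successive child indices followed for a 30-bit hash: 6 levels of 5 bits. *)
let path hash = List.init 6 (fun level -> chunk hash (level * shift_step))
```

For instance, `path 0b00001_00010_00011` selects child 3 at the root, then child 2, then child 1.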

If we use a well-distributed hash, our structure will therefore have
a depth of roughly `log32(n)`. Let’s estimate what this means: if we
want to build a structure associating each word of *War and Peace* with
its number of occurrences, we only need a depth of 4. If our keys
represent every person on Earth, we need a depth of 7. If they are every
single atom in the universe, it’s less than 60. Of course, our hash
function will not be perfect, but these numbers show that with very few
chunk extractions and comparisons, we can index a very large key set. We
also need few array copies when we modify something in our table
(because, to be persistent, we need to copy the arrays leading from the
root to our tip node: that’s the cost of fast indexing. *Boire ou
conduire, il faut choisir*: drink or drive, you have to choose.)
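These depths are easy to check. The sketch below (the function name `depth` is mine) finds the smallest depth whose capacity, `32^depth`, covers `n` keys, assuming a perfectly distributed hash. The key counts used in the example are rough orders of magnitude, not precise figures.

```ocaml
(* Smallest depth d such that 32^d >= n, for n given as a float. *)
let depth n =
  let rec go d capacity =
    if capacity >= n then d else go (d + 1) (capacity *. 32.)
  in
  go 0 1.
```

With about 7 billion people, `depth 7e9` gives 7; with an assumed `1e80` atoms, `depth 1e80` gives 54.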

However, we have a problem with this representation: each array uses
constant space, even if it is almost empty. If our structure is not
very dense, we can use up to 32 times more space than needed. On big
datasets, this can be problematic. There is an (elegant) trick to fix
this: rather than arrays of size 32 at each level, we only use arrays
that have exactly as many cells as children. To get the index in our
array from our “virtual” index (the value of our chunk), we use a
*bitmap*: a 32-bit integer, where each bit indicates the presence of a
child of our node at that position. The order of the elements in the
array follows the order of the chunks, so to get the real index of a
chunk, we just have to count the 1-bits at positions lower than it in
the bitmap. Using this method, every cell of an array is useful, and if
we matched the size of our structure to the machine as I mentioned
before (remember, the branching factor), the bitmap fills exactly one
machine word: space use is close to optimal.
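As a minimal sketch of this lookup (the names `popcount`, `index` and `child` are hypothetical, and the bit count is a deliberately naive loop):

```ocaml
(* Naive bit count, good enough for a sketch. *)
let rec popcount x = if x = 0 then 0 else (x land 1) + popcount (x lsr 1)

(* Real index of [chunk]: number of 1-bits strictly below position [chunk]. *)
let index bitmap chunk = popcount (bitmap land ((1 lsl chunk) - 1))

(* Child stored for [chunk] in the compact array, if there is one. *)
let child bitmap children chunk =
  if bitmap land (1 lsl chunk) = 0 then None
  else Some children.(index bitmap chunk)
```

With the bitmap `0b10010010` (children at virtual positions 1, 4 and 7) and the compact array `[| "a"; "b"; "c" |]`, the child for chunk 4 sits at real index 1.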

In his article, Phil Bagwell also uses a root table which is treated
differently. I did not use this method in my implementation, so I will
not say much about it: just know that the first table is resized as the
number of elements in the structure increases, to guarantee
constant-time access to every key (at the cost of occasional resizing,
which seems to be amortized over insertions^{2}).

Finally, let’s talk about “theoretical performance”: our complexity
is the same as that of balanced binary trees, namely `O(log(n))` for
elementary operations (insertion, deletion, search), except that our
`log` is not the same and grows very slowly (but, well, maths is strict
and `O(.)` does not want to know about that). Misses are faster than
successful searches: if a key is not present in the tree, and if the
hash function has good properties, we should quickly find a difference
between the hash of our key and those of the keys in our table. But in
this kind of problem, the real determining factor is the multiplicative
constant behind our asymptotic complexity: we saw in the first part
that, in OCaml, this factor seems to be good enough to give us high
hopes for the structure. Let’s now examine the practical implementation.

### In OCaml: a few lines of code.

In this part, I will not comment on every detail of my implementation, but simply show a few lines and give short explanations of how things work. My implementation is strongly inspired by exclipy’s in Haskell.

Let’s start with counting the 1-bits in a bitmap: modern processors
offer a `POPCOUNT` instruction which does the job. However, OCaml does
not expose this instruction: we have to code it ourselves.

```
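(* Bit masks (skff is not used below). *)
let sk5 = 0x55555555
let sk3 = 0x33333333
let skf0 = 0xf0f0f0f
let skff = 0xff00ff

(* Count the 1-bits of a 32-bit bitmap. *)
let ctpop map =
  (* 2-bit sums, then 4-bit sums, then per-byte sums *)
  let map = map - (map lsr 1) land sk5 in
  let map = map land sk3 + (map lsr 2) land sk3 in
  let map = map land skf0 + (map lsr 4) land skf0 in
  (* add the four byte counts; the total (at most 32) fits in 6 bits *)
  let map = map + map lsr 8 in
  (map + map lsr 16) land 0x3f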
```

This code is adapted from the C version Phil Bagwell gives in his
paper. I have no clue why it works (and I do not want to), but well, it
does. It counts the number of 1-bits in a bitmap, and when you only want
those below a given position `n`, just use `land` to set the other bits
to `0` first. The interesting point is that this is a piece of code
which could be made faster: in an ideal world where everybody loved
HAMT, we could imagine OCaml providing built-in support for `CTPOP`.
Anyway, thinking about the performance of this function will not be a
waste of time once the era of micro-optimisation has come.
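Since the bit tricks are rather opaque, a cheap way to gain confidence is to compare the function with a naive loop. The sketch below repeats `ctpop` so that it is self-contained. `naive_popcount` and `agree` are hypothetical helper names, and the constants assume a 64-bit OCaml where `int` holds at least 32 bits.

```ocaml
let sk5 = 0x55555555
let sk3 = 0x33333333
let skf0 = 0xf0f0f0f

let ctpop map =
  let map = map - (map lsr 1) land sk5 in
  let map = map land sk3 + (map lsr 2) land sk3 in
  let map = map land skf0 + (map lsr 4) land skf0 in
  let map = map + map lsr 8 in
  (map + map lsr 16) land 0x3f

(* Reference implementation: test each of the 32 bits in turn. *)
let naive_popcount x =
  let count = ref 0 in
  for i = 0 to 31 do
    if x land (1 lsl i) <> 0 then incr count
  done;
  !count

(* True when both implementations agree on the 32-bit value [x]. *)
let agree x = ctpop x = naive_popcount x
```

Checking `agree` on a few edge cases (`0`, `0xFFFFFFFF`, a single high bit) and on random 32-bit values is usually enough to catch a transcription mistake.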

Here is the definition of the HAMT type:

```
type 'a t =
  | Empty
  | Leaf of int * key * 'a
  | HashCollision of int * (key * 'a) list
  | BitmapIndexedNode of int * 'a t array
  | ArrayNode of int * 'a t array
```

Buckets are managed as association lists: this is probably vulnerable
to hash collision attacks. Using Maps would probably fix this problem,
at the cost of requiring comparable keys. We could also use open hashes:
to avoid committing to one choice, I simply used lists and kept the
`Leaf` constructor (instead of a `HashCollision` with a one-element
list), so that the functions can easily be modified if another approach
turns out to be preferable.

As for `BitmapIndexedNode` and `ArrayNode`, they play the role of Phil
Bagwell’s root table: rather than making it grow when inserting a lot of
data, I transform bitmap arrays into full-length arrays when they are
dense enough. There is then no need to compute a `CTPOP`, and there is
no big loss of space because the constructor is only used when arrays
are already well filled. The performance impact remains to be analysed:
we could think that using only `ArrayNode` would lead to better
performance (at the cost of a lot of space), but in fact, it noticeably
lowers it. Then, if bitmap-indexed arrays are that fast, why bother with
`ArrayNode`? Because using only `BitmapIndexedNode` also lowers
performance. The exact frontier (in terms of how full the array is)
between the two constructors is not yet determined, but these values
seem to make it behave well:

```
module StdConfig : CONFIG = struct
  let shift_step = 5 (* 2^5 = 32 children per node *)
  let bmnode_max = 16
  let arrnode_min = 8
end
```

```
module StdConfig32 : CONFIG = struct
  let shift_step = 4 (* 2^4 = 16 children per node *)
  let bmnode_max = 8
  let arrnode_min = 4
end
```
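To make the conversion between the two representations concrete, here is a minimal sketch of turning a compact, bitmap-indexed array into a full 32-slot array. It uses a simplified node type and a hypothetical `expand` function, not the library’s own definitions.

```ocaml
(* Simplified node type, for illustration only. *)
type 'a node = Empty | Leaf of 'a

(* Spread the compact [children] array over 32 slots, following [bitmap]. *)
let expand bitmap children =
  let full = Array.make 32 Empty in
  let next = ref 0 in
  for i = 0 to 31 do
    if bitmap land (1 lsl i) <> 0 then begin
      full.(i) <- children.(!next);
      incr next
    end
  done;
  full
```

After expansion, a child is reached by indexing directly with the chunk, so no bit counting is needed; the price is the unused `Empty` slots.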

There is also a point worth noting: since OCaml reserves one bit of each integer for garbage collection purposes, a native `int` on a 64-bit architecture holds only 63 bits, so we can only use 32-bit bitmaps. This also leads to a performance issue: shifts, used intensively in HAMT (as in any bit-manipulating code), are not plain processor shifts, because this extra bit has to be preserved. This cannot easily be solved in OCaml, so we must take it into account when reasoning about the performance we can expect (and the tests showed that the structure was still interesting).
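As an illustration of this constraint, the bitmap width and the matching shift step can be derived from the size of a native `int`. This sketch is not part of the library: `bitmap_bits` is a hypothetical name, and `Sys.int_size` only exists in OCaml 4.03 and later (on older versions, `Sys.word_size - 1` gives the same value).

```ocaml
(* Largest power-of-two bitmap width that fits in a native int. *)
let bitmap_bits =
  let rec grow w = if 2 * w <= Sys.int_size then grow (2 * w) else w in
  grow 1

(* Number of hash bits consumed per level: log2 of the bitmap width. *)
let shift_step =
  let rec log2 w = if w <= 1 then 0 else 1 + log2 (w / 2) in
  log2 bitmap_bits
```

With 63-bit integers this yields a 32-bit bitmap and a shift step of 5; with 31-bit integers, a 16-bit bitmap and a shift step of 4.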

The general purpose modification function is this one:

```
val alter : key -> ('a option -> 'a option) -> 'a t -> 'a t
```

It uses the `option` type to be generic over the modification you want
to implement: this makes it possible to write a single big function (73
lines) that handles all possible cases, but it could degrade
performance. Once the module is stable, it could be specialised. The
same sort of interface is used for the `alter_all` function, which is
specialised into `map`, `filter` and the like.
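To show how one `option`-based function covers the usual operations, here is a sketch in which an association list stands in for the HAMT. The `alter` below is a toy with the same argument order, not the real 73-line function, and `add`, `remove` and `update` are illustrative derivations.

```ocaml
(* Toy stand-in: an association list plays the role of the HAMT. *)
let alter k f table =
  let current = List.assoc_opt k table in
  let rest = List.remove_assoc k table in
  match f current with
  | None -> rest               (* delete the binding, or leave it absent *)
  | Some v -> (k, v) :: rest   (* insert or replace the binding *)

let add k v table = alter k (fun _ -> Some v) table
let remove k table = alter k (fun _ -> None) table
let update k f table =
  alter k (function None -> None | Some v -> Some (f v)) table
```

The callback receives `None` when the key is absent and returns `None` to request a deletion, so insertion, removal and in-place update all go through the same traversal.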

As we just saw, HAMT use many arrays. Therefore, when inserting an
element, you need to copy all the arrays from it up to the root of the
structure to keep it persistent. When you need to insert many values at
once, copying the ancestors for every element would be needlessly slow:
we can just copy the destination HAMT first, and then use mutability to
modify it in place before returning it. Internally, mutability allows
better performance, and with appropriate module interfaces we can
prevent users from unintentionally breaking their code by mutating their
structures. The *real* alter function (not exported) thus has this type:

```
val alter_node :
  ?mute:bool -> int -> int -> Key.t ->
  ('a option -> 'a option) -> 'a t -> 'a t
```

The first `int` parameter is the hash shift value at the given depth
(it is `0` at the beginning and is incremented at each recursive call),
and the second one is the hash of the key (which is computed once and
passed down recursively rather than recomputed for every chunk). We can
therefore define another function, useful for imports:

```
(* In-place insertion: only safe on a HAMT that nobody else holds. *)
let add_mute k v hamt =
  alter_node ~mute:true 0 (hash k) k (fun _ -> Some v) hamt

module Import =
struct
  module type FOLDABLE = sig
    type key
    type 'v t
    val fold : (key -> 'v -> 'a -> 'a) -> 'v t -> 'a -> 'a
  end

  module Make (M : FOLDABLE with type key = key) =
  struct
    (* Copy the destination once, then mutate the copy in place. *)
    let add_from x hamt = M.fold add_mute x (copy hamt)
    let from x = add_from x Empty
  end
end
```

The `Import` module provides a function to import many values into a
HAMT, empty or not, from any structure that you can fold. It uses
mutability to speed up the process. Obviously, if your destination HAMT
is already big and your source structure quite small, copying the whole
big HAMT before mutating it a few times can be slower than copying the
path from the root to each inserted value a few times. Where the
break-even point lies is for the user to establish (because a
theoretical formula with logarithms would probably not be that relevant
in real code).
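The copy-then-mutate pattern itself does not depend on HAMT. As a rough analogy, and only as an analogy, the sketch below uses a standard `Hashtbl` as the destination and a `Map` as the foldable source; `add_from` here is a hypothetical stand-in for the function of the same name above.

```ocaml
module StringMap = Map.Make (String)

(* Copy the destination once, then mutate the copy in place.
   The caller's table is left untouched. *)
let add_from (source : int StringMap.t) (dest : (string, int) Hashtbl.t) =
  let result = Hashtbl.copy dest in
  StringMap.iter (fun k v -> Hashtbl.replace result k v) source;
  result
```

The single up-front copy is what makes the in-place updates invisible from outside, and it is also the fixed cost that makes this approach a poor fit for importing a handful of values into a very large table.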

The rest of the code is quite straightforward once you have understood the principles of the structure.

### Conclusion

The HAMT data structure appears to be quite promising for OCaml. Both its theoretical complexity and its practical performance make it a good competitor to the good old Map, as long as your program does not already rely on comparisons.

However, this implementation should be considered a proof of concept rather than the final and optimal way to implement HAMT. There are still many behaviours to be tested and understood, and some quickly coded functions that could benefit from more thorough thought.

1. *Ideal Hash Trees*, 2001, Phil Bagwell, Institute of Core Computing Science, Swiss Institute of Technology Lausanne.

2. I must confess that I think I did not totally understand this part, because I do not exactly come to the same conclusions.