Tuesday, October 8th, 2019
Talk: A new allocator for OCaml
Speaker: Damien Doligez
Live notes by Gabriel Scherer

Damien: this is work I did during the summer, changing the allocator of the OCaml Garbage Collector (GC). The result turns out to work very well.

The allocator sits at the center, between the program and the GC. Minor heap allocation is trivial. The major GC is incremental, so it does not compact on each collection; we have alternations of live blocks and free blocks in the major heap. The allocator must find the right free block for an allocation request. (When I'm talking about the major heap, note that the minor GC is part of the program: it is a client of the allocator (when promoting), just like the user program.)

Gabriel: when you promote, do you allocate one big block for all promoted words, or do you allocate each separately?
Damien: I allocate each separately.
Gabriel: so you have a big batch of small allocations coming in at roughly the same time.

Two goals: avoid overhead, avoid fragmentation. Existing strategies: next-fit, first-fit.

Next-fit: written in 1988, following the description by Knuth (1973). Free blocks are chained in a linked list, in increasing order of addresses. The free list has to be sorted because the GC is the only one freeing blocks; it does so by sweeping, in increasing memory address order. Having the free list sorted lets sweeping be implemented nicely, as a merge of in-order data structures.

Jean-Christophe: if in a free block you have the size of the free block, you can merge blocks on the fly [...]; you don't have to impose an ordering.
Damien: you need either a word for each free block (too expensive on Lisp-style workloads with small blocks), or a bit in each header that follows a free block, and that is also expensive, at least on 32 bits.

Problem with next-fit: when you find a large block, all small allocations will go there and fragment that block. Next-fit is not very good, but it's surprising that it took 25 years to notice that.
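The next-fit strategy just described can be sketched like this: a roving pointer resumes the search where the last allocation left off, over a free list sorted by address. This is a toy illustration, not the runtime's actual code; the struct and function names are made up.

```c
#include <stddef.h>

/* Hypothetical sketch of next-fit over an address-sorted free list. */
typedef struct block {
    size_t size;          /* size of this free block, in words */
    struct block *next;   /* next free block, by increasing address */
} block_t;

static block_t *roving;   /* next-fit resumes the search here */

/* Search from the roving pointer, wrapping around the list at most once.
   Returns the first block large enough, or NULL if none fits. */
static block_t *next_fit(block_t *head, size_t want) {
    block_t *start = roving ? roving : head;
    block_t *b = start;
    do {
        if (b->size >= want) {
            roving = b->next ? b->next : head;
            return b;
        }
        b = b->next ? b->next : head;
    } while (b != start);
    return NULL;  /* no block fits: the real GC would sweep or grow the heap */
}
```

Note how a small request that lands on a large block carves it up, which is exactly the fragmentation problem Damien points out.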
20 years later (2008), we noticed the problem, and I decided to write a first-fit allocation strategy: allocate from the start of the list each time, with an auxiliary table to "skip" some pointers on the list. Counter-intuitive, but it works better than next-fit: small blocks are allocated at the beginning of the list, and this preserves nice big blocks towards the end. It's kind of naturally segregated to preserve big free blocks. (typo noted: it fails *to* preserve the "end block")

First-fit is usually fast, but there are bad cases where it is extremely slow (in some cases it eats 50% of the program runtime!). First-fit assumes that newly-allocated memory (from malloc) is at the end of the program's memory space. This is false when ASLR is active (Address Space Layout Randomization, a security feature of modern systems).

Gabriel: can you relax the ordering requirement?
Jacques-Henri: sort in virtual-memory order?
Damien: that doesn't work because of allocation colors.
Jean-Christophe: one free list per [2^k..2^{k+1}[ size class?
Damien: you still have to search in the free list. Another solution is to require that all blocks have a size 2^p, but then you have "internal fragmentation".

Wilson et al. wrote a 1995 survey on memory management. I should have read it 25 years ago! It's great, everyone should read it. It criticizes the way allocators have been evaluated, with a uniform distribution of allocation sizes: real programs don't follow a uniform distribution at all, so this evaluation is flawed.

Two important ideas I got from the survey:
- for small sizes, have one free list per size
- for large sizes, use a best-fit policy

Best-fit is when you consider the whole set of available blocks and choose the smallest one that can satisfy the request. This is intuitively rather good. It has a tendency to leave smaller leftover blocks, which is good for OCaml, as we allocate a lot of very small blocks (99% of allocations have sizes in [1; 5]).

Segregated-fit for small sizes.
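The first idea from the survey, one free list per small size, can be sketched as follows. This is a toy model (the size limit, names, and lack of block headers are all assumptions); allocating a small size is a single pop from the exact-size list.

```c
#include <stddef.h>

#define NUM_SMALL 16   /* assumption: sizes 1..16 words get a dedicated list */

typedef struct cell { struct cell *next; } cell_t;

/* small_fl[s] heads the free list of blocks of exactly s words. */
static cell_t *small_fl[NUM_SMALL + 1];

/* Segregated-fit allocation: pop from the exact-size list if non-empty. */
static void *alloc_small(size_t s) {
    if (s <= NUM_SMALL && small_fl[s] != NULL) {
        cell_t *b = small_fl[s];
        small_fl[s] = b->next;
        return b;
    }
    return NULL;  /* empty: fall through to the best-fit path for large blocks */
}

/* Return a block of size s to its free list (pushed at the head). */
static void free_small(void *p, size_t s) {
    cell_t *b = p;
    b->next = small_fl[s];
    small_fl[s] = b;
}
```

Since almost all OCaml allocations fall in this small range, the common path is a constant-time list pop, with no searching at all.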
If the free list of the desired size is empty, split a free block from the size just above (best-fit kicks in as the general strategy there).

Best-fit tree structure

Difference between strategy and policy: you could take from either end of the doubly-linked lists.
Gabriel: doesn't a doubly-linked list consume more space than a singly-linked one?
Jean-Christophe: but those are free blocks of size > N, so you have plenty of metadata space available in the block itself.

New idea: use a splay tree as the tree data structure. (When Standish proposed this two-sided design in 1980, splay trees were not invented yet; Sleator and Tarjan, 1985.)

Optimization: empty free lists

If you want to allocate a size whose free list is empty, you have to find the smallest bigger non-empty free list. Use a bitfield to know which free lists are non-empty. If all free lists are empty, cut a piece from the smallest block of the large-block tree. In most cases, this block can stay where it is in the tree, so this is very fast.

Removing a free block during sweeping (coalescing)

Some benchmark results

Very good results: better than both other policies. Some cases are pathological for either next-fit or first-fit, and they work well with best-fit. (Jane Street benchmark results are reportedly very good, but the code is not public. Alain Frisch tried on both internal LexiFi programs and synthetic code. We could/should try on Coq and Why3.)

Conclusions

I have to add Jacques-Henri to the Credits line.

Questions

Francesco: regarding the benchmarks, you are explaining the result of the benchmark by saying that the collection is called more often in the best-fit case. In my experience, common GCs try to allocate before they make a collection: if they can allocate, they don't have to do a collection.
Damien: OCaml's GC is incremental, so it collects on the fly. The space_overhead parameter decides the ratio of program time vs. incremental-collection time.
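The empty-free-list bitfield mentioned above can be sketched with a find-first-set instruction: one bit per small size class, and a count-trailing-zeros to jump straight to the smallest non-empty list. A hedged illustration, assuming at most 32 small sizes; `__builtin_ctz` is the GCC/Clang intrinsic, and the names are made up.

```c
#include <stdint.h>

/* Bit s is set iff the free list for size s+1 is non-empty. */
static uint32_t nonempty;

/* Return the smallest size >= want whose free list is non-empty,
   or -1 if all candidate lists are empty (then: go to the splay tree).
   Sizes are 1-based and assumed <= 32 for this sketch. */
static int smallest_fit(int want) {
    /* Mask off the bits for sizes strictly below the request. */
    uint32_t mask = nonempty & ~((1u << (want - 1)) - 1u);
    if (mask == 0)
        return -1;
    /* __builtin_ctz finds the lowest set bit in a single instruction. */
    return __builtin_ctz(mask) + 1;
}
```

This makes "find the smallest bigger non-empty free list" a couple of bit operations instead of a scan over the size classes.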
Francesco: you say most allocations are small; why can't they be stack-allocated?
Gabriel: the minor GC is very fast (a decrement and a check), not much slower than a stack allocation.
Gabriel: has someone worked on stack-allocation optimizations?
Damien: I don't know.

Kim: the Java people switched from Hotspot to G1. Jean-Christophe has a (synthetic) program that is 4x as slow with G1. Do you think we may have the same problem, where some workloads are noticeably slower?
Jean-Christophe: my program was allocating larger and larger arrays (of size 10*N, for N from 1 to K), with only the last array being live.
Damien: that sounds like a worst case for fragmentation.
Jacques-Henri: perhaps the worst case for best-fit is when the distribution of allocation sizes is completely random.
Damien: maybe we could try that and see what happens.

Jean-Christophe: when the allocator calls malloc, what are the sizes of the blocks asked of malloc?
Damien: it used to be fixed sizes, but now it's a percentage of the heap size. The percentage was designed to avoid issues in next-fit (amortizing the cost of the complete list traversal). We could probably go back to fixed increments with best-fit now.
Jean-Christophe: you start with one large block in the splay tree. When you malloc a new block, you put it on the free list and then you split. If you never malloc, you repeatedly split from the initial allocation and the splay tree has only one node.

Gabriel: do you need more reviews? (Both to catch issues, and to increase the number of people who are familiar with the implementation.)
Damien: I don't know.
Gabriel: could you integrate some of your slides into the code as documentation comments?
Jacques-Henri: what I used to understand the invariants is the "check" function that spells out the invariants. I think it would be good to make all the invariants and the purposes of the data structures explicit in the code.
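Jean-Christophe's growing-array workload can be sketched as a toy model. The accounting below assumes a non-moving allocator in which each freed slot is always too small for the next, strictly larger, request, so every old array leaves an unusable hole; all names are hypothetical.

```c
#include <stdlib.h>
#include <stddef.h>

/* Sketch of the pathological workload: allocate arrays of size 10*N for
   N = 1..k, keeping only the last one live. Returns the total number of
   words stranded in holes that are each too small for the next request. */
static size_t wasted_words(int k) {
    size_t waste = 0;
    size_t *arr = NULL;
    for (int n = 1; n <= k; n++) {
        size_t *next = malloc(10 * (size_t)n * sizeof *next);
        if (arr != NULL)
            waste += 10 * (size_t)(n - 1); /* old slot: too small to reuse */
        free(arr);
        arr = next;
    }
    free(arr);
    return waste;
}
```

Under this model the stranded space grows quadratically in k (10 * (1 + 2 + ... + (k-1)) words), which is why a non-compacting collector can look so bad on it.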