Tuesday, October 8th, 2019
Talk: A new allocator for OCaml
Speaker: Damien Doligez
Live notes by Gabriel Scherer

Damien: this is work I did during the summer, changing the allocator of the OCaml Garbage Collector (GC). The result turns out to work very well.

The allocator sits at the center, between the program and the GC. Minor heap allocation is trivial. The major GC is incremental, so it does not compact on each collection; we have alternations of live blocks and free blocks in the major heap. The allocator must find the right free block for an allocation request. (When I'm talking about the major heap, note that the minor GC is part of the program: it is a client of the allocator (when promoting), just like the user program.)

Gabriel: when you promote, do you allocate one big block for all promoted words, or do you allocate each separately?
Damien: I allocate each separately.
Gabriel: so you have a big batch of small allocations coming in at roughly the same time.

Two goals: avoid overhead, avoid fragmentation. Existing strategies: next-fit, first-fit.

Next-fit: written in 1988, following the description by Knuth (1973). Free blocks are chained in a linked list, in increasing order of addresses. The free list has to be sorted because the GC is the only one freeing blocks; it does so by sweeping, in increasing memory address order. Having the free list sorted lets sweeping be implemented nicely, as a merge of in-order data structures.

Jean-Christophe: if in a free block you have the size of the free block, you can merge blocks on the fly [...]; you don't have to impose an ordering.
Damien: you need either a word for each free block (too expensive on Lisp-style workloads with small blocks), or a bit in each header that follows a free block, and that is also expensive, at least on 32 bits.

Problem with next-fit: when you find a large block, all small allocations will go there and fragment that block. Next-fit is not very good, but it's surprising that it took 25 years to notice that.
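The next-fit strategy just described can be sketched like this: a roving pointer resumes the search where the last allocation left off, over a free list sorted by address. This is a toy illustration, not the runtime's actual code; the struct and function names are made up.

```c
#include <stddef.h>

/* Hypothetical sketch of next-fit over an address-sorted free list. */
typedef struct block {
    size_t size;          /* size of this free block, in words */
    struct block *next;   /* next free block, by increasing address */
} block_t;

static block_t *roving;   /* next-fit resumes the search here */

/* Search from the roving pointer, wrapping around the list at most once.
   Returns the first block large enough, or NULL if none fits. */
static block_t *next_fit(block_t *head, size_t want) {
    block_t *start = roving ? roving : head;
    block_t *b = start;
    do {
        if (b->size >= want) {
            roving = b->next ? b->next : head;
            return b;
        }
        b = b->next ? b->next : head;
    } while (b != start);
    return NULL;  /* no block fits: the real GC would sweep or grow the heap */
}
```

Note how a small request that lands on a large block carves it up, which is exactly the fragmentation problem Damien points out.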
20 years later (2008), we noticed the problem, and I decided to write a first-fit allocation strategy: allocate from the start of the list each time, with an auxiliary table to "skip" some pointers on the list. Counter-intuitive, but it works better than next-fit: small blocks are allocated at the beginning of the list, and this preserves nice big blocks towards the end. It's kind of naturally segregated to preserve big free blocks. (typo noted: it fails *to* preserve the "end block")

First-fit is usually fast, but there are bad cases where it is extremely slow (in some cases it eats 50% of the program runtime!). First-fit assumes that newly-allocated memory (from malloc) is at the end of the program's memory space. This is false when ASLR is active (Address Space Layout Randomization, a security feature of modern systems).

Gabriel: can you relax the ordering requirement?
Jacques-Henri: sort in virtual-memory order?
Damien: that doesn't work because of allocation colors.
Jean-Christophe: one free list per [2^k..2^{k+1}[ size class?
Damien: you still have to search in the free list. Another solution is to require that all blocks have a size 2^p, but then you have "internal fragmentation".

Wilson et al. wrote a 1995 survey on memory management. I should have read it 25 years ago! It's great, everyone should read it. It criticizes the way allocators have been evaluated, with a uniform distribution of allocation sizes: real programs don't follow a uniform distribution at all, so this evaluation is flawed.

Two important ideas I got from the survey:
- for small sizes, have one free list per size
- for large sizes, use a best-fit policy

Best-fit is when you consider the whole set of available blocks and choose the smallest one that can satisfy the request. This is intuitively rather good. It has a tendency to leave smaller leftover blocks, which is good for OCaml, as we allocate a lot of very small blocks (99% of allocations have sizes in [1; 5]).

Segregated-fit for small sizes.
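The first idea from the survey, one free list per small size, can be sketched as follows. This is a toy model (the size limit, names, and lack of block headers are all assumptions); allocating a small size is a single pop from the exact-size list.

```c
#include <stddef.h>

#define NUM_SMALL 16   /* assumption: sizes 1..16 words get a dedicated list */

typedef struct cell { struct cell *next; } cell_t;

/* small_fl[s] heads the free list of blocks of exactly s words. */
static cell_t *small_fl[NUM_SMALL + 1];

/* Segregated-fit allocation: pop from the exact-size list if non-empty. */
static void *alloc_small(size_t s) {
    if (s <= NUM_SMALL && small_fl[s] != NULL) {
        cell_t *b = small_fl[s];
        small_fl[s] = b->next;
        return b;
    }
    return NULL;  /* empty: fall through to the best-fit path for large blocks */
}

/* Return a block of size s to its free list (pushed at the head). */
static void free_small(void *p, size_t s) {
    cell_t *b = p;
    b->next = small_fl[s];
    small_fl[s] = b;
}
```

Since almost all OCaml allocations fall in this small range, the common path is a constant-time list pop, with no searching at all.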
If the free list of the desired size is empty, split a free block from the size just above (best-fit kicks in as the general strategy there).

Best-fit tree structure

Difference between strategy and policy: you could take from either end of the doubly-linked lists.
Gabriel: doesn't a doubly-linked list consume more space than a singly-linked one?
Jean-Christophe: but those are free blocks of size > N, so you have plenty of metadata space available in the block itself.

New idea: use a splay tree as the tree data structure. (When Standish proposed this two-sided design in 1980, splay trees were not invented yet; Sleator and Tarjan, 1985.)

Optimization: empty free lists

If you want to allocate a size whose free list is empty, you have to find the smallest bigger non-empty free list. Use a bitfield to know which free lists are non-empty. If all free lists are empty, cut a piece from the smallest block of the large-block tree. In most cases, this block can stay where it is in the tree, so this is very fast.

Removing a free block during sweeping (coalescing)

Some benchmark results

Very good results: better than both other policies. Some cases are pathological for either next-fit or first-fit, and they work well with best-fit. (Jane Street benchmark results are reportedly very good, but the code is not public. Alain Frisch tried on both internal LexiFi programs and synthetic code. We could/should try on Coq and Why3.)

Conclusions

I have to add Jacques-Henri to the Credits line.

Questions

Francesco: regarding the benchmarks, you are explaining the result of the benchmark by saying that the collection is called more often in the best-fit case. In my experience, common GCs try to allocate before they make a collection: if they can allocate, they don't have to do a collection.
Damien: OCaml's GC is incremental, so it collects on the fly. The space_overhead parameter decides the ratio of program time vs. incremental-collection time.
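The empty-free-list bitfield mentioned above can be sketched with a find-first-set instruction: one bit per small size class, and a count-trailing-zeros to jump straight to the smallest non-empty list. A hedged illustration, assuming at most 32 small sizes; `__builtin_ctz` is the GCC/Clang intrinsic, and the names are made up.

```c
#include <stdint.h>

/* Bit s is set iff the free list for size s+1 is non-empty. */
static uint32_t nonempty;

/* Return the smallest size >= want whose free list is non-empty,
   or -1 if all candidate lists are empty (then: go to the splay tree).
   Sizes are 1-based and assumed <= 32 for this sketch. */
static int smallest_fit(int want) {
    /* Mask off the bits for sizes strictly below the request. */
    uint32_t mask = nonempty & ~((1u << (want - 1)) - 1u);
    if (mask == 0)
        return -1;
    /* __builtin_ctz finds the lowest set bit in a single instruction. */
    return __builtin_ctz(mask) + 1;
}
```

This makes "find the smallest bigger non-empty free list" a couple of bit operations instead of a scan over the size classes.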
Francesco: you say most allocations are small; why can't they be stack-allocated?
Gabriel: the minor GC is very fast (a decrement and a check), not much slower than a stack allocation.
Gabriel: has someone worked on stack-allocation optimizations?
Damien: I don't know.

Kim: the Java people switched from Hotspot to G1. Jean-Christophe has a (synthetic) program that is 4x as slow with G1. Do you think we may have the same problem, where some workloads are noticeably slower?
Jean-Christophe: my program was allocating larger and larger arrays (of size 10*N, for N from 1 to K), with only the last array being live.
Damien: that sounds like a worst case for fragmentation.
Jacques-Henri: perhaps the worst case for best-fit is when the distribution of allocation sizes is completely random.
Damien: maybe we could try that and see what happens.

Jean-Christophe: when the allocator calls malloc, what are the sizes of the blocks asked of malloc?
Damien: it used to be fixed sizes, but now it's a percentage of the heap size. The percentage was designed to avoid issues in next-fit (amortizing the cost of the complete list traversal). We could probably go back to fixed increments with best-fit now.
Jean-Christophe: you start with one large block in the splay tree. When you malloc a new block, you put it on the free list and then you split. If you never malloc, you repeatedly split from the initial allocation and the splay tree has only one node.

Gabriel: do you need more reviews? (Both to catch issues, and to increase the number of people who are familiar with the implementation.)
Damien: I don't know.
Gabriel: could you integrate some of your slides into the code as documentation comments?
Jacques-Henri: what I used to understand the invariants is the "check" function that spells out the invariants. I think it would be good to make all the invariants and the purposes of the data structures explicit in the code.
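Jean-Christophe's growing-array workload can be sketched as a toy model. The accounting below assumes a non-moving allocator in which each freed slot is always too small for the next, strictly larger, request, so every old array leaves an unusable hole; all names are hypothetical.

```c
#include <stdlib.h>
#include <stddef.h>

/* Sketch of the pathological workload: allocate arrays of size 10*N for
   N = 1..k, keeping only the last one live. Returns the total number of
   words stranded in holes that are each too small for the next request. */
static size_t wasted_words(int k) {
    size_t waste = 0;
    size_t *arr = NULL;
    for (int n = 1; n <= k; n++) {
        size_t *next = malloc(10 * (size_t)n * sizeof *next);
        if (arr != NULL)
            waste += 10 * (size_t)(n - 1); /* old slot: too small to reuse */
        free(arr);
        arr = next;
    }
    free(arr);
    return waste;
}
```

Under this model the stranded space grows quadratically in k (10 * (1 + 2 + ... + (k-1)) words), which is why a non-compacting collector can look so bad on it.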