This is a call to arms: to improve the OCaml compiler, we sorely need a suite of representative benchmarks. If you have ever written performance-sensitive programs in OCaml, you can contribute.
PS: When I originally started to write this blog post, it was about a small experiment on optimizing a small program for float unboxing. Writing a general introduction on the few joys and numerous pitfalls of compiler optimizations suddenly turned it into a totally different post. Oh well.
Having a simple compiler is a valid design choice
The OCaml native compiler is well-known for its simple and robust behavior. Thanks to efficient choices of value representation and a sophisticated garbage collector (GC), it is able to make good OCaml programs run fast. But it doesn’t actually optimize in clever ways: efficient code will give an efficient program, but little is done to make inefficient code efficient.
This approach has some nice advantages. On the user side, it makes for a predictable compiler (no hoping for the killer optimization to kick in, only to find that it doesn’t because of a small change). On the developer side, we have a well-written and rather simple post-frontend that is easy to understand and maintain. Compilation times are also usually very short: together with the good support for separate compilation, this means compilation time is rarely a problem in OCaml, as it can be in some other compiled languages (cough C++ cough).
Yet this also comes with drawbacks. For example, “inefficient” code is rarely written by OCaml programmers (smart men and women), of course :), but is easy to obtain when you’re doing automatic code generation; you may have to be a bit careful if you produce OCaml programs to not spit out stupid code, while a general whole-program optimizer could have taken care of this for you.
Another problem is that some abstractions are more costly than we would like them to be. It’s hard to write monadic code, for example, without generating a lot of (often unnecessary) closures. While the cost of the additional closure allocations and function calls is often drowned out by the overall monadic logic (for example thread scheduling in monadic cooperative threading libraries such as Lwt and Async), fine-grained monadic style can still have a noticeable overhead.
For example, Arnaud Spiwack reported that switching the Coq internal tactics to a pervasive monadic style incurred a noticeable overhead (~15% slowdown on the whole Coq archive, with some individual tactics getting several times slower). Using Coq’s inlining engine to maximally reduce monadic closures, he was able to get a 10% performance gain over what “idiomatic monadic OCaml code” produces through the OCaml compiler. (Note that inlining is easier to do in Coq thanks to a pure (including total) programming language with sensible strong reduction semantics.)
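To make the closure cost concrete, here is a minimal sketch of a state monad in OCaml. All names (state, tick, repeat_monadic, and so on) are illustrative assumptions; none of this is taken from Coq, Lwt or Async. Each use of >>= receives a function as its right-hand argument, and when that function captures a variable it is typically allocated as a closure at run time, which the direct-style version avoids.

```ocaml
(* A stateful computation is a function from a state to (result, new state). *)
type ('a, 's) state = 's -> 'a * 's

let return x = fun s -> (x, s)
let bind m f = fun s -> let (x, s') = m s in f x s'
let ( >>= ) = bind
let get = fun s -> (s, s)
let put s' = fun _ -> ((), s')

(* Monadic style: return the counter and increment it.
   [fun () -> return n] captures [n], so it needs a closure. *)
let tick : (int, int) state =
  get >>= fun n ->
  put (n + 1) >>= fun () ->
  return n

(* Direct style: the same behavior without any intermediate closure. *)
let tick_direct n = (n, n + 1)

(* Call [tick] k times, monadically. *)
let rec repeat_monadic k =
  if k = 0 then return ()
  else tick >>= fun _ -> repeat_monadic (k - 1)

(* Call [tick_direct] k times, threading the state by hand. *)
let rec repeat_direct k s =
  if k = 0 then ((), s)
  else
    let (_, s') = tick_direct s in
    repeat_direct (k - 1) s'

let () =
  let ((), final_monadic) = repeat_monadic 1_000 0 in
  let ((), final_direct) = repeat_direct 1_000 0 in
  Printf.printf "monadic: %d, direct: %d\n" final_monadic final_direct
```

Both versions compute the same thing; the difference is only in how much work (closure allocation, indirect calls) the compiler has to generate unless it manages to inline bind.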
Finally, not every programming style and application domain is equal in the face of OCaml compilation choices. Memory allocation, tail calls, exceptions and pattern matching on algebraic datatypes are extremely fast, and code that uses a lot of them (typically: a compiler, a proof assistant, symbolic computations in general) will leave you extremely satisfied. But this sometimes required compromises that have downsides for other language features, and the sophistication invested there was not necessarily reproduced in other parts of the language that were of less interest at the time of the compiler’s inception.
As a concrete example, OCaml does not shine at number crunching: its integers are tagged to make the GC’s life easier, and its (double-precision) floating-point numbers are boxed to keep a uniform value representation. The compiler is well aware of these facts and tries to suppress this overhead when it is easy to do so, but those optimizations are relatively simple and have room for improvement.
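The following sketch illustrates the kind of difference this can make. The function names are assumptions made for the example, and the allocation figures it prints depend on the compiler version, backend and flags, so none are quoted here. The first function keeps its accumulator local, which is the easy case where the native compiler can usually avoid boxing; the second goes through the polymorphic Array.fold_left, where floats typically have to be boxed to cross the generic function call.

```ocaml
(* Imperative sum with a local accumulator. In native code the compiler can
   usually keep [acc] unboxed, since the reference never escapes. *)
let sum_local (a : float array) =
  let acc = ref 0.0 in
  for i = 0 to Array.length a - 1 do
    acc := !acc +. a.(i)
  done;
  !acc

(* The same sum through a polymorphic fold: each intermediate float
   typically gets boxed when passed to or returned from [( +. )]. *)
let sum_fold (a : float array) = Array.fold_left ( +. ) 0.0 a

(* Run [f ()] and return its result together with the number of words
   allocated on the minor heap in the meantime. *)
let allocated_words f =
  let before = Gc.minor_words () in
  let result = Sys.opaque_identity (f ()) in
  let after = Gc.minor_words () in
  (result, after -. before)

let () =
  let a = Array.init 1_000 float_of_int in
  let report name f =
    let (total, words) = allocated_words (fun () -> f a) in
    Printf.printf "%-9s total=%.1f minor words=%.0f\n" name total words
  in
  report "sum_local" sum_local;
  report "sum_fold" sum_fold;
  (* Integer tagging costs one bit: compare the two sizes. *)
  Printf.printf "int size: %d bits (word size: %d bits)\n"
    Sys.int_size Sys.word_size
```

Comparing the two "minor words" figures on your own compiler gives a rough idea of how much boxing the generic version incurs.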
Compiler optimization is a thankless business
Just above I’ve mentioned a few fronts (function inlining, local unboxing or untagging) on which the OCaml compiler could possibly help with more aggressive optimizations. There are other specific areas where compilation choices could be improved. The problem with compiler optimizations is that there is a large disconnect between what we, compiler programmers and language geeks, find satisfying to work on, and the hard reality of actual program performance.
It is easy to spend some months implementing and debugging, say, a state-of-the-art register allocator, only to find out that it makes practically no performance difference on the programs you have at hand. Why then would the maintainer accept a large code change that certainly introduces bugs and doesn’t provably improve on the naive but simple and debugged-by-time current implementation?
If you advertise a compiler change with the argument that “the generated code is much nicer”, you’re doing it wrong. It is surprisingly often the case that while the generated code looks much nicer to the naked eye, it is not actually faster on today’s overly-complicated processors. More importantly, it is generally the case that said nicer code never actually appears in the hot section of your program, and therefore produces no observable performance change. Again, “if it ain’t broke, don’t fix it”: it would be unreasonable to accept a change that cannot demonstrate a real improvement on a real program, while it’s generally easy to craft micro-benchmarks that demonstrate impressive improvements.
I have personally experienced this problem in the work on PR#4800. Currently, OCaml optimizes tupled assignments such as
let (x,y,z) = (foo, bar, baz)
to remove the tuple construction and destruction that would happen per the naive pattern-matching semantics. However this only works if the right-hand-side is syntactically a tuple, and would fail for example on
let (x,y) = if foo then (bar, baz) else (foobar, foobaz)
Alain Frisch proposed a patch to optimize this situation (and some others), to make this multi-binding style more convenient to use. I’ve spent some time reviewing the patch and proposing an alternative implementation, but neither of us has been able to find an actual real-world example where adding this optimization made a significant difference.
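For readers who want to experiment, here is a small sketch of the two situations, plus a hand-written variant that avoids the tuple by pushing the rest of the computation into each branch. The function names are made up for the example, and the hand-written variant is only one possible rewriting, not the transformation implemented in the patch. Whether the middle version actually allocates depends on your compiler version and flags, so no figures are quoted.

```ocaml
(* Right-hand side is syntactically a tuple: the case already optimized. *)
let direct foo bar baz =
  let (x, y, z) = (foo, bar, baz) in
  x + y + z

(* The tuple is hidden behind a conditional: the case discussed above. *)
let branching c bar baz foobar foobaz =
  let (x, y) = if c then (bar, baz) else (foobar, foobaz) in
  x + y

(* Hand-written equivalent with no tuple at all: the body that uses
   [x] and [y] is called from each branch. *)
let branching_manual c bar baz foobar foobaz =
  let body x y = x + y in
  if c then body bar baz else body foobar foobaz

(* Minor-heap words allocated by 1000 calls to [f ()]. *)
let minor_words_of f =
  let before = Gc.minor_words () in
  for _i = 1 to 1_000 do
    ignore (Sys.opaque_identity (f ()))
  done;
  Gc.minor_words () -. before

let () =
  let report name f =
    Printf.printf "%-16s %.0f minor words over 1000 calls\n"
      name (minor_words_of f)
  in
  report "branching"
    (fun () -> branching (Sys.opaque_identity true) 1 2 3 4);
  report "branching_manual"
    (fun () -> branching_manual (Sys.opaque_identity true) 1 2 3 4)
```

A micro-benchmark like this one is exactly the kind of evidence that is easy to produce and hard to generalize, which is the point of the surrounding discussion.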
This is a delicate point because those optimizations are meant to make usable some idioms that were previously avoided. If we don’t observe any difference, maybe it is because OCaml programmers have avoided that idiom so far, but would make convenient use of it if it were cost-free. However, if we can’t demonstrate a real example where this optimization is important, it’s probably not a good idea to include it upstream.
We really need a benchmark suite, and you can help
What we really need is a benchmark suite that would contain representative examples of real-world OCaml code. There is a handful of benchmarks distributed with the compiler but they don’t cover much. Likewise, the programs in the Shootout are too small to be representative of actual uses of OCaml.
This is why we currently use “big software” in OCaml to evaluate performance changes (e.g. bootstrapping the OCaml compiler itself, or compiling Coq). Some big OCaml users have internal performance test suites (for example Pascal Cuoq, who works on Frama-C, contributes performance-improving changes to the OCaml compiler from time to time), but we severely lack an easily-available, informative benchmark suite to evaluate performance changes.
This is a losing situation because optimization-related changes cannot be reliably assessed in the absence of representative code, and specific problem domains that we can’t measure against will not be improved.
The good thing with benchmark needs is that it’s a problem the community at large can efficiently solve in parallel. If everyone contributed a piece of code that comes from real-world OCaml software and is representative of its performance-critical part, we would have a complete benchmark suite by now.
Feel free to contribute by providing, either here in the blog comments or by mail (gabriel dot scherer at inria, or at gmail), your benchmark program(s). Code that has few external dependencies and that comes with a simple interface to parametrize running time is highly preferred. Of course, an open source license is required (say MIT/BSD or LGPL with linking exception).
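As an illustration of what “a simple interface to parametrize running time” could look like, here is a minimal sketch with no external dependencies. Everything in it is an assumption made for the example: the fib workload is only a placeholder for a real performance-critical kernel, and the convention of passing the iteration count as the first command-line argument is just one possible choice.

```ocaml
(* Placeholder workload; a real submission would call code extracted
   from an actual application instead. *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

(* Run [work] [iterations] times and return the CPU time spent, in seconds. *)
let time_iterations ~iterations work =
  let start = Sys.time () in
  for _i = 1 to iterations do
    ignore (Sys.opaque_identity (work ()))
  done;
  Sys.time () -. start

(* Read the iteration count from the first command-line argument,
   falling back to [default] if it is missing or not a positive integer. *)
let iterations_of_argv ?(default = 10) argv =
  if Array.length argv > 1 then
    match int_of_string_opt argv.(1) with
    | Some n when n > 0 -> n
    | _ -> default
  else default

let () =
  let iterations = iterations_of_argv Sys.argv in
  let seconds = time_iterations ~iterations (fun () -> fib 25) in
  Printf.printf "%d iterations: %.3fs\n" iterations seconds
```

With such an interface, whoever runs the suite can scale each benchmark to a running time long enough to be measured reliably, for example by invoking the compiled program with a larger iteration count.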
What is a representative benchmark? A benchmark is representative when performance changes observed in the benchmark reliably reproduce on (some use cases of) the whole software it was extracted from or inspired by. The changes must be comparable in magnitude: if an 80% performance change (improvement or degradation) in the benchmark code results in a 0.5% performance change in the actual software, it’s difficult to rely on benchmark results to make design compromises.
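One way to read these numbers, under a simplifying assumption that is mine rather than a definition: suppose the benchmarked code accounts for a fraction p of the application’s running time and the rest of the application is unaffected. A relative change c in the running time of the benchmarked code then shows up as roughly p × c overall. The sketch below (with hypothetical function names) computes the share of running time implied by the two figures above.

```ocaml
(* Overall relative change in running time when code accounting for
   [fraction] of the total changes by [change] (both between 0 and 1). *)
let overall_change ~fraction ~change = fraction *. change

(* Share of the total running time implied by a change seen in the
   benchmark and the corresponding change seen in the application. *)
let implied_fraction ~benchmark_change ~application_change =
  application_change /. benchmark_change

let () =
  let p = implied_fraction ~benchmark_change:0.80 ~application_change:0.005 in
  Printf.printf "implied share of running time: %.3f%%\n" (100.0 *. p)
```

Under this model, the benchmarked code would represent well under 1% of the application’s running time, which is why such a benchmark says little about the application as a whole.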