Gagallium : How I found a bug in Intel Skylake processors

Instructors of “Introduction to programming” courses know that students are willing to blame the failures of their programs on anything. Sorting routine discards half of the data? “That might be a Windows virus!” Binary search always fails? “The Java compiler is acting funny today!” More experienced programmers know very well that the bug is generally in their code: occasionally in third-party libraries; very rarely in system libraries; exceedingly rarely in the compiler; and never in the processor. That’s what I thought too, until recently. Here is how I ran into a bug in Intel Skylake processors while trying to debug mysterious OCaml failures.

The first sighting

Late April 2016, shortly after OCaml 4.03.0 was released, a Serious Industrial OCaml User (SIOU) contacted me privately with bad news: one of their applications, written in OCaml and compiled with OCaml 4.03.0, was crashing randomly. Not at every run, but once in a while it would segfault, at different places within the code. Moreover, the crashes were only observed on their most recent computers, those running Intel Skylake processors. (Skylake is the nickname for what was the latest generation of Intel processors at the time. The latest generation at the time of this writing is nicknamed Kaby Lake.)

Many OCaml bugs have been reported to me in the last 25 years, but this report was particularly troubling. Why Skylake processors only? Indeed, I couldn’t reproduce the crashes using SIOU’s binary on my computers at Inria, which were all running older Intel processors. Why the lack of reproducibility? SIOU’s application was single-threaded and made no network I/O, only file I/O, so its execution should have been perfectly deterministic, and whatever bug caused the segfault should cause it at every run and at the same place in the code.

My first guess was flaky hardware at SIOU: a bad memory chip? overheating? Speaking from personal experience, those things happen and can result in a computer that boots and runs a GUI just fine, then crashes under load. So, I suggested SIOU to run a memory test, underclock their processor, and disable hyperthreading (HT) while they were at it. The HT suggestion was inspired by an earlier report of a Skylake bug involving AVX vector arithmetic, which would show up only with HT enabled (see description).

SIOU didn’t take my suggestions well, arguing (correctly) that they were running other CPU- and memory-intensive tests on their Skylake machines and only the ones written in OCaml would crash. Clearly, they thought their hardware was perfect and the bug was in my software. Great. I still managed to cajole them into running a memory test, which came back clean, but my suggestion about turning HT off was ignored. (Too bad, because this would have saved us much time.)

In parallel, SIOU was conducting an impressive investigation, varying the version of OCaml, the C compiler used to compile OCaml’s runtime system, and the operating system. The verdict came as follows. OCaml: 4.03, including early betas, but not 4.02.3. C compiler: GCC, but not Clang. OS: Linux and Windows, but not MacOS. Since MacOS uses Clang and they used a GCC-based Windows port, the finger was being firmly pointed to OCaml 4.03 and GCC.

Surely, SIOU reasoned, in the OCaml 4.03 runtime system, there is a piece of bad C code – an undefined behavior as we say in the business – causing GCC to generate machine code that crashes, as C compilers are allowed to do in the presence of undefined behaviors. That would not be the first time that GCC treats undefined behaviors in the least possibly helpful way, see for instance this security hole and this broken benchmark.

The explanation above was plausible but still failed to account for the random nature of crashes. When GCC generates bizarre code based on an undefined behavior, it still generates deterministic code. The only source of randomness I could think of is Address Space Layout Randomization (ASLR), an OS feature that causes absolute memory addresses to change from run to run. The OCaml runtime system uses absolute addresses in some places, e.g. to index into a hash table of memory pages. However, the crashes remained random after turning ASLR off, in particular when running under the GDB debugger.

We were now in early May 2016, and it was my turn to get my hands dirty, as SIOU subtly hinted by giving me a shell account on their famous Skylake machine. My first attempt was to build a debug version of OCaml 4.03 (to which I planned to add even more debugging instrumentation later) and rebuild SIOU’s application with this version of OCaml. Unfortunately this debug version would not trigger the crash. Instead, I worked from the executable provided by SIOU, first interactively under GDB (but it nearly drove me crazy, as I had to wait sometimes one hour to trigger the crash again), then using a little OCaml script that ran the program 1000 times and saved the core dumps produced at every crash.

Debugging the OCaml runtime system is no fun, but post-mortem debugging from core dumps is atrocious. Analysis of 30 core dumps showed the segfaults to occur in 7 different places, two within the OCaml GC and 5 within the application. The most popular place, with 50% of the crashes, was the mark_slice function from OCaml’s GC. In all cases, the OCaml heap was corrupted: a well-formed data structure contains a bad pointer, i.e. a pointer that doesn’t point to the first field of a Caml block but instead points to the header or inside the middle of a Caml block, or even to invalid memory (already freed). The 15 crashes in mark_slice were all caused by a pointer two words ahead in a block of size 4.

All those symptoms were consistent with familiar mistakes such as the ocamlopt compiler forgetting to register a memory root with the GC. However, those mistakes would cause reproducible crashes, depending only on the allocation and GC patterns. I completely failed to see what kind of memory management bug in OCaml could cause random crashes!

By lack of a better idea, I then listened again to the voice at the back of my head that was whispering “hardware bug!”. I had a vague impression that the crashes happened more frequently the more the machine was loaded, as would be the case if it were just an overheating issue. To test this theory, I modified my OCaml script to run N copies of SIOU’s program in parallel. For some runs I also disabled the OCaml memory compactor, resulting in a bigger memory footprint and more GC activity. The results were not what I expected but striking nonetheless:

N	system load	w/default options	w/compactor turned off
1	3+epsilon	0 failures	0 failures
2	4+epsilon	1 failure	3 failures
4	6+epsilon	12 failures	19 failures
8	10+epsilon	17 failures	23 failures
16	18+epsilon	16 failures

The number of failures given above is for 1000 runs of the test program. Notice the jump between N = 2 and N = 4 ? And the plateau for higher values of N ? To explain those numbers, I need to give more information on the test Skylake machine. It has 4 physical cores and 8 logical cores, since HT is enabled. Two of the cores were busy with two long-running tests (not mine) in the background, but otherwise the machine was not doing much, hence the system load was 2 + N + epsilon, where N is the number of tests I ran in parallel.

When there are no more than 4 active processes at the same time, the OS scheduler spreads them evenly between the 4 physical cores of the machine, and tries hard not to schedule two processes on the two logical cores of the same physical core, because that would result in underutilization of the resources of the other physical cores. This is the case here for N = 1 and also, most of the time, for N = 2. When the number of active processes grows above 4, the OS starts taking advantage of HT by scheduling processes to the two logical cores of the same physical core. This is the case for N = 4 here. It’s only when all 8 logical cores of the machine are busy that the OS performs traditional time-sharing between processes. This is the case for N = 8 and N = 16 in our experiment.

It was now evident that the crashes happened only when hyperthreading kicked in, or more precisely when the OCaml program was running along another hyperthread (logical core) on the same physical core of the processor.

I wrote SIOU back with a summary of my findings, imploring them to entertain my theory that it all has to do with hyperthreading. This time they listened and turned hyperthreading off on their machine. Then, the crashes were gone for good: two days of testing in a loop showed no issues whatsoever.

Problem solved? Yes! Happy ending? Not yet. Neither I nor SIOU tried to report this issue to Intel or others: SIOU because they were satisfied with the workaround consisting in compiling OCaml with Clang, and because they did not want any publicity of the “SIOU’s products crash randomly!” kind; I because I was tired of this problem, didn’t know how to report those things (Intel doesn’t have a public issue tracker like the rest of us), and suspected it was a problem with the specific machines at SIOU (e.g. a batch of flaky chips that got put in the wrong speed bin by accident).

The second sighting

The year 2016 went by without anyone else reporting that the sky (or more exactly the Skylake) was falling with OCaml 4.03, so I gladly forgot about this little episode at SIOU (and went on making horrible puns).

Then, on January 6th 2017, Enguerrand Decorne and Joris Giovannangeli at Ahrefs (another Serious Industrial OCaml User, member of the Caml Consortium to boot) report mysterious random crashes with OCaml 4.03.0: this is PR#7452 on the Caml bug tracker.

In the repro case they provided, it’s the ocamlopt.opt compiler itself that sometimes crashes or produces nonsensical output while compiling a large source file. This is not particularly surprising since ocamlopt.opt is itself an OCaml program compiled with the ocamlopt.byte compiler, but mades it easier to discuss and reproduce the issue.

The public comments on PR#7452 show rather well what happened next, and the Ahrefs people wrote a detailed story of their bug hunt as a blog post. So, I’ll only highlight the turning points of the story.

Twelve hours after opening the PR, and already 19 comments into the discussion, Enguerrand Decorne reports that “every machine on which we were able to reproduce the issue was running a CPU of the Intel Skylake family”.
The day after, I mention the 2016 random crash at SIOU and suggest to disable hyperthreading.
The day after, Joris Giovannangeli confirms that the crash cannot be reproduced when hyperthreading is disabled.
In parallel, Joris discovers that the crash happens only if the OCaml runtime system is built with gcc -O2, but not with gcc -O1. In retrospect, this explains the absence of crashes with the debug OCaml runtime and with OCaml 4.02, as both are built with gcc -O1 by default.
I go out on a limb and post the following comment:

Is it crazy to imagine that gcc -O2 on the OCaml 4.03 runtime produces a specific instruction sequence that causes hardware issues in (some steppings of) Skylake processors with hyperthreading? Perhaps it is crazy. On the other hand, there was already one documented hardware issue with hyperthreading and Skylake (link)

Mark Shinwell contacts some colleagues at Intel and manages to push a report through Intel customer support.

Then, nothing happened for 5 months, until…

The revelation

On May 26th 2017, user “ygrek” posts a link to the following Changelog entry from the Debian “microcode” package:

* New upstream microcode datafile 20170511 [...]
* Likely fix nightmare-level Skylake erratum SKL150.  Fortunately,
either this erratum is very-low-hitting, or gcc/clang/icc/msvc
won't usually issue the affected opcode pattern and it ends up
being rare.
SKL150 - Short loops using both the AH/BH/CH/DH registers and
the corresponding wide register *may* result in unpredictable
system behavior.  Requires both logical processors of the same
core (i.e. sibling hyperthreads) to be active to trigger, as
well as a "complex set of micro-architectural conditions"

SKL150 was documented by Intel in April 2017 and is described on page 65 of 6th Generation Intel® Processor Family - Specification Update. Similar errata go under the names SKW144, SKX150, SKZ7 for variants of the Skylake architecture, and KBL095, KBW095 for the newer Kaby Lake architecture. “Nightmare-level” is not part of the Intel description but sounds about right.

Despite the rather vague description (“complex set of micro-architectural conditions”, you don’t say!), this erratum rings a bell: hyperthreading required? check! triggers pseudo-randomly? check! does not involve floating-point nor vector instructions? check! Plus, a microcode update that works around this erratum is available, nicely packaged by Debian, and ready to apply to our test machines. A few hours later, Joris Giovannangeli confirms that the crash is gone after upgrading the microcode. I run more tests on my shiny new Skylake-based workstation (courtesy of Inria’s procurement) and come to the same conclusion, since a test that crashes in less than 10 minutes with the old microcode runs 2.5 days without problems with the updated microcode.

Another reason to believe that SKL150 is the culprit is that the problematic code pattern outlined in this erratum is generated by GCC when compiling the OCaml run-time system. For example, in byterun/major_gc.c, function sweep_slice, we have C code like this:

hd = Hd_hp (hp);
/*...*/
Hd_hp (hp) = Whitehd_hd (hd);

After macro-expansion, this becomes:

hd = *hp;
/*...*/
*hp = hd & ~0x300;

Clang compile this code the obvious way, using only full-width registers:

movq    (%rbx), %rax
[...]
andq    $-769, %rax             # imm = 0xFFFFFFFFFFFFFCFF
movq    %rax, (%rbx)

However, gcc prefers to use the %ah 8-bit register to operate upon bits 8 to 15 of the full register %rax, leaving the other bits unchanged:

movq    (%rdi), %rax
[...]
andb    $252, %ah
movq    %rax, (%rdi)

The two codes are functionally equivalent. One possible reason for GCC’s choice of code is that it is more compact: the 8-bit constant $252 fits in 1 byte of code, while the 32-bit-extended-to-64-bit constant $-769 needs 4 bytes of code. At any rate, the code generated by GCC does use both %rax and %ah, and, depending on optimization level and bad luck, such code could end up in a loop small enough to trigger the SKL150 bug.

So, in the end, it was a hardware bug. Told you so!

Epilogue

Intel released microcode updates for Skylake and Kaby Lake processors that fix or work around the issue. Debian has detailed instructions to check whether your Intel processor is affected and how to obtain and apply the microcode updates.

The timing for the publication of the bug and the release of the microcode updates was just right, because several projects written in OCaml were starting to observe mysterious random crashes, for example Lwt, Coq, and Coccinelle.

The hardware bug is making the rounds in the technical Web sites, see for example Ars Technica, HotHardware, Tom’s Hardware, and Hacker’s News.