Following a discussion that occurred at Gallium, I recently wondered what kind of memory model a GPU exposes (read: my awesome gamer’s GPU).

Most programmers assume what is called a “sequentially consistent” (SC) memory model: that is, parallel programs execute as if instructions from various threads were interleaved. Seasoned programmers will know that the memory model exhibited by modern multi-core processors is not SC, but much more arcane.

Luc, one of Gallium’s researchers, spends a considerable amount of time testing various kinds of processors to unravel what memory model they expose, and whether or not a given processor is correct with respect to its “model”. The industry often provides informal textual documentation of these models that is barely understandable. A better way to document a memory model is to describe the allowed outcomes of so-called litmus tests.

A customary litmus test is the following, where we execute two threads p1 and p2:

| p1      | p2      |
|---------+---------|
| x  <- 1 | y  <- 1 |
| r1 <- y | r2 <- x |

Here, x, y, r1, r2 are initialized to 0. Considering all possible interleavings of this code makes it clear that r1 or r2 (maybe both) must be equal to 1 at the end of the execution of p1 and p2. But, on non-SC architectures, it is also possible to see r1 = r2 = 0 at the end of the execution. (In the nomenclature defined in this paper, this test is called SB, which stands for “store buffering”.)
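Looking ahead a little to the GPU experiment described below, here is a minimal sketch of how the SB test could be expressed as a CUDA kernel. This is an illustration of mine, not the actual test code (which, per the footnote, is available upon request); x, y, r1, r2 are assumed to live in global memory and to be zeroed by the host before the launch:

```
// SB ("store buffering") litmus test, one copy, run by two threads.
// Under sequential consistency, r1 == 0 && r2 == 0 is an impossible
// final state.
__global__ void sb_test(volatile int *x, volatile int *y, int *r1, int *r2)
{
    if (threadIdx.x == 0) {        // plays the role of p1
        *x = 1;
        *r1 = *y;
    } else if (threadIdx.x == 1) { // plays the role of p2
        *y = 1;
        *r2 = *x;
    }
}
```

The volatile qualifiers are only there to keep the compiler from caching x and y in registers, so that the loads and stores actually reach global memory.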

Another common test is the following:

| p1      | p2       |
|---------+----------|
| x  <- 1 | r2  <- y |
| y  <- 1 | r1  <- x |

Here, the experimental situation makes it possible to determine whether write-to-write and read-to-read ordering is preserved. If r2 = 1 and r1 = 0, it means that either the two writes of p1 or the two reads of p2 did not take effect in program order. (In the aforementioned nomenclature, this test is named MP, for “message passing”.)
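In the same hypothetical CUDA style as above (again a sketch of mine, not the actual test code), the MP test could look like this:

```
// MP ("message passing") litmus test: p1 publishes data in x and then
// raises the flag y; p2 reads the flag first and the data second.
// The relaxed outcome is r2 == 1 && r1 == 0: the flag is seen, but not
// the data written before it.
__global__ void mp_test(volatile int *x, volatile int *y, int *r1, int *r2)
{
    if (threadIdx.x == 0) {        // p1
        *x = 1;
        *y = 1;
    } else if (threadIdx.x == 1) { // p2
        *r2 = *y;
        *r1 = *x;
    }
}
```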

There are more complex tests that involve things such as locked instructions, barriers, and so on, but I am not an expert on this, and I secretly hope that Luc will jump in and write a blog post about it. Let me rather come back to the subject of this post: is my GPU sequentially consistent?

I translated the above litmus tests into CUDA kernels, and wrote C programs that execute these tests under various conditions until the relaxed behaviors appear. Then, I compiled these programs using the NVidia nvcc compiler, and ran them on my computer1.

Let’s cut to the chase: as you may have guessed, the relaxed behaviors do appear. The interesting point is that, depending on the values of the experimental parameters we choose, they appear more or less often (and for some values, never). To be more specific, the experimental parameters I am speaking about are the number of copies of the test that are run in parallel, and whether or not two interacting threads are “close”: basically, we execute N copies of the same litmus test at once, which involves 2N threads. These threads are executed in batches of 32 called “warps”, and the behaviors we get seem to depend on whether interacting threads are in the same warp or not2.
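To give an idea of what such an experiment might look like, here is a rough, self-contained sketch of a host program running N copies of the SB test at once. All names, parameter values, and the exact thread-to-test mapping are mine, not the author’s; the `close` flag decides whether the two threads of a copy are adjacent (and hence in the same warp) or end up in different blocks:

```
#include <cstdio>
#include <cuda_runtime.h>

// N copies of the SB test in one kernel launch, involving 2N threads.
__global__ void sb_many(volatile int *x, volatile int *y,
                        int *r1, int *r2, int n, bool close)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int test = close ? tid / 2 : tid % n;   // which copy of the test
    int role = close ? tid % 2 : tid / n;   // p1 (0) or p2 (1) in that copy
    if (test >= n || role > 1) return;

    if (role == 0) { x[test] = 1; r1[test] = y[test]; }   // p1
    else           { y[test] = 1; r2[test] = x[test]; }   // p2
}

int main()
{
    const int n = 1024;                       // number of test copies
    const size_t bytes = n * sizeof(int);
    int *x, *y, *r1, *r2;                     // device buffers
    cudaMalloc(&x, bytes);  cudaMalloc(&y, bytes);
    cudaMalloc(&r1, bytes); cudaMalloc(&r2, bytes);

    int h_r1[n], h_r2[n];
    int relaxed = 0;
    for (int run = 0; run < 1000; run++) {
        cudaMemset(x, 0, bytes);
        cudaMemset(y, 0, bytes);
        sb_many<<<(2 * n + 255) / 256, 256>>>(x, y, r1, r2, n, true);
        cudaMemcpy(h_r1, r1, bytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(h_r2, r2, bytes, cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; i++)
            if (h_r1[i] == 0 && h_r2[i] == 0)  // forbidden under SC
                relaxed++;
    }
    printf("relaxed outcomes observed: %d\n", relaxed);

    cudaFree(x); cudaFree(y); cudaFree(r1); cudaFree(r2);
    return 0;
}
```

With the mapping above, `close = true` places the two threads of a copy at consecutive indices, so they share a warp; `close = false` puts all the p1 threads in the first half of the grid and all the p2 threads in the second half, so each pair is split across blocks.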

Of course, if I had carefully read the NVidia documentation (PTX ISA 3.1, Section 5.1.4) beforehand, I would have found a less-than-crystal-clear explanation like the following one:

Global memory is not sequentially consistent. Consider the case where one thread executes the following two assignments:

a = a + 1;
b = b - 1;

If another thread sees the variable b change, the store operation updating a may still be in flight. This reiterates the kind of parallelism available in machines that run PTX.

But this would have made for a less dramatic blog post… Moreover, I hope that it helps convey the idea that memory models are a “formal” thing that can be studied experimentally (as Luc does), but can also be formalized in a proof assistant (as Jade does).

I think it would be interesting to extend to GPUs the kind of attention that has been given to the memory models of CPUs, both in terms of testing and of formalization (of the semantics of NVidia’s PTX ISA). Of course, I am just being curious here, since it is not my field of research. But if you are looking for a research internship and this interests you, you should definitely contact Luc ;).


  1. These tests are available upon request.

  2. It is likely that this explanation is incomplete or even incorrect, but it conveys the idea of what shows up in the tests.