Gagallium : Bytes and pointers in CompCert

Let me start this post by briefly introducing myself: I am a PhD student at the Radboud University Nijmegen, the Netherlands, working on developing a formal semantics for the C11 standard in Coq. Between January and March, I have been a guest at Gallium (many thanks to Xavier Leroy for inviting me!). During this stay I have been able to do some CompCert hacking, and to get more experience with CompCert and its semantics.

Under-specification in C

Every C compiler has some freedom on how to implement the C programming language as certain constructs are under-specified by the C standard. The standard uses the following notions of under-specification:

Unspecified behavior. Constructs for which the standard provides two or more possibilities. For example: order of evaluation in expressions. The behavior may vary for each use of the construct.
Implementation defined behavior. Similar to unspecified behavior, but the compiler has to document its choice. For example: size of integer types and endianness.
Undefined behavior. Constructs for which the standard imposes no requirements at all. For example: dereferencing a NULL-pointer and signed integer overflow.

Under-specification is used extensively to make C portable, and to allow compilers to generate faster code. For example, when dereferencing a pointer, no code has to be generated to check whether the pointer is valid. If the pointer is invalid (it is NULL or a dangling pointer), the program may do something arbitrary instead of having to exit with a nice error message.

Of course, CompCert also makes use of under-specification. Firstly, it makes specific choices for certain constructs. For example, it chooses a specific representation of integers and uses specific sizes for the various integer types. Secondly, it makes some undefined behavior defined, for example, it lets signed integer overflow wrap around. Remark that for many other constructs, it really imposes undefined behavior. For example, dereferencing an invalid pointer is undefined behavior by the CompCert semantics.

To verify properties of programs that are being compiled by CompCert, one can make explicit use of the behaviors that are defined by CompCert but not by the C standard. However, as I wish to use my own C semantics to verify programs compiled by any C compiler, it should capture the behavior of any C compiler that implements the C standard. Therefore, I have to take all under-specification seriously, whereas CompCert may (and even has to) make specific choices. See for example this post by John Regehr for some examples of crazy behavior by actual compilers due to undefined behavior, to get an idea of why taking undefined behavior is necessary.

For compilers like gcc and clang, I will of course not be able to give any formal guarantees that a correctness proof with respect to my semantics actually ensures correctness when compiled. After all, these compilers do not have a formal semantics. I could only argue that my operational semantics makes more things undefined than the C standard, and assuming these compilers “implement the C standard”, correctness should morally follow.

As a more formal means of validation of my semantics I intend to prove that CompCert is an instance of my semantics in future work. In order for this correspondence to become provable, this post will be about making more behavior defined in CompCert. I will discuss two patches that make previously undefined behaviors in the CompCert semantics defined. These behaviors should be defined according to the C standard. A next post will be about making more behavior undefined (sequence point violations).

Pointers in CompCert

CompCert defines its memory as finite map of blocks, each block consisting of an array of bytes. Pointers are pairs (b,ofs) where b identifies the block, and ofs the offset into that block.

The C standard’s way of dealing with pointer equality is a bit nasty. Let us take a look at the following excerpt from the C11 standard (6.5.9p5).

Two pointers compare equal if and only if […] or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.

These pointers one past the end of a block are somewhat strange, as they cannot be dereferenced. Nonetheless, their use is common programming practice when looping through arrays. For example:

void map(int *p, int n, int f(int)) {
  int *end = p + n;
  while (p < end) {
    *p = f(*p);
    p++;
  }
}

Unfortunately, they can also be abused:

int main() {
  int x, y;
  if (&x + 1 == &y)
    printf("x and y are allocated adjacently\n");
}

Here, the printf is executed only if x and y are allocated adjacently, which may actually happen as many compilers allocate x and y consecutively on the stack.

In the semantics of CompCert C, x and y will be given disjoint block identifiers, and the pointer representations of &x + 1 and &y will thus be unequal. This inequality is not preserved by the compiler, however. During stack allocation, the blocks of x and y will be combined, and the representation of &x + 1 may become equal to &y. Hence, to ensure that pointer comparisons are preserved under compilation, CompCert used to make comparison of two pointers defined if and only if both are valid. Here, a pointer is valid if its offset is strictly within the block bounds.

In my patch for CompCert, this restriction is slightly weakened.

Comparison of two pointers in the same block is defined if and only if both are weakly valid. A pointer is weakly valid if it is valid, or “one past the end of” its block.
Comparison of two pointers with different blocks is defined if and only if both are valid.

This modification allows common programming practice when looping through arrays, but weird comparisons, as the one in the example above, remain undefined. This is necessary to ensure that pointer comparisons are preserved under compilation. I therefore think the above definition of pointer comparisons is more sensible than the one in the C standard (notice that the C standard already makes a distinction between pointers in the same block and pointers in different blocks, for pointer inequalities < and <=.)

To adapt the compiler correctness proofs I had to show that all compilation passes preserve weak validity of pointers and preserve the new definition of pointer comparisons. To make this happen, I had to modify the definition of memory injections accordingly.

Bytes in CompCert

Whereas CompCert uses sequences of bits to represent bytes constituting integer or floating point values, bytes constituting fragments of pointers are treated symbolically. These bytes are of the shape “fragment i of pointer (b,ofs)”. So, when storing a pointer (b,ofs) in memory, 4 bytes “fragment 0 of pointer (b,ofs)”, …, “fragment 3 of pointer (b,ofs)” are written.

Beforehand, it was only possible to read a sequence of such bytes as a pointer value, whereas single pointer fragments could not be read as a value of type unsigned char. Trying to do so resulted in undefined behavior. As a result, it was impossible to implement a used-defined version of memcpy in CompCert C that works for data structures containing pointers.

I created an extension that gives a semantics to byte-wise reading and writing of pointer fragments. This is achieved by extending the data type of values with an additional construct Vptrseg for a pointer fragments.

Inductive val: Type :=
  | Vundef: val
  | Vint: int -> val
  | Vfloat: float -> val
  | Vptr: block -> int -> val
  | Vptrseg: block -> int -> nat -> val.

The tricky part is how to deal with arithmetical operations and casts on pointer fragment values. In my patch I made all arithmetical operations (addition, shifts, …) on pointer fragments undefined.

Casts are slightly more involved as they are implicitly performed before an assignment. Hence, to enable writing of pointer fragments, casts from character type to character type should be made defined on pointer fragments. Since CompCert C is untyped, there is no guarantee that the result of the operand of a cast from type A to B is indeed a valid value of A. However, as the compiler needs to know that the result of such a cast is indeed a valid value of B, casts from character to character type really have to do something. Previously, it always performed a zero or sign extension.

For the case of pointer fragments, no definition of zero and sign extension can be given (with reasonable properties). Hence, in order to overcome this problem, the semantics of a cast in my patch checks whether the operand of a character to character cast is indeed an allowed value. If not, it results in undefined behavior. This has the desired result that character to character casts before assignments can be removed in a later compilation phase and byte-wise reading and writing of pointer fragments is made possible.

If CompCert C would get a type system in future versions that ensures that the result of an operand of a cast is indeed a valid value of the given type, this trick can be removed.

Example

Now that these two behaviors are defined, my extension of CompCert gives a semantics to the following implementation of memcpy.

void my_memcpy (void *p, void *q, int n) {
  // r may be one past
  unsigned char *r = (unsigned char*)p + n;
  while (p < r) {
    *((unsigned char*)p++) = *((unsigned char*)q++);
  }
}

int main() {
  struct S { short x; short *p; } s = { 10, &s.x }, s2;
  my_memcpy (&s2, &s, sizeof(struct S));
  return *s2.p;
}

Sources

The recently released CompCert 1.13 contains my patch for pointers one past the end of a block. The sources of my other extensions can be found at my github repository.