Gagallium : Fixing stack overflows using continuations

Just a few weeks ago, Andrej Bauer wrote two articles ( 1 2 ) about implementing a dependent type theory in Ocaml. In particular, he demonstrates a cool idea to normalize expressions, called « normalization by evaluation ».

However, when trying to evaluate 2²⁰, the program overflows the stack. It turns out that there is an easy and methodic way to locate and fix such problems, which we will discuss in that post.

Edit: there is now a third article published in the series, but the code presented here is still based on the Part II branch of Andrej’s git repository.

First, let’s examine a similar, working example. From Andrej’s test, we can compute 2¹⁶:

$ make native && echo "Eval power two (power two four)." | ./tt.native -l church.tt
ocamlbuild -lib unix -use-menhir tt.native
Finished, 38 targets (38 cached) in 00:00:00.
tt master
[Type Ctrl-D to exit or "Help." for help.]
#     = fun A : Type 0 =>
      fun x : A -> A =>
      fun x0 : A =>
      x
      (x
       (x
        (x
         (x
          (x
           (x
            (x
             (x
              (x
               (x
                (x
                 (x
                  (x
                   (x
                    (x
                     (x
                      (x
                       (x
                        (x
                         (x
                          (x
                           (x
                            (x
                             (x
                              (x
                               (x (x (x (x (x (x (x (x (x (... (...))))))))))))))))))))))))))))))))))))
    : forall A : Type 0, (A -> A) -> A -> A

This is just the regular Church-encoding of a huge number: you can think of A as the natural numbers type, of x as the successor constructor, and x0 as zero. Notice how the result is a very deeply-nested sequence of applications. The shape being pretty-printed here relates closely to the manipulated datatype, and as we will see, it is the source of the overflow.

Let us now modify the Makefile to add debug information to get a proper backtrace:

 native:
-  ocamlbuild -lib unix -use-menhir tt.native
+  ocamlbuild -tag annot -tag debug -lib unix -use-menhir tt.native

Rebuilding and running it on 2²⁰ (setting OCAMLRUNPARAM=“b” to print the backtrace when the stack overflows):

$ make native && echo "Eval power two (times four five)." | OCAMLRUNPARAM="b" ./tt.native -l church.tt
ocamlbuild -tag annot -tag debug -lib unix -use-menhir tt.native
Finished, 38 targets (15 cached) in 00:00:00.
tt master
[Type Ctrl-D to exit or "Help." for help.]
# Fatal error: exception Stack_overflow
Raised by primitive operation at file "value.ml", line 74, characters 33-43
Called from file "value.ml", line 74, characters 33-43
Called from file "value.ml", line 90, characters 47-54
Called from file "value.ml", line 74, characters 33-43
Called from file "value.ml", line 90, characters 47-54
Called from file "value.ml", line 74, characters 33-43
Called from file "value.ml", line 90, characters 47-54
Called from file "value.ml", line 74, characters 33-43
[...]

The problem here is that the code tries to reify the result whose body contains the deep nesting of application nodes. When recursively processing a term, the call stack is proportional to the “height” of the term as a tree; our result here is a tree that is deeply skewed on the right, with height around 2²⁰. Location information tells us which recursive processing step fails (two mutually recursive calls), and the guilty function here is the following:

and reify_abstraction (x, t, f) =
  let x = Syntax.refresh x in
  (x, reify t, reify (f (Neutral (Var x))))

It appears that the call to f (Neutral (Var x)) blows the stack. To solve the problem (or at least this part of the problem), we must somehow put this call in tail-call position; it is currently surrounded by some post-treatment that builds the (x, reify t, ...) triple from the result of the call.

The idea of using continuation-passing-style (CPS) is to bluntly put the call in tail-call position, and pass it the “post-treatment” to be done as a function (continuation) called by f itself when it produces its result. “Don’t call us, we’ll call you”: this is an inversion of control.

and reify_abstraction (x, t, f) =
  let x = Syntax.refresh x in
  f (Neutral (Var x)) (fun body -> (x, reify t, reify body))

For typing reason (the reification is polymorphic), this simple style doesn’t quite work here, we must explicitly represent the continuation value with a first-class polymorphic type (as a record field);

and reify_abstraction (x, t, f) =
  let x = Syntax.refresh x in
  (f (Neutral (Var x))).call
    (fun body -> (x, reify t, reify body))

The type abstraction accepted by the function needs to be changed from:

and abstraction = variable * value * (value -> value)

into:

and abstraction =
  variable * value * (value -> value continuation)

and 'a continuation = { call : 'b . ('a -> 'b) -> 'b }

We have to fix some parts of the program to adapt to that change. When we before used an abstraction body directly as a function, f x we now need to pass it a continuation. In reify_abstraction we were carefull to pass just the right continuation to have a tail-call, but the other uses are not critical for our input, so we can just pass them the identity continuation, that immediately returns the result.

For example in equal_abstraction,

 and equal_abstraction (x, v1, f1) (_, v2, f2) =
   equal v1 v2 &&
    (let x = Neutral (Var (Syntax.refresh x)) in
      equal (f1 x) (f2 x))

The last line becomes:

      equal ((f1 x).call id) ((f2 x).call id))

The three other changes can be seen in the corresponding commit.

Let’s put this to test:

$ make native && echo "Eval power two (times four five)." | OCAMLRUNPARAM="b" ./tt.native -l church.tt
ocamlbuild -tag annot -tag debug -lib unix -use-menhir tt.native
Finished, 38 targets (30 cached) in 00:00:00.
tt master
[Type Ctrl-D to exit or "Help." for help.]
# Fatal error: exception Stack_overflow
Raised at file "list.ml", line 157, characters 19-30
Called from file "value.ml", line 78, characters 33-43
Called from file "value.ml", line 95, characters 47-54
Called from file "value.ml", line 78, characters 33-43
Called from file "value.ml", line 95, characters 47-54
Called from file "value.ml", line 78, characters 33-43
Called from file "value.ml", line 95, characters 47-54
Called from file "value.ml", line 78, characters 33-43
[...]

Not there yet: we have made one particular part of the program tail-recursive, but we still need to fix the rest. The next problem arises in the treatment of the right-hand side (RHS) of application:

and reify_neutral' = function
  | Var x -> Syntax.Var x
  | App (n, v) -> Syntax.App (reify_neutral n, reify v)

As the problem on our particular input comes from the right arguments of application nodes, the call we would like to make tail-recursive here is reify v.

This part of the program is less complicated from a typing point of view, so we don’t need the explicit record wrapping for polymorphism. However, the fix is larger as we we are in the middle of several mutually recursive functions (reify, reify', reify_neutral', reify_neutral) that all need to be made CPS somehow.

Fortunately, this can be done step-by-step. First turn this reify v call in a tail-call to a yet-to-be-defined reify_cont function:

| App (n, v) ->
    reify_cont v (fun v -> Syntax.App (reify_neutral n, v))

The simplest way to define reify_cont v ret (where ret is the name of the continuation, of which we can think of as a “return”) is the following:

and reify_cont v ret = ret (reify v)

Of course, this definition doesn’t solve anything, as it’s no more tail-recursive than the initial version. But we can iteratively refine this definition to expand the continuation-passing-style.

First, inline the definition of reify:

and reify_cont v ret =
  ret (Syntax.nowhere (reify' v))

Then turn this into a tail-call to a new reify'_cont function, naively defined. Note how the continuation passed to reify'_cont first applies the post-treatment of the reify function, then call the passed-in continuation ret.

and reify_cont v ret =
  reify'_cont v (fun v -> ret (Syntax.nowhere v))

and reify'_cont v ret = ret (reify' v)

It should now be routine: inline the call, then specialize the definition to expose a tail-call.

and reify'_cont v ret = match v with
  | Universe _ | Pi _ | Lambda _ -> ret (reify' v)
  | Neutral n -> reify_neutral'_cont n ret

Note that we have selectively handled the Neutral case, which is where the stack overflow happens and needs to be made tail-rec, while all the other cases reuse the non-CPS reify' function: adding CPS versions of our code duplicate some logic, but only in the code paths that need to be stack-preserving.

The definition of reify_neutral'_cont again follows the same principle, but chains together the CPS versions of reify and reify_neutral:

and reify_neutral'_cont n ret = match n with
  | Var x -> ret (Syntax.Var x)
  | App (n, v) ->
    reify_cont v (fun v ->
      reify_neutral_cont n (fun n ->
        ret (Syntax.App (n, v))))

and reify_neutral_cont n ret =
  reify_neutral'_cont n (fun n -> ret (Syntax.nowhere n))

This closes the loop: we have made a _cont version of all the functions involved in the mutual recursion that overflows the stack. And they were derived from their non-CPS definition in an simple and methodic way.

Of course, this is still not enough:

$ make native && echo "Eval power two (times four five)." | OCAMLRUNPARAM="b" ./tt.native -l church.tt
ocamlbuild -tag annot -tag debug -lib unix -use-menhir tt.native
Finished, 38 targets (30 cached) in 00:00:00.
tt master
[Type Ctrl-D to exit or "Help." for help.]
#     = Fatal error: exception Stack_overflow
Raised at file "pervasives.ml", line 149, characters 10-33
Called from file "beautify.ml", line 64, characters 14-41
Called from file "beautify.ml", line 53, characters 27-56
Called from file "beautify.ml", line 64, characters 14-41
Called from file "beautify.ml", line 53, characters 27-56
[...]

This time, we manage to reify the result, but encounter a new overflow in the part of the code that prepares the result for pretty-printing, more specifically and as expected, in the handling of the RHS of applications.

The same transformation method into CPS applies here again, as well as one step further in the pretty-printer code. Fixing each part of the code in this methodic way, if a bit repetitive, is rather quick.

The commits associated to the fixes described in this post can be found on Gabriel Scherer’s github repository.

Once this is done, we finally get a result (whose pretty-printing is not that interesting):

$ make native && echo "Eval power two (times four five)." | OCAMLRUNPARAM="b" ./tt.native -l church.tt                                                                                                       ocamlbuild -tag annot -tag debug -lib unix -use-menhir tt.native
Finished, 38 targets (28 cached) in 00:00:00.
tt master
[Type Ctrl-D to exit or "Help." for help.]
#     = fun A : Type 0 =>
      fun x : A -> A =>
      fun x0 : A =>
      x
      (x
       (x
        (x
         (x
          (x
           (x
            (x
             (x
              (x
               (x
                (x
                 (x
                  (x
                   (x
                    (x
                     (x
                      (x
                       (x
                        (x
                         (x
                          (x
                           (x
                            (x
                             (x
                              (x
                               (x (x (x (x (x (x (x (x (x (... (...))))))))))))))))))))))))))))))))))))
    : forall A : Type 0, (A -> A) -> A -> A
# %