Gagallium : The story of a stack overflow during Coq extraction

I routinely extract my Coq development into an executable I can test and benchmark. However, on one fateful afternoon, the extraction process failed with a stack overflow. I had to dive deep into the code. This is the fable of how things went right and wrong, moral of the story included.

While I was working on the new version of my alias analysis proof, I got back to a point where I could extract the code, having admitted some proofs for later. I did so to make sure I had the last version of the executable, and ran the usual make command, but that’s when things went wrong. Here is what I saw:

$ make all -j4
[...]
coqtop -I lib -I common -I ia32/standard -I ia32 -I backend -I cfrontend -I driver -batch -load-vernac-source extraction/extraction.v
[...]
Stack overflow.

Now, that is not quite the enlightening error message. I had almost no way of finding out more about this without hacking into Coq. So I fired up my favorite editor and tried to turn this:

(* coq8.3pl4toplevelcerrrorsml *)
| Stack_overflow ->
    hov 0 (str "Stack overflow")

into the hopefully more helpful:

  | Stack_overflow ->
      hov 0 (str (Printexcget_backtrace ()))

I also quickly hacked the Makefile, adding OCAMLRUNPARAM=b, -byte, -debug and adding verbosity to the Vernacular load command:

COQEXEC=OCAMLRUNPARAM=b coqtop $(INCLUDES) -byte -debug -batch -load-vernac-source-verbose

Re-compiled Coq, re-ran the extraction, to get to the following result:

Raised at file "format.ml", line 194, characters 15-26
Called from file "format.ml", line 420, characters 8-33
Called from file "format.ml", line 435, characters 6-24

Meh. That is not quite interesting. format.ml is OCaml’s pretty-printing module, which unfortunately uses exceptions for its control-flow. It seems that we’ve run in the infamous problem of the overwritten stack trace. Luckily, having added the debug flag resulted in an additional line appearing in the output:

Vernac Interpreter Executing command
Raised at file "format.ml", line 194, characters 15-26
[...]

Searching for this led me into:

(* coq8.3pl4toplevelvernacinterpml *)
let call (opn,converted_args) =
  let loc = ref "Looking up command" in
  try
    let callback = vinterp_map opn in
    loc:= "Checking arguments";
    let hunk = callback converted_args in
    loc:= "Executing command";
    hunk()
  with
    | Drop -> raise Drop
    | e ->
        if !Flagsdebug then
          msgnl (str"Vernac Interpreter " ++ str !loc);
        raise e

Haha! Looks like my exception went through that catch-all handler and has been re-raised after printing that line. This might well be the line that called into format.ml. Anyway, here’s my chance to print the stack trace before it gets overwritten:

    | e ->
        Printexcprint_backtrace stdout;
        if !Flagsdebug then
          msgnl (str"Vernac Interpreter " ++ str !loc);
        raise e

Let’s try that:

Raised by primitive operation at file "hashtbl.ml", line 58, characters 10-28
Called from file "hashtbl.ml", line 58, characters 10-28
Called from file "hashtbl.ml", line 58, characters 10-28
Called from file "hashtbl.ml", line 58, characters 10-28
[... around 1000 such lines]

Now this was more interesting! Back into OCaml’s hashtbl.ml around this line:

        (* ocaml3.12.1stdlibhashtblml *)
(*49*)  let resize hashfun tbl =
(*50*)    let odata = tbldata in
(*51*)    let osize = Arraylength odata in
(*52*)    let nsize = min (2 * osize + 1) Sysmax_array_length in
(*53*)    if nsize <> osize then begin
(*54*)      let ndata = Arraycreate nsize Empty in
(*55*)      let rec insert_bucket = function
(*56*)          Empty -> ()
(*57*)        | Cons(key, data, rest) ->
(*58*)            insert_bucket rest; (* preserve original order of elements *)
(*59*)            let nidx = (hashfun key) mod nsize in
(*60*)            ndata(nidx) <- Cons(key, data, ndata(nidx)) in
(*61*)      for i = 0 to osize - 1 do
(*62*)        insert_bucket odata(i)
(*63*)      done;
(*64*)      tbldata <- ndata;
(*65*)    end

There is our stack overflow: it seems that we’re trying to resize a huge bucket in a non-tail-recursive fashion. Not ideal… These buckets serve two purposes in OCaml’s hash table implementation. One is the usual collision list, where a bucket contains all the elements of keys whose hashes collide; exhausting the stack with hash collisions is very unlikely if the hash function is good. It could be a bug caused by a faulty user-defined hash function, that has trivial collisions in some cases.

The second one is less well-known, occasionally helpful but otherwise irritating: the Hashtbl module implements multiple bindings, that is the association of a single key with several values. If you call Hashtbl.add k v and the key k is already bound in the table, the old binding is not lost, but a new binding to v is added on top of it. Hashtbl.remove k will only remove the newest binding, restoring the previous one. One can access all bindings with Hashtbl.find_all (find only returns the last one), and Hashtbl.iter, Hashtbl.fold and Hashtbl.length will consider all bindings of each key. If you call Hasthbl.add too much on a single key, you may fill its corresponding bucket. If you don’t care about multiple bindings, you should use Hasthbl.replace k v instead, which removes the earlier binding if present. So there may be an issue with the use of Hashtbl.add where Hashtbl.replace should have been preferred; or, if an application really needs a lot of multiple bindings, maybe it should switch to a dedicated data structure such as hashtables of lists of values.

Now, let’s try and see where this might be called. Of course, no code in Coq calls resize, since it is not exposed in hashtbl.mli. The possible suspects are thus Hashtbl.add and Hashtbl.replace. Grepping for these two gives back lots of results, but with some flair, one can guess which ones are worth inspecting. In particular, these two seem to be just in the right folder:

pluginsextractioncommonml:  (Hashtbladd h, Hashtblfind h, fun () -> Hashtblclear h)
pluginsextractioncommonml:let add_visible ks l = Hashtbladd (top_visible ()).content ks l

A quick-and-dirty way to check that is to wrap the function like this:

    (fun k v -> Printfprintf "foo\n"; Hashtbladd h k v)

Indeed, the first one was the culprit. Now, here come the good and the bad news. First, the bad, let’s see that line of code in context:

(* coq8.3pl4pluginsextractioncommonml *)
let mktable autoclean =
  let h = Hashtblcreate 97 in
  if autoclean then register_cleanup (fun () -> Hashtblclear h);
  (Hashtbladd h, Hashtblfind h, fun () -> Hashtblclear h)

This rather generic hashtable-making-function is called in three places in that file, and it is not quite obvious to guess which caller is important, by whom it is called, and where the returned closures are evaluated.

The good news is: we don’t need to know this to fix that bug! Indeed, the bug can be fixed right here, and abstraction can guarantee that the fix is sound! Here is the trick (credits to Gabriel Scherer for noticing this first): since the hash table created is closed over, but never returned by mktable, the hash table may only be handled via the three closures mktable returns. In particular, it only returns handles to Hashtbl.add, Hashtbl.find and Hashtbl.clear. Hashtbl.add is the only function that can fill the buckets, and, the other two functions are oblivious to multiple bindings. In other words, we can replace the call to Hashtable.add by a call to Hashtbl.replace, whose semantics is to replace the binding in place in the bucket rather than stacking a new binding, and we know that this change will be semantically transparent in the entire program!

Had Hashtable.find_all been exposed – or any other function observing multiple bindings – it could have been unsafe to do this: these closures would have been semantically affected by the change, and we would therefore need to check first that the callers of mktable makes no use of these functions.

This fix proved successful, the extraction process went well. As a side track, I tried to understand what caused the outstanding number of calls to Hashtbl.add to happen. It was painful, all I know for now is that it seems to happen in a dry run of Coq’s extraction, where some functions keep track of the context to see what’s in scope at every definition point in a signature:

(* coq8.3pl4pluginsextractionocamlml *)
and pp_module_type params = function
(* [...] *)
    | MTsig (mp, sign) ->
      push_visible mp params;
      let l = map_succeed pp_specif sign in
      pop_visible (); (* this calls Hashtbladd if some conditions hold *)

As to why it blows up, I haven’t figured it out. It seems that the same thing is being popped many times, leading to the construction of the long collision list and the eventual stack overflow. It would be good to fix this too, but that’s not my concern at the moment. I am not even sure how to reproduce this bug: I guess one needs a big file with several Module Types and Modules, but the exact conditions still elude me.

A bug report with an attached one-liner patch has been submitted to the Coq bugtracker. An existing bug report has been updated with a proposed patch in the OCaml bugtracker. This one is more questionable (is it a bug or a feature?), since the fix makes resize a bit slower and only delays problems for as long as it can shrink buckets to a size that doesn’t overflow the stack, since some other sub-functions of the Hashtbl module are also non-tail-recursive.

To conclude, some lessons learned (or rather, recalled) from all this:

Abstraction can help! Encapsulation of data structures behind a restricted interface can make invariants clear to your "customers". I still don’t know why the bug was triggered, yet I have fixed it already with confidence in the fact that it cannot change the "observable" behavior of the affected client code.
Catch-all exception handlers are annoying when they hide away details that end-users might want to know. Furthermore, code that may use exceptions for control-flow should be avoided early in the exception-handling process, unless you backed up the backtrace already.
On big input, you’d rather make your recursive functions tail-recursive. Some of the OCaml documentation makes it quite explicit whether functions are tail-recursive or not, which is a good thing to know.