I routinely extract my Coq development into an executable I can test and benchmark. However, on one fateful afternoon, the extraction process failed with a stack overflow. I had to dive deep into the code. This is the fable of how things went right and wrong, moral of the story included.
While I was working on the new version of my alias analysis proof, I got back to a point where I could extract the code, having admitted some proofs for later. I did so to make sure I had the last version of the executable, and ran the usual make command, but that's when things went wrong. Here is what I saw:
$ make all -j4 [...] coqtop -I lib -I common -I ia32/standard -I ia32 -I backend -I cfrontend -I driver -batch -load-vernac-source extraction/extraction.v [...] Stack overflow.
Now, that is not quite the enlightening error message. I had almost no way of finding out more about this without hacking into Coq. So I fired up my favorite editor and tried to turn this:
(* coq8.3pl4toplevelcerrrorsml *) | Stack_overflow -> hov 0 (str "Stack overflow")
into the hopefully more helpful:
| Stack_overflow -> hov 0 (str (Printexcget_backtrace ()))
I also quickly hacked the Makefile, adding
and adding verbosity to the Vernacular load command:
COQEXEC=OCAMLRUNPARAM=b coqtop $(INCLUDES) -byte -debug -batch -load-vernac-source-verbose
Re-compiled Coq, re-ran the extraction, to get to the following result:
Raised at file "format.ml", line 194, characters 15-26 Called from file "format.ml", line 420, characters 8-33 Called from file "format.ml", line 435, characters 6-24
Meh. That is not quite interesting. format.ml is OCaml's pretty-printing module, which unfortunately uses exceptions for its control-flow. It seems that we've run in the infamous problem of the overwritten stack trace. Luckily, having added the debug flag resulted in an additional line appearing in the output:
Vernac Interpreter Executing command Raised at file "format.ml", line 194, characters 15-26 [...]
Searching for this led me into:
(* coq8.3pl4toplevelvernacinterpml *) let call (opn,converted_args) = let loc = ref "Looking up command" in try let callback = vinterp_map opn in loc:= "Checking arguments"; let hunk = callback converted_args in loc:= "Executing command"; hunk() with | Drop -> raise Drop | e -> if !Flagsdebug then msgnl (str"Vernac Interpreter " ++ str !loc); raise e
Haha! Looks like my exception went through that catch-all handler and has been
re-raised after printing that line. This might well be the line that called
format.ml. Anyway, here's my chance to print the stack trace before it
| e -> Printexcprint_backtrace stdout; if !Flagsdebug then msgnl (str"Vernac Interpreter " ++ str !loc); raise e
Let's try that:
Raised by primitive operation at file "hashtbl.ml", line 58, characters 10-28 Called from file "hashtbl.ml", line 58, characters 10-28 Called from file "hashtbl.ml", line 58, characters 10-28 Called from file "hashtbl.ml", line 58, characters 10-28 [... around 1000 such lines]
Now this was more interesting! Back into OCaml's
hashtbl.ml around this line:
(* ocaml3.12.1stdlibhashtblml *) (*49*) let resize hashfun tbl = (*50*) let odata = tbldata in (*51*) let osize = Arraylength odata in (*52*) let nsize = min (2 * osize + 1) Sysmax_array_length in (*53*) if nsize <> osize then begin (*54*) let ndata = Arraycreate nsize Empty in (*55*) let rec insert_bucket = function (*56*) Empty -> () (*57*) | Cons(key, data, rest) -> (*58*) insert_bucket rest; (* preserve original order of elements *) (*59*) let nidx = (hashfun key) mod nsize in (*60*) ndata(nidx) <- Cons(key, data, ndata(nidx)) in (*61*) for i = 0 to osize - 1 do (*62*) insert_bucket odata(i) (*63*) done; (*64*) tbldata <- ndata; (*65*) end
There is our stack overflow: it seems that we're trying to resize a huge bucket in a non-tail-recursive fashion. Not ideal... These buckets serve two purposes in OCaml's hash table implementation. One is the usual collision list, where a bucket contains all the elements of keys whose hashes collide; exhausting the stack with hash collisions is very unlikely if the hash function is good. It could be a bug caused by a faulty user-defined hash function, that has trivial collisions in some cases.
The second one is less well-known, occasionally helpful but otherwise
Hashtbl module implements multiple bindings, that
is the association of a single key with several values. If you call
Hashtbl.add k v and the key
k is already bound in the table, the
old binding is not lost, but a new binding to
v is added on top of
Hashtbl.remove k will only remove the newest binding, restoring
the previous one. One can access all bindings with
find only returns the last one), and
Hashtbl.length will consider all bindings of each key. If you
Hasthbl.add too much on a single key, you may fill its
corresponding bucket. If you don't care about multiple bindings, you
Hasthbl.replace k v instead, which removes the earlier
binding if present. So there may be an issue with the use of
Hashtbl.replace should have been preferred; or,
if an application really needs a lot of multiple bindings, maybe it
should switch to a dedicated data structure such as hashtables of
lists of values.
Now, let's try and see where this might be called. Of course, no code in Coq
resize, since it is not exposed in
hashtbl.mli. The possible suspects
Hashtbl.replace. Grepping for these two gives back
lots of results, but with some flair, one can guess which ones are worth
inspecting. In particular, these two seem to be just in the right folder:
pluginsextractioncommonml: (Hashtbladd h, Hashtblfind h, fun () -> Hashtblclear h) pluginsextractioncommonml:let add_visible ks l = Hashtbladd (top_visible ()).content ks l
A quick-and-dirty way to check that is to wrap the function like this:
(fun k v -> Printfprintf "foo\n"; Hashtbladd h k v)
Indeed, the first one was the culprit. Now, here come the good and the bad news. First, the bad, let's see that line of code in context:
(* coq8.3pl4pluginsextractioncommonml *) let mktable autoclean = let h = Hashtblcreate 97 in if autoclean then register_cleanup (fun () -> Hashtblclear h); (Hashtbladd h, Hashtblfind h, fun () -> Hashtblclear h)
This rather generic hashtable-making-function is called in three places in that file, and it is not quite obvious to guess which caller is important, by whom it is called, and where the returned closures are evaluated.
The good news is: we don't need to know this to fix that bug! Indeed, the bug
can be fixed right here, and abstraction can guarantee that the fix is sound!
Here is the trick (credits to Gabriel Scherer for noticing this first): since
the hash table created is closed over, but never returned by mktable, the hash
table may only be handled via the three closures mktable returns. In
particular, it only returns handles to
Hashtbl.add is the only function that can fill the buckets,
and, the other two functions are oblivious to multiple bindings. In other
words, we can replace the call to
Hashtable.add by a call to
Hashtbl.replace, whose semantics is to replace the binding in place in the
bucket rather than stacking a new binding, and we know that this change will be
semantically transparent in the entire program!
Hashtable.find_all been exposed -- or any other function observing
multiple bindings -- it could have been unsafe to do this: these
closures would have been semantically affected by the change, and we
would therefore need to check first that the callers of mktable makes
no use of these functions.
This fix proved successful, the extraction process went well. As a side track,
I tried to understand what caused the outstanding number of calls to
Hashtbl.add to happen. It was painful, all I know for now is that it seems to
happen in a dry run of Coq's extraction, where some functions keep track of the
context to see what's in scope at every definition point in a signature:
(* coq8.3pl4pluginsextractionocamlml *) and pp_module_type params = function (* [...] *) | MTsig (mp, sign) -> push_visible mp params; let l = map_succeed pp_specif sign in pop_visible (); (* this calls Hashtbladd if some conditions hold *)
As to why it blows up, I haven't figured it out. It seems that the same thing
is being popped many times, leading to the construction of the long collision
list and the eventual stack overflow. It would be good to fix this too, but
that's not my concern at the moment. I am not even sure how to reproduce this
bug: I guess one needs a big file with several
Module Types and
but the exact conditions still elude me.
A bug report with an attached one-liner patch has been submitted to the Coq bugtracker. An existing bug report has been updated with a proposed patch in the OCaml bugtracker. This one is more questionable (is it a bug or a feature?), since the fix makes resize a bit slower and only delays problems for as long as it can shrink buckets to a size that doesn't overflow the stack, since some other sub-functions of the Hashtbl module are also non-tail-recursive.
To conclude, some lessons learned (or rather, recalled) from all this:
Abstraction can help! Encapsulation of data structures behind a restricted interface can make invariants clear to your "customers". I still don't know why the bug was triggered, yet I have fixed it already with confidence in the fact that it cannot change the "observable" behavior of the affected client code.
Catch-all exception handlers are annoying when they hide away details that end-users might want to know. Furthermore, code that may use exceptions for control-flow should be avoided early in the exception-handling process, unless you backed up the backtrace already.
On big input, you'd rather make your recursive functions tail-recursive. Some of the OCaml documentation makes it quite explicit whether functions are tail-recursive or not, which is a good thing to know.