The story of a stack overflow during Coq extraction
- July 1, 2012
I routinely extract my Coq development into an executable I can test and benchmark. However, on one fateful afternoon, the extraction process failed with a stack overflow. I had to dive deep into the code. This is the fable of how things went right and wrong, moral of the story included.
While I was working on the new version of my alias analysis proof, I got back to a point where I could extract the code, having admitted some proofs for later. I did so to make sure I had the last version of the executable, and ran the usual make command, but that’s when things went wrong. Here is what I saw:
$ make all -j4 [...] coqtop -I lib -I common -I ia32/standard -I ia32 -I backend -I cfrontend -I driver -batch -load-vernac-source extraction/extraction.v [...] Stack overflow.
Now, that is not quite the enlightening error message. I had almost no way of finding out more about this without hacking into Coq. So I fired up my favorite editor and tried to turn this:
(* coq8.3pl4toplevelcerrrorsml *) | Stack_overflow -> hov 0 (str "Stack overflow")
into the hopefully more helpful:
| Stack_overflow -> hov 0 (str (Printexcget_backtrace ()))
I also quickly hacked the Makefile, adding
OCAMLRUNPARAM=b
, -byte
, -debug
and adding verbosity to the Vernacular load command:
COQEXEC=OCAMLRUNPARAM=b coqtop $(INCLUDES) -byte -debug -batch -load-vernac-source-verbose
Re-compiled Coq, re-ran the extraction, to get to the following result:
Raised at file "format.ml", line 194, characters 15-26 Called from file "format.ml", line 420, characters 8-33 Called from file "format.ml", line 435, characters 6-24
Meh. That is not quite interesting. format.ml is OCaml’s pretty-printing module, which unfortunately uses exceptions for its control-flow. It seems that we’ve run in the infamous problem of the overwritten stack trace. Luckily, having added the debug flag resulted in an additional line appearing in the output:
Vernac Interpreter Executing command Raised at file "format.ml", line 194, characters 15-26 [...]
Searching for this led me into:
(* coq8.3pl4toplevelvernacinterpml *) let call (opn,converted_args) = let loc = ref "Looking up command" in try let callback = vinterp_map opn in loc:= "Checking arguments"; let hunk = callback converted_args in loc:= "Executing command"; hunk() with | Drop -> raise Drop | e -> if !Flagsdebug then msgnl (str"Vernac Interpreter " ++ str !loc); raise e
Haha! Looks like my exception went through that catch-all handler and
has been re-raised after printing that line. This might well be the line
that called into format.ml
. Anyway, here’s my chance to
print the stack trace before it gets overwritten:
| e -> Printexcprint_backtrace stdout; if !Flagsdebug then msgnl (str"Vernac Interpreter " ++ str !loc); raise e
Let’s try that:
Raised by primitive operation at file "hashtbl.ml", line 58, characters 10-28 Called from file "hashtbl.ml", line 58, characters 10-28 Called from file "hashtbl.ml", line 58, characters 10-28 Called from file "hashtbl.ml", line 58, characters 10-28 [... around 1000 such lines]
Now this was more interesting! Back into OCaml’s
hashtbl.ml
around this line:
(* ocaml3.12.1stdlibhashtblml *) (*49*) let resize hashfun tbl = (*50*) let odata = tbldata in (*51*) let osize = Arraylength odata in (*52*) let nsize = min (2 * osize + 1) Sysmax_array_length in (*53*) if nsize <> osize then begin (*54*) let ndata = Arraycreate nsize Empty in (*55*) let rec insert_bucket = function (*56*) Empty -> () (*57*) | Cons(key, data, rest) -> (*58*) insert_bucket rest; (* preserve original order of elements *) (*59*) let nidx = (hashfun key) mod nsize in (*60*) ndata(nidx) <- Cons(key, data, ndata(nidx)) in (*61*) for i = 0 to osize - 1 do (*62*) insert_bucket odata(i) (*63*) done; (*64*) tbldata <- ndata; (*65*) end
There is our stack overflow: it seems that we’re trying to resize a huge bucket in a non-tail-recursive fashion. Not ideal… These buckets serve two purposes in OCaml’s hash table implementation. One is the usual collision list, where a bucket contains all the elements of keys whose hashes collide; exhausting the stack with hash collisions is very unlikely if the hash function is good. It could be a bug caused by a faulty user-defined hash function, that has trivial collisions in some cases.
The second one is less well-known, occasionally helpful but otherwise
irritating: the Hashtbl
module implements multiple
bindings, that is the association of a single key with several
values. If you call Hashtbl.add k v
and the key
k
is already bound in the table, the old binding is not
lost, but a new binding to v
is added on top of it.
Hashtbl.remove k
will only remove the newest binding,
restoring the previous one. One can access all bindings with
Hashtbl.find_all
(find
only returns the last
one), and Hashtbl.iter
, Hashtbl.fold
and
Hashtbl.length
will consider all bindings of each key. If
you call Hasthbl.add
too much on a single key, you may fill
its corresponding bucket. If you don’t care about multiple bindings, you
should use Hasthbl.replace k v
instead, which removes the
earlier binding if present. So there may be an issue with the use of
Hashtbl.add
where Hashtbl.replace
should have
been preferred; or, if an application really needs a lot of multiple
bindings, maybe it should switch to a dedicated data structure such as
hashtables of lists of values.
Now, let’s try and see where this might be called. Of course, no code
in Coq calls resize
, since it is not exposed in
hashtbl.mli
. The possible suspects are thus
Hashtbl.add
and Hashtbl.replace
. Grepping for
these two gives back lots of results, but with some flair, one can guess
which ones are worth inspecting. In particular, these two seem to be
just in the right folder:
pluginsextractioncommonml: (Hashtbladd h, Hashtblfind h, fun () -> Hashtblclear h) pluginsextractioncommonml:let add_visible ks l = Hashtbladd (top_visible ()).content ks l
A quick-and-dirty way to check that is to wrap the function like this:
(fun k v -> Printfprintf "foo\n"; Hashtbladd h k v)
Indeed, the first one was the culprit. Now, here come the good and the bad news. First, the bad, let’s see that line of code in context:
(* coq8.3pl4pluginsextractioncommonml *) let mktable autoclean = let h = Hashtblcreate 97 in if autoclean then register_cleanup (fun () -> Hashtblclear h); (Hashtbladd h, Hashtblfind h, fun () -> Hashtblclear h)
This rather generic hashtable-making-function is called in three places in that file, and it is not quite obvious to guess which caller is important, by whom it is called, and where the returned closures are evaluated.
The good news is: we don’t need to know this to fix that bug! Indeed,
the bug can be fixed right here, and abstraction can guarantee that the
fix is sound! Here is the trick (credits to Gabriel Scherer for noticing
this first): since the hash table created is closed over, but never
returned by mktable, the hash table may only be handled via the three
closures mktable returns. In particular, it only returns handles to
Hashtbl.add
, Hashtbl.find
and
Hashtbl.clear
. Hashtbl.add
is the only
function that can fill the buckets, and, the other two functions are
oblivious to multiple bindings. In other words, we can replace the call
to Hashtable.add
by a call to Hashtbl.replace
,
whose semantics is to replace the binding in place in the bucket rather
than stacking a new binding, and we know that this change will be
semantically transparent in the entire program!
Had Hashtable.find_all
been exposed – or any other
function observing multiple bindings – it could have been unsafe to do
this: these closures would have been semantically affected by the
change, and we would therefore need to check first that the callers of
mktable makes no use of these functions.
This fix proved successful, the extraction process went well. As a
side track, I tried to understand what caused the outstanding number of
calls to Hashtbl.add
to happen. It was painful, all I know
for now is that it seems to happen in a dry run of Coq’s extraction,
where some functions keep track of the context to see what’s in scope at
every definition point in a signature:
(* coq8.3pl4pluginsextractionocamlml *) and pp_module_type params = function (* [...] *) | MTsig (mp, sign) -> push_visible mp params; let l = map_succeed pp_specif sign in pop_visible (); (* this calls Hashtbladd if some conditions hold *)
As to why it blows up, I haven’t figured it out. It seems that the
same thing is being popped many times, leading to the construction of
the long collision list and the eventual stack overflow. It would be
good to fix this too, but that’s not my concern at the moment. I am not
even sure how to reproduce this bug: I guess one needs a big file with
several Module Type
s and Module
s, but the
exact conditions still elude me.
A bug report with an attached one-liner patch has been submitted to the Coq bugtracker. An existing bug report has been updated with a proposed patch in the OCaml bugtracker. This one is more questionable (is it a bug or a feature?), since the fix makes resize a bit slower and only delays problems for as long as it can shrink buckets to a size that doesn’t overflow the stack, since some other sub-functions of the Hashtbl module are also non-tail-recursive.
To conclude, some lessons learned (or rather, recalled) from all this:
Abstraction can help! Encapsulation of data structures behind a restricted interface can make invariants clear to your "customers". I still don’t know why the bug was triggered, yet I have fixed it already with confidence in the fact that it cannot change the "observable" behavior of the affected client code.
Catch-all exception handlers are annoying when they hide away details that end-users might want to know. Furthermore, code that may use exceptions for control-flow should be avoided early in the exception-handling process, unless you backed up the backtrace already.
On big input, you’d rather make your recursive functions tail-recursive. Some of the OCaml documentation makes it quite explicit whether functions are tail-recursive or not, which is a good thing to know.